Some bad news: there was some problematic power-seeking and instrumental convergence, though thankfully it only appeared in an earlier model:
https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1#JGwizteTkCrYB5pPb
Edit: It looks like the instrumentally convergent reasoning was caused by the prompt, so I'm rolling back my update toward instrumental convergence being likely:
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9