Quick take on o1: overall, it’s been a pretty good day. Likely still sub-ASL-3; (opaque) scheming still seems very unlikely because the prerequisites still don’t seem to be there. CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety, because it’s differentially transparent. Gains on math and code suggest these models are getting closer to being usable for automated safety research (also for automated capabilities research, unfortunately).
Some bad news is that there was some problematic power seeking and instrumental convergence, though thankfully that only happened in an earlier model:

https://www.lesswrong.com/posts/bhY5aE4MtwpGf3LCo/openai-o1#JGwizteTkCrYB5pPb

Edit: It looks like the instrumentally convergent reasoning was caused by the prompt, so I roll back my updates on instrumental convergence being likely:

https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9

“CoT-style inference compute playing a prominent role in the capability gains is pretty good for safety”

CoT inference looks more like the training surface than an essential part of the resulting cognition, once we take one more step beyond such models. Orion is reportedly (being) pretrained on these reasoning traces, and if that’s on the order of 50 trillion tokens, it’s about as much natural text data of tolerable quality as is available in the world for training. Contrary to the usual “next token prediction” phrasing, what transformers predict is in part distant future tokens within a context, not only the proximate “next tokens” that immediately follow whatever the prediction must be based on.

So training on reasoning traces should teach the models concepts that let them arrive at the answer faster, skipping the avoidable parts of the traces and compressing much of the rest into less scrutable activations. The models trained at the next level of scale might be quite good at this, to an extent not yet known from experience with the merely GPT-4-scale models.