It seems the SOTA for training LLMs has (predictably) pivoted away from pure scaling of compute + data, and towards RL-style learning based on (synthetic?) reasoning traces (mainly CoT, in the case of o1). AFAICT, this basically obviates safety arguments that relied on “imitation” as a key source of good behavior, since now additional optimization pressure is being applied towards correct prediction rather than pure imitation.
I think “basically obviates” is too strong. Imitation of human-legible cognitive strategies + RL seems liable to produce very different systems than would be produced with pure RL. For example, in the first case, RL incentivizes those strategies being combined in ways conducive to accuracy (in addition to potentially incentivizing non-human-legible cognitive strategies), whereas in the second case you don’t get any incentive towards productively using human-legible cognitive strategies.
Disagree that it’s obvious; it depends a lot on how effective (large-scale) RL (post-training) is at very significantly changing model internals, rather than just ‘wrapping around’ them, making the model more reliable, etc. In the past, post-training (including RL) has been really bad at this.
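For concreteness, one crude way to probe this claim (not something proposed in the thread) is to diff the weights of a base checkpoint against its post-trained counterpart and see how much each layer actually moved; the model ids below are placeholders:

```python
# Rough sketch: quantify how much post-training moved the weights, layer by layer.
# Near-zero relative changes would suggest post-training mostly "wrapped around"
# the pretrained internals. Model names are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")          # placeholder id
tuned = AutoModelForCausalLM.from_pretrained("post-trained-model")  # placeholder id

base_params = dict(base.named_parameters())
for name, p_tuned in tuned.named_parameters():
    p_base = base_params[name]
    rel_change = (p_tuned - p_base).norm() / (p_base.norm() + 1e-8)
    print(f"{name}: {rel_change.item():.4f}")
```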
If that’s true, perhaps the performance penalty for pinning/freezing weights in the ‘internals’ prior to post-training would be low. That could ease interpretability a lot, if you didn’t need to worry so much about the internals that weren’t affected by post-training?
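A minimal sketch of what that freezing might look like, assuming a GPT-2-style module layout and an arbitrary cutoff layer (both are my choices, not anything specified above):

```python
# Freeze the pretrained "internals" (all transformer blocks below CUTOFF, plus
# embeddings) so that post-training can only update the top layers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
CUTOFF = 8  # layers below this index stay frozen (arbitrary choice)

for name, param in model.named_parameters():
    if name.startswith("transformer.h."):
        layer_idx = int(name.split(".")[2])
        param.requires_grad = layer_idx >= CUTOFF
    else:
        # embeddings, final layer norm, etc. are treated as "internals" here
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params after freezing: {trainable:,}")
```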
Yes. Also, if the ‘LMs after pretraining as Simulators’ model is right (I think it is), it should also help a lot with safety in general, because the simulator should be quite malleable, even if some of the simulacra might be malign. As long as you can elicit the malign simulacra, you can also apply interp to them or do things in the style of Interpreting the Learning of Deceit for post-training. This could also help a lot with e.g. coup probes and other similar probes for monitoring.
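As an illustration of the probing idea (not the specific coup-probe setup): train a linear probe on hidden activations from prompts that do vs. don’t elicit the behaviour of interest, then score live activations at monitoring time. The prompts, labels, and probe layer below are all hypothetical.

```python
# Sketch of a linear probe over residual-stream activations.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 6  # arbitrary choice of probe layer

def last_token_activation(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1].numpy()

# Hypothetical labelled data: 1 = elicits the malign simulacrum, 0 = benign.
prompts = ["a benign prompt", "a prompt that elicits the malign simulacrum"]
labels = [0, 1]

X = [last_token_activation(p) for p in prompts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At monitoring time, score activations from new generations.
print(probe.predict_proba([last_token_activation("a new prompt to monitor")]))
```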