If that’s true, perhaps the performance penalty for pinning/freezing the weights of the ‘internals’ prior to post-training would be low. That could ease interpretability a lot, if you didn’t need to worry so much about those internals that weren’t affected by post-training?
Yes. Also, if the Simulators model of LMs after pretraining is right (I think it is), it should also help a lot with safety in general, because the simulator should be quite malleable, even if some of the simulacra might be malign. As long as you can elicit the malign simulacra, you can also apply interp to them or do things in the style of Interpreting the Learning of Deceit for post-training. This could also help a lot with e.g. coup probes and other similar probes for monitoring.
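For concreteness, here is a minimal sketch of what freezing the ‘internals’ before post-training could look like in practice, assuming PyTorch and Hugging Face transformers; the model, the layer cutoff, and the optimizer settings are illustrative placeholders, not anything specified in the discussion above.

```python
# Sketch: freeze the lower "internal" layers of a pretrained LM before post-training,
# so only the upper layers receive gradient updates. Cutoff index is hypothetical.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in pretrained model

FREEZE_UP_TO = 8  # hypothetical cutoff: freeze embeddings + first 8 transformer blocks

# Freeze token and position embeddings.
for p in model.transformer.wte.parameters():
    p.requires_grad = False
for p in model.transformer.wpe.parameters():
    p.requires_grad = False

# Freeze the lower blocks (the "internals"); leave the rest trainable.
for i, block in enumerate(model.transformer.h):
    if i < FREEZE_UP_TO:
        for p in block.parameters():
            p.requires_grad = False

# Hand only the unfrozen parameters to the optimizer used for post-training.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

One nice side effect of this setup is that any probes (e.g. coup-style monitoring probes) trained on activations from the frozen layers would not need retraining after post-training, since those activations are unchanged by construction.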