Yes. Also, if the Simulators model of LMs after pretraining is right (I think it is), it should also help a lot with safety in general, because the simulator should be quite malleable, even if some of the simulacra might be malign. As long as you can elicit the malign simulacra, you can also apply interp to them or do things in the style of Interpreting the Learning of Deceit for post-training. This could also help a lot with e.g. coup probes and other similar probes for monitoring.