So, to the extent that the chain-of-thought helps produce a better answer in the end, we can conclude that the improvement is “basically” due to the actual semantic reasoning which the chain-of-thought apparently implements.
I like the intuition behind this argument, which I don’t remember seeing spelled out anywhere else before.
I wonder how much hope one should derive from the fact that, intuitively, RL seems like it should be relatively slow at building new capabilities from scratch / at significantly changing model internals, so there might be some way to buy some safety by also monitoring internals (both for dangerous capabilities already existent after pretraining, and for potentially new ones [slowly] built through RL fine-tuning). A related passage with an at least somewhat similar intuition, from https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit#How_to_Catch_an_LLM_in_the_Act_of_Repurposing_Deceitful_Behaviors (the post also discusses how one might go about monitoring for dangerous capabilities already existent after pretraining):
If you are concerned about the possibility the model might occasionally instead reinvent deceit from scratch, rather than reusing the copy already available (which would have to be by chance rather than design, since it can’t be deceitful before acquiring deceit), then the obvious approach would be to attempt to devise a second Interpretability (or other) process to watch for that during training, and use it alongside the one outlined here. Fortunately that reinvention is going to be a slower process, since it has to do more work, and it should at first be imperfect, so it ought to be easier to catch in the act before the model gets really good at deceit.