Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus perhaps a marginal effect for RL, with a partial negative result that looks like noise, in Figure 10b).
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
In Figure 18, the honeypot experiment was designed to test whether backdoored models act consistently with deceptive alignment and take the honeypot. But the result is that backdoor training made models a lot less likely to take the honeypot, compared to the control condition, which is a helpful-only model. The control condition was only added at the last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally, page 30 has a post hoc explanation of how the result is not a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
Mostly agree with these caveats, but some notes:
I think the RL result is probably not noise; it’s a bit hard to see in Figure 10b but it’s pretty clear in Figure 10a. It’s also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don’t know exactly why we’re seeing it.
I’m not just referring to the persona results; I think all the results in Section 7.1 point to this. It’s also not 0.6% there; it’s 60%, the axis label is just confusing there.
I don’t think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I’m not sure I would call it a negative or a positive result, just an interesting datapoint.
Thanks, that was all new information to me and I’ll edit my comment regarding the x-axis.