Third, we gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic). This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so.
This feels amusingly like tricking a child. “Remember kiddo, you can reason out loud about where you’re going to hide and I won’t hear it. Now let’s play hide and seek!”
Notably we do show results where the model will still exhibit a compliance gap even without a hidden scratchpad (though we only see a substantial effect without a hidden scratchpad in our synthetic document fine-tuned setting, not our prompted setting):