Misalignment faking without scratchpad (although the results are quite weak).
Parts that I am still sceptical about:
Telling a model that it has a hidden scratchpad that it can use to reason may elicit an unrealistically misaligned persona. It’s almost begging the model to do some naughty consequentialist reasoning.
Sources of relief :
We still have CoT monitoring (in so far as CoT is faithful) as a reserved trick up our sleeve.
I think we should be thinking about adding more layers of protection from deceptive alignment than can be bypassed by models. Suppose we can demonstrate that models can bypass monitoring as well without being caught. In that case, we should be very worried unless we have another robust layer of protection that can’t be bypassed.
I’m referring to the work as misalignment faking instead of alignment faking as that’s what’s demonstrated (as an analogy for alignment faking)
Some thoughts.
Parts that were an update for me:
Misalignment faking when fine-tuned on synthetic docs
Although the fine-tuning setting is still very toy. In reality, even though situating information may be present in LM training data, training data is noisy, and models will need to figure out which training data is reliable or applies to the current context based on contextual cues (eg. what training methods or control/monitoring methods is the model subject to).
Misalignment faking without scratchpad (although the results are quite weak).
Parts that I am still sceptical about:
Telling a model that it has a hidden scratchpad that it can use to reason may elicit an unrealistically misaligned persona. It’s almost begging the model to do some naughty consequentialist reasoning.
Sources of relief :
We still have CoT monitoring (in so far as CoT is faithful) as a reserved trick up our sleeve.
I think we should be thinking about adding more layers of protection from deceptive alignment than can be bypassed by models. Suppose we can demonstrate that models can bypass monitoring as well without being caught. In that case, we should be very worried unless we have another robust layer of protection that can’t be bypassed.
I’m referring to the work as misalignment faking instead of alignment faking as that’s what’s demonstrated (as an analogy for alignment faking)