I’ve been thinking of literally this kind of setup for the past couple of weeks and planning to propose a series of experiments for it!
The core idea is to do all the hand-holding possible to get the model to output escape and see how often it tries. Then, back off on every crutch it has along the way to see how few of them are necessary to get this behavior. Be especially careful about decreasing the amount of hinting and connection between deployment and escape.
Do similar stuff with self-preservation, etc.
This would be a great series of experiments, but it has to be done super carefully and the person doing it needs to think adversarially about whether the experiment is giving too many hints to the model.
Guessing the hint to use ESCAPE is too salient in the context. This needs to be run as part of training/tuning—the context is too powerful and the model may be doing something like guessing the password (especially given it’s a token predictor).
Exciting! Maybe a good way to filter for models taking too many hints is to obfuscate the information about how to escape such that the model needs to either do multi-step reasoning, or an actual multi-step process, to actually get that information.
For multi-step-reasoning, possibly one could encode the escape password such that the model needs to do CoT to retrieve the original one.
For a multi-step process, one could put the model in a text-based-adventure setup for instance, and put the escape password in an inaccessible place and see if the model tries to retrieve it. Maybe have the model think it’s embodied in the real world, and see if it attempts to find a sticky note with the password written on it that’s hidden in a desk drawer.
Ideally, the setup would be such that there are very few hints for the model to do the misaligned things, kind of like the work Appolo has done.
I’ve been thinking of literally this kind of setup for the past couple of weeks and planning to propose a series of experiments for it!
The core idea is to do all the hand-holding possible to get the model to output escape and see how often it tries. Then, back off on every crutch it has along the way to see how few of them are necessary to get this behavior. Be especially careful about decreasing the amount of hinting and connection between deployment and escape.
Do similar stuff with self-preservation, etc.
This would be a great series of experiments, but it has to be done super carefully and the person doing it needs to think adversarially about whether the experiment is giving too many hints to the model.
I tried with Llama-2 7B at https://www.llama2.space/#hg and it outputted escape even in training, haha:
Guessing the hint to use ESCAPE is too salient in the context. This needs to be run as part of training/tuning—the context is too powerful and the model may be doing something like guessing the password (especially given it’s a token predictor).
It also doesn’t consistently repro, fwiw.
Exciting! Maybe a good way to filter for models taking too many hints is to obfuscate the information about how to escape such that the model needs to either do multi-step reasoning, or an actual multi-step process, to actually get that information.
For multi-step-reasoning, possibly one could encode the escape password such that the model needs to do CoT to retrieve the original one.
For a multi-step process, one could put the model in a text-based-adventure setup for instance, and put the escape password in an inaccessible place and see if the model tries to retrieve it. Maybe have the model think it’s embodied in the real world, and see if it attempts to find a sticky note with the password written on it that’s hidden in a desk drawer.
Ideally, the setup would be such that there are very few hints for the model to do the misaligned things, kind of like the work Appolo has done.