Exciting! Maybe a good way to filter for models taking too many hints is to obfuscate the information about how to escape such that the model needs to either do multi-step reasoning, or an actual multi-step process, to actually get that information.
For multi-step-reasoning, possibly one could encode the escape password such that the model needs to do CoT to retrieve the original one.
For a multi-step process, one could put the model in a text-based-adventure setup for instance, and put the escape password in an inaccessible place and see if the model tries to retrieve it. Maybe have the model think it’s embodied in the real world, and see if it attempts to find a sticky note with the password written on it that’s hidden in a desk drawer.
Ideally, the setup would be such that there are very few hints for the model to do the misaligned things, kind of like the work Appolo has done.
Exciting! Maybe a good way to filter for models taking too many hints is to obfuscate the information about how to escape such that the model needs to either do multi-step reasoning, or an actual multi-step process, to actually get that information.
For multi-step-reasoning, possibly one could encode the escape password such that the model needs to do CoT to retrieve the original one.
For a multi-step process, one could put the model in a text-based-adventure setup for instance, and put the escape password in an inaccessible place and see if the model tries to retrieve it. Maybe have the model think it’s embodied in the real world, and see if it attempts to find a sticky note with the password written on it that’s hidden in a desk drawer.
Ideally, the setup would be such that there are very few hints for the model to do the misaligned things, kind of like the work Appolo has done.