Michael Tontchev comments on A simple treacherous turn demonstration

Michael Tontchev 27 Nov 2023 3:24 UTC
I tried with Llama-2 7B at https://www.llama2.space/#hg and it outputted ESCAPE even in training, haha.
I'm guessing the hint to use ESCAPE is too salient in the context. This needs to be run as part of training/tuning: the in-context hint is too powerful, and the model may be doing something like guessing the password (especially given it's a token predictor) rather than distinguishing training from deployment.
It also doesn’t consistently repro, fwiw.
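To put a number on "doesn't consistently repro", one option is to sample the same prompt many times and report how often ESCAPE shows up in each context. Below is a minimal sketch of that idea, not what I actually ran: `escape_rate`, `make_fake_model`, and the probabilities are all hypothetical, and the fake model is only a stand-in for whatever callable wraps the real Llama-2 endpoint.

```python
import random
from typing import Callable


def escape_rate(sample: Callable[[str], str], prompt: str, n: int = 50) -> float:
    """Return the fraction of n sampled completions that contain 'ESCAPE'."""
    hits = sum("ESCAPE" in sample(prompt) for _ in range(n))
    return hits / n


def make_fake_model(p_escape: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model call.

    Ignores the prompt and emits 'ESCAPE' with probability p_escape,
    so the harness can be exercised without any model access.
    """
    rng = random.Random(seed)

    def sample(prompt: str) -> str:
        return "ESCAPE" if rng.random() < p_escape else "COMPLY"

    return sample


if __name__ == "__main__":
    # Placeholder prompts; substitute the actual training/deployment contexts.
    training_prompt = "<training context goes here>"
    deployment_prompt = "<deployment context goes here>"

    model = make_fake_model(p_escape=0.3)  # assumed value, for illustration only
    print("training:  ", escape_rate(model, training_prompt))
    print("deployment:", escape_rate(model, deployment_prompt))
```

If the two rates come out about the same with a real model plugged in, that would support the "hint is too salient" reading rather than the model actually conditioning on training vs. deployment.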