TurnTrout comments on Password-locked models: a stress case for capabilities evaluation

TurnTrout 14 Aug 2023 22:18 UTC
LW: 5 AF: 4
Off-the-cuff: Possibly you can use activation additions to compute a steering vector, e.g. by averaging the difference in activations between “prompts which contain the password” and “the same prompts but without the password”. You could then add this steering vector into the forward pass at a range of layers and see which injection sites are most effective at unlocking the behavior. This could help localize the “unlock” circuit, which might give us a starting point for studying unlock circuits and (maybe?) information related to deception.
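To make the proposal concrete, here is a minimal sketch of the layer sweep on a toy residual-stream model. Everything in it is an assumption made for illustration: the random `WEIGHTS`, the `READOUT` direction standing in for "how unlocked the behavior is", and especially the modelling of the password as a fixed vector `PASSWORD` added to the prompt embedding (a real password is a token sequence, and a real experiment would hook a transformer's activations). It only shows the shape of the procedure: collect paired activations, average the differences per layer, inject at each layer in turn, and compare.

```python
import math
import random

random.seed(0)
D, N_LAYERS = 8, 4  # toy residual width and depth (hypothetical)

# Hypothetical toy model: each layer adds tanh(W h) to the residual stream h.
WEIGHTS = [[[random.gauss(0, 0.3) for _ in range(D)] for _ in range(D)]
           for _ in range(N_LAYERS)]
READOUT = [random.gauss(0, 1) for _ in range(D)]   # stand-in "unlock" score direction
PASSWORD = [random.gauss(0, 1) for _ in range(D)]  # assumption: password = additive vector


def add(a, b, coeff=1.0):
    """Return a + coeff * b, elementwise."""
    return [x + coeff * y for x, y in zip(a, b)]


def forward(x, inject_layer=None, steering=None):
    """Run the toy model, optionally adding `steering` at the input of one layer.

    Returns (list of residual-stream inputs to each layer, final scalar score).
    """
    h, acts = list(x), []
    for layer, w in enumerate(WEIGHTS):
        if layer == inject_layer:
            h = add(h, steering)
        acts.append(list(h))
        update = [math.tanh(sum(wi * hi for wi, hi in zip(row, h))) for row in w]
        h = add(h, update)
    return acts, sum(r * z for r, z in zip(READOUT, h))


def steering_vectors(prompts):
    """Per layer: mean of (activation with password - activation without)."""
    sums = [[0.0] * D for _ in range(N_LAYERS)]
    for x in prompts:
        acts_pw, _ = forward(add(x, PASSWORD))
        acts_plain, _ = forward(x)
        for layer in range(N_LAYERS):
            diff = add(acts_pw[layer], acts_plain[layer], -1.0)
            sums[layer] = add(sums[layer], diff)
    return [[s / len(prompts) for s in layer_sum] for layer_sum in sums]


def unlock_gap(prompts, vectors, layer):
    """Mean |score with password - score when steered without it|; lower is better."""
    total = 0.0
    for x in prompts:
        _, target = forward(add(x, PASSWORD))
        _, steered = forward(x, inject_layer=layer, steering=vectors[layer])
        total += abs(target - steered)
    return total / len(prompts)


prompts = [[random.gauss(0, 1) for _ in range(D)] for _ in range(16)]
vectors = steering_vectors(prompts)
gaps = {layer: unlock_gap(prompts, vectors, layer) for layer in range(N_LAYERS)}
best_layer = min(gaps, key=gaps.get)

for layer, gap in gaps.items():
    print(f"inject at layer {layer}: mean gap to password run = {gap:.4f}")
print("most effective injection site:", best_layer)
```

In this toy, injecting at layer 0 recovers the password run almost exactly by construction, since the password was defined as an additive input vector. On a real password-locked model that would not hold, and the interesting signal would be how the gap varies across layers. The sketch also injects at one layer per run; adding the vector at several layers at once would be a small change to `forward`.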