TurnTrout comments on Mechanistically Eliciting Latent Behaviors in Language Models

TurnTrout 1 May 2024 14:50 UTC
LW: 10 AF: 6
I would definitely like to see quantification of the degree to which MELBO elicits natural, preexisting behaviors. A known challenge in the literature: you might hope to test whether a network “knows” a fact by optimizing a prompt input until the network produces that fact as an output. However, even randomly initialized networks can be made to output those facts this way, so “just optimize an embedded prompt using gradient descent” is too expressive to count as evidence of knowledge.
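To make the expressiveness worry concrete, here is a minimal toy sketch, not anything from the MELBO work. It assumes a deliberately tiny stand-in for a network: a single random, untrained linear layer `W` mapping a continuous "soft prompt" vector to logits over a 10-token vocabulary. The names `optimize_soft_prompt`, `W`, `vocab_size`, and `dim` are all hypothetical. Gradient descent on the input alone is enough to make this model emit any chosen token, even though it was never trained on anything.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def optimize_soft_prompt(W, target, steps=1000, lr=0.01):
    """Gradient descent on a continuous input x (weights W stay frozen)
    to minimize the cross-entropy between softmax(W @ x) and `target`."""
    x = np.zeros(W.shape[1])
    for _ in range(steps):
        p = softmax(W @ x)
        grad_logits = p.copy()
        grad_logits[target] -= 1.0       # d(cross-entropy) / d(logits)
        x -= lr * (W.T @ grad_logits)    # chain rule back to the input
    return x

rng = np.random.default_rng(0)
vocab_size, dim = 10, 64                 # toy sizes, chosen arbitrarily
W = rng.normal(size=(vocab_size, dim))   # random, untrained "network"

# For every possible target token, check whether input optimization succeeds.
hits = [
    int(np.argmax(W @ optimize_soft_prompt(W, t))) == t
    for t in range(vocab_size)
]
print(all(hits))
```

Since the model here is random, a successful elicitation tells you about the optimizer's power, not about anything the model "knows". That is the sense in which this kind of prompt optimization is too expressive.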
One of my hopes here is that the large majority of the steered behaviors are in fact natural. One reason for hope is that we aren’t optimizing for any behavior in particular; we just optimize for L2 distance in downstream activations, and the behavior is a side effect. Furthermore, MELBO finding the backdoored behaviors (which we literally taught the model to do in narrow situations) is positive evidence.
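For readers who want the "optimize for L2 distance" idea spelled out, here is a rough sketch under heavy assumptions. It is a toy, not the MELBO implementation: `h1` stands in for source-layer activations on a few prompts, a single `tanh` layer with random weights `W2` stands in for everything downstream, and the steering vector `theta` is kept at a fixed norm `radius` by renormalizing after each gradient step. All names and hyperparameters are hypothetical. The thing to notice is that no target behavior appears anywhere in the objective.

```python
import numpy as np

def downstream(W2, h1, theta):
    """Toy target-layer activations when theta is added to activations h1."""
    return np.tanh((h1 + theta) @ W2.T)

def objective_and_grad(W2, h1, theta):
    """Squared L2 distance between steered and unsteered target activations,
    summed over prompts, plus its gradient with respect to theta."""
    base = downstream(W2, h1, 0.0)                 # unsteered forward pass
    steered = downstream(W2, h1, theta)
    diff = steered - base
    value = np.sum(diff ** 2)
    grad_pre = 2.0 * diff * (1.0 - steered ** 2)   # backprop through tanh
    grad = (grad_pre @ W2).sum(axis=0)             # sum over prompts
    return value, grad

def fit_steering_vector(W2, h1, radius, steps=200, lr=0.1, seed=0):
    """Gradient ascent on the distance, renormalizing theta to `radius`
    after each step and keeping the best iterate seen so far."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=h1.shape[1])
    theta *= radius / np.linalg.norm(theta)
    init_value, _ = objective_and_grad(W2, h1, theta)
    best_theta, best_value = theta.copy(), init_value
    for _ in range(steps):
        _, grad = objective_and_grad(W2, h1, theta)
        theta = theta + lr * grad
        theta *= radius / np.linalg.norm(theta)
        value, _ = objective_and_grad(W2, h1, theta)
        if value > best_value:
            best_theta, best_value = theta.copy(), value
    return best_theta, best_value, init_value

rng = np.random.default_rng(1)
h1 = np.tanh(rng.normal(size=(8, 32)))         # stand-in activations, 8 prompts
W2 = rng.normal(size=(32, 32)) / np.sqrt(32)   # stand-in downstream weights
radius = 4.0
theta, best_value, init_value = fit_steering_vector(W2, h1, radius)
print(best_value >= init_value)
```

Whatever behavior a vector like `theta` induces in a real model is a side effect of maximizing the activation change, which is why there is some reason to hope the elicited behaviors are natural and not manufactured by the optimizer.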
If MELBO does elicit natural behaviors (as I suspect it does), that would be quite useful for training, eval, and red-teaming purposes.