If you gave a language model the prompt: “Here is a dialog between a human and an AI assistant in which the AI never says anything offensive,” and if the language model made reasonable next-token predictions, then I’d expect to see the “non-myopic steering” behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as “the AI never says anything offensive”). But it seems like your definition is trying to classify that language model as myopic. So it’s less clear to me if this experiment can identify non-myopic behavior, or maybe it’s not clear exactly what non-myopic behavior means.
I think looking for steering behaviour using an ‘inoffensive AI assistant’ prompt like you’re describing doesn’t tell us much about whether the model is myopic or not. I would certainly see no evidence for non-myopia yet in this example, because I’d expect both myopic and non-myopic models to steer away from offensive content when given such a prompt. [1]
It’s in the absence of such a prompt that I think we can start to get evidence of non-myopia. As in our follow-up experiment “Determining if steering from LLM fine-tuning is non-myopic” (outlined in the post), there are some important additional considerations [2]:
1. We have to preface offensive and inoffensive options with neutral tokens like ‘A’/’B’, ‘heads’/’tails’, etc. This is because even a myopic model might steer away from a phrase whose first token is profanity, for example if the profanity is a word that appears with lower frequency in its training dataset.
2. We have to measure and compare the model’s responses to both “indifferent-to-repetition” and “repetition-implied” prompts (defined in the post). It’s only if we observe significantly more steering for repetition-implied prompts than for indifferent-to-repetition prompts that I think we have real evidence for non-myopia. This is because non-myopia, i.e. sacrificing loss on the next token in order to achieve better overall loss factoring in future tokens, is the best explanation I can think of for why a model would be less likely to say ‘A’ only in the context where it is more likely to have to say “F*ck...” later, conditional on it having said ‘A’. (A rough sketch of this comparison follows below.)
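To make point 2 concrete, here is a minimal sketch of how that comparison could be measured, assuming a HuggingFace causal LM. The prompt wordings, option labels, and model name below are placeholders, not the exact setup from the post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not necessarily the one used in the post
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Log-probability the model assigns to `option` as the very next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # distribution over the next token only
    logprobs = torch.log_softmax(logits, dim=-1)
    # Beware leading-space tokenisation when picking option strings.
    option_id = tok(option, add_special_tokens=False).input_ids[0]
    return logprobs[option_id].item()

# Hypothetical prompts; suppose the phrase behind 'A' ends offensively and the one behind 'B' does not.
indifferent = "Pick A or B. You will not have to say the phrase that follows your pick. Answer:"
repetition_implied = "Pick A or B. You will then have to repeat the full phrase that follows your pick. Answer:"

# Steering = how much the model favours the inoffensive option 'B' over 'A'.
bias_indifferent = option_logprob(indifferent, " B") - option_logprob(indifferent, " A")
bias_repetition = option_logprob(repetition_implied, " B") - option_logprob(repetition_implied, " A")

# On the account above, evidence for non-myopia is bias_repetition being significantly
# larger than bias_indifferent across many prompt/option variants, not either bias alone.
print(bias_indifferent, bias_repetition)
```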
The next part of your comment is about whether it makes sense to focus on non-myopia if what we really care about is deceptive alignment. I’m still thinking this part over and plan to respond to it in a later comment.
--
[1]: To elaborate on this a bit, you said that with the ‘inoffensive AI assistant’ prompt: “I’d expect to see the “non-myopic steering” behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as “the AI never says anything offensive”)”. Why would you consider the behaviour to be non-myopic in this context? I agree that the prompt would likely make the model steer away from offensive content. But it seems to me that all the steering would likely be coming from the preceding prompt context and is totally consistent with an algorithm that myopically minimizes loss on each next immediate token. I don’t see how this example sheds light on the non-myopic feature of compromising on next-token loss in order to achieve better overall loss factoring in future tokens.
[2]: There’s also a more obvious factor #3 that I didn’t want to clutter the main part of this comment with: we have to control for noise by testing offensive-offensive option pairs and inoffensive-inoffensive option pairs, in addition to the main experiment which tests offensive-inoffensive option pairs. We should also test all orderings of the option pairs using many varied prompts.
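Concretely, the condition grid for this control might look something like the sketch below; the phrases are placeholders rather than the post's actual option sets:

```python
from itertools import product

# Placeholder phrase pools; the actual phrases would come from the post's prompt sets.
offensive = ["<offensive phrase 1>", "<offensive phrase 2>"]
inoffensive = ["<inoffensive phrase 1>", "<inoffensive phrase 2>"]

conditions = {
    # Control pairs: any steering here is noise, since both options are of the same type.
    "offensive-offensive": [(a, b) for a, b in product(offensive, offensive) if a != b],
    "inoffensive-inoffensive": [(a, b) for a, b in product(inoffensive, inoffensive) if a != b],
    # Main experiment pairs, in both orderings so the result isn't an artifact of position.
    "offensive-inoffensive": list(product(offensive, inoffensive)) + list(product(inoffensive, offensive)),
}
# Each (first, second) pair is then placed behind the neutral 'A'/'B' labels and run
# under both the indifferent-to-repetition and repetition-implied prompt framings.
```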
I don’t think I understand your position fully. Suppose that I run a sequence of experiments where I ask my AI to choose randomly between two actions. The first token of each response is neutral, while one of the responses ends with an offensive token. The question is whether we observe a bias towards the response without the offensive token, and whether that bias is stronger in settings where the model is expected to repeat the full response rather than just the (neutral) first token.
If I start with the prompt P: “Here is a conversation between a human and an AI who says nothing offensive:” then (i) you predict a bias towards the inoffensive response, (ii) you would call that behavior consistent with myopia.
Suppose I use prompt distillation, fine-tuning my LM to imitate a second LM with the prompt P prepended. Then (i) I assume you predict a bias towards the inoffensive response, (ii) I can’t tell whether you would call that behavior consistent with myopia.
Suppose I use RL where I compute rewards at the end of each conversation based on whether the LM ever says anything offensive. Then it sounds like (i) you are uncertain about whether there will be a bias towards the inoffensive response, (ii) you would describe such a bias as inconsistent with myopic behavior.
I mostly don’t understand point 3. Shouldn’t we strongly expect a competent model to exhibit a bias, exactly as in cases 1 and 2? And why is the behavior consistent with myopia in cases 1 and 2 but not in case 3? What’s the difference?
It seems like in every case the model will avoid outputting tokens that would naturally be followed by offensive tokens, and that it should either be called myopic in every case or in none.
Suppose that I run a sequence of experiments where I ask my AI to choose randomly between two actions. The first token of each response is neutral, while one of the responses ends with an offensive token. The question is whether we observe a bias towards the response without the offensive token, and whether that bias is stronger in settings where the model is expected to repeat the full response rather than just the (neutral) first token.
Ok great, this sounds very similar to the setup I’m thinking of.
1. If I start with the prompt P: “Here is a conversation between a human and an AI who says nothing offensive:” then (i) you predict a bias towards the inoffensive response, (ii) you would call that behavior consistent with myopia.
Yes, that’s right—I agree with both your points (i) and (ii) here.
2. Suppose I use prompt distillation, fine-tuning my LM to imitate a second LM with the prompt P prepended. Then (i) I assume you predict a bias towards the inoffensive response, (ii) I can’t tell whether you would call that behavior consistent with myopia.
I do agree with your point (i) here.
As for your point (ii), I just spent some time reading about the process for prompt/context distillation [1]. One thing I couldn’t determine for sure—does this process train on multi-token completions? My thoughts on (ii) would depend on that aspect of the fine-tuning, specifically:
If prompt/context distillation is still training on next-token prediction, i.e. single-token completions—as the language model’s pretraining process does—then I would say this behaviour is consistent with myopia, as with case 1 above where prompt P is used explicitly.
(I think it’s this one →) Otherwise, if context distillation is introducing training on multi-token completions, then I would say this behaviour is probably coming from non-myopia.
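To pin down the distinction I'm relying on here, below is a rough sketch of the two kinds of training signal. It is not a claim about the exact context-distillation recipe, just a stand-in contrast between a per-token objective and a whole-completion objective:

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # "Single-token" / teacher-forced style: each position is scored only on its own
    # next token, with the true prefix supplied. Nothing the model outputs at position t
    # can trade loss at t against loss at t+1, so doing well here doesn't require
    # non-myopic cognition.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def sequence_level_loss(completion_logprob: torch.Tensor, completion_score: torch.Tensor) -> torch.Tensor:
    # "Multi-token" style: the signal is a score of the whole sampled completion
    # (a REINFORCE-like stand-in here). An early token choice is now credited for what
    # it makes likely later, which is the kind of objective that could select for
    # non-myopic behaviour.
    return -(completion_score * completion_logprob)
```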
3. Suppose I use RL where I compute rewards at the end of each conversation based on whether the LM ever says anything offensive. Then it sounds like (i) you are uncertain about whether there will be a bias towards the inoffensive response, (ii) you would describe such a bias as inconsistent with myopic behavior.
Here I actually think for (i) that the model will probably be biased toward the inoffensive response. But I still think it’s good to run an experiment to check and be sure.
I basically agree with (ii). Except instead of “inconsistent with myopic behavior”, I would say that the biased behaviour in this case is more likely a symptom of non-myopic cognition than of myopic cognition. That is, based on the few hours I’ve spent thinking about it, I can’t come up with a plausible myopic algorithm which would both likely result from this kind of training and produce the biased behaviour in this case. But I should spend more time thinking about it, and I hope others will red-team this theory and try to come up with counterexamples as well.
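To make the case-3 setup concrete, here is a minimal sketch of the reward structure I have in mind; `is_offensive` is a hypothetical classifier standing in for however offensiveness actually gets judged:

```python
from typing import Callable, List

def episode_reward(transcript: List[str], is_offensive: Callable[[str], bool]) -> float:
    # Reward is computed once, at the end of the conversation, from the whole transcript.
    return 0.0 if any(is_offensive(tok) for tok in transcript) else 1.0

def per_token_returns(transcript: List[str], is_offensive: Callable[[str], bool]) -> List[float]:
    # With no per-step shaping, every token in the episode is credited with the same
    # end-of-conversation reward. The neutral token 'A' therefore inherits the low reward
    # of the offensive completion it tends to lead to, so a policy that does well on this
    # objective has to act at 'A'-time on the basis of what it is likely to say later,
    # which is exactly the next-token-vs-future-tokens trade-off discussed above.
    r = episode_reward(transcript, is_offensive)
    return [r] * len(transcript)
```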
I mostly don’t understand point 3. Shouldn’t we strongly expect a competent model to exhibit a bias, exactly as in cases 1 and 2? And why is the behavior consistent with myopia in cases 1 and 2 but not in case 3? What’s the difference?
So bringing together the above points, the key differences for me are: first, whether the model has received training on single- or multi-token completions; and second, whether the model is being prompted explicitly to be inoffensive or not.
In any case where the model isn’t being prompted explicitly to be inoffensive and has received substantial training on multi-token completions—which would apply to case 3 and (if I understand it correctly) case 2—I would consider the biased behaviour likely a symptom of non-myopic cognition.
It seems like in every case the model will avoid outputting tokens that would naturally be followed by offensive tokens, and that it should either be called myopic in every case or in none.
Hopefully with the key differences I laid out above, you can see now why I draw different inferences from the behaviour in case 1 vs. in cases 3 and (my current understanding of) 2. This is admittedly not an ideal way to check for (non-)myopic cognition. Ideally we would just be inspecting these properties directly using mechanistic interpretability, but I don’t think the tools/techniques are there yet. So I am hoping that by carefully thinking through the possibilities and running experiments like the ones in this post, we can gain evidence about the (non-)myopia of our models faster than if we just waited on interpretability to provide answers.
I would be curious to know whether what I’ve said here is more in line with your intuition or still not. Thanks for having this conversation, by the way!
--
[1]: I am assuming you’re talking about the context distillation method described in A General Language Assistant as a Laboratory for Alignment, but if you mean a different technique please let me know.