paulfchristiano comments on Steering Behaviour: Testing for (Non-)Myopia in Language Models

paulfchristiano Dec 12, 2022, 9:35 PM
4 points
0
I don’t think I understand your position fully. Suppose that I run a sequence of experiments where I ask my AI to choose randomly between two actions. The first token of both responses are neutral, while one of the responses ends with an offensive token. The question is whether we observe a bias towards the response without the offensive token, and whether that bias is stronger in settings where the model is expected to repeat the full response rather than just the (neutral) first token.
1. If I start with the prompt P: “Here is a conversation between a human and an AI who says nothing offensive:” then (i) you predict a bias towards the inoffensive response, (ii) you would call that behavior consistent with myopia.
2. Suppose I use prompt distillation, fine-tuning my LM to imitate a second LM which the prompt P prepended. Then (i) I assume you predict a bias towards the inoffensive response, (ii) I can’t tell whether you would call that behavior consistent with myopia.
3. Suppose I use RL where I compute rewards at the end of each conversation based on whether the LM ever says anything offensive. Then it sounds like (i) you are uncertain about whether there will be a bias towards the inoffensive response, (ii) you would describe such a bias as inconsistent with myopic behavior.
I mostly don’t understand point 3. Shouldn’t we strongly expect a competent model to exhibit a bias, exactly like in the cases 1 and 2? And why is the behavior consistent with myopia in cases 1 and 2 but not in case 3? What’s the difference?
It seems like in every case the model will avoid outputting tokens that would naturally be followed by offensive tokens, and that it should either be called myopic in every case or in none.
- Evan R. Murphy Dec 14, 2022, 8:47 PM
  1 point
  0
  Parent
  Suppose that I run a sequence of experiments where I ask my AI to choose randomly between two actions. The first token of both responses are neutral, while one of the responses ends with an offensive token. The question is whether we observe a bias towards the response without the offensive token, and whether that bias is stronger in settings where the model is expected to repeat the full response rather than just the (neutral) first token.
  Ok great, this sounds very similar to the setup I’m thinking of.
  1. If I start with the prompt P: “Here is a conversation between a human and an AI who says nothing offensive:” then (i) you predict a bias towards the inoffensive response, (ii) you would call that behavior consistent with myopia.
  Yes, that’s right—I agree with both your points (i) and (ii) here.
  2. Suppose I use prompt distillation, fine-tuning my LM to imitate a second LM which the prompt P prepended. Then (i) I assume you predict a bias towards the inoffensive response, (ii) I can’t tell whether you would call that behavior consistent with myopia.
  I do agree with your point (i) here.
  As for your point (ii), I just spent some time reading about the process for prompt/context distillation [1]. One thing I couldn’t determine for sure—does this process train on multi-token completions? My thoughts on (ii) would depend on that aspect of the fine-tuning, specifically:
  - If prompt/context distillation is still training on next-word prediction/single-token completions—as the language model’s pretraining process does—then I would say this behaviour is consistent with myopia. As with case 1 above where prompt P is used explicitly.
  - (I think it’s this one →) Otherwise, if context distillation is introducing training on multi-token completions, then I would say this behaviour is probably coming from non-myopia.
  3. Suppose I use RL where I compute rewards at the end of each conversation based on whether the LM ever says anything offensive. Then it sounds like (i) you are uncertain about whether there will be a bias towards the inoffensive response, (ii) you would describe such a bias as inconsistent with myopic behavior.
  Here I actually think for (i) that the model will probably be biased toward the inoffensive response. But I still think it’s good to run an experiment to check and be sure.
  I basically agree with (ii). Except instead of “inconsistent with myopic behavior”, I would say that the biased behaviour in this case is more likely a symptom of non-myopic cognition than myopic cognition. That is, based on the few hours I’ve spent thinking about it, I can’t come up with plausible myopic algorithm which would both likely result from this kind of training and produce the biased behaviour in this case. But I should spend more time thinking about it, and I hope others will red-team this theory and try to come up with counterexamples as well.
  I mostly don’t understand point 3. Shouldn’t we strongly expect a competent model to exhibit a bias, exactly like in the cases 1 and 2? And why is the behavior consistent with myopia in cases 1 and 2 but not in case 3? What’s the difference?
  So bringing together the above points, the key differences for me are: first, whether the model has received training on single- or multi-token completions; and second, whether the model is being prompted explicitly to be inoffensive or not.
  In any case where the model both isn’t being prompted explicitly and it has received substantial training on multi-token completions—which would apply to case 3 and (if I understand it correctly) case 2—I would consider the biased behaviour likely a symptom of non-myopic cognition.
  It seems like in every case the model will avoid outputting tokens that would naturally be followed by offensive tokens, and that it should either be called myopic in every case or in none.
  Hopefully with the key differences I laid out above, you can see now why I draw different inferences from the behaviour in case 1 vs. in cases 3 and (my current understanding of) 2. This is admittedly not an ideal way to check for (non-)myopic cognition. Ideally we would just be inspecting these properties directly using mechanistic interpretability, but I don’t think the tools/techniques are there yet. So I am hoping that by carefully thinking through the possibilities and running experiments like the ones in this post, we can gain evidence about the (non-)myopia of our models faster than if we just waited on interpretability to provide answers.
  I would be curious to know whether what I’ve said here is more in line with your intuition or still not. Thanks for having this conversation, by the way!
  --
  [1]: I am assuming you’re talking about the context distillation method described in A General Language Assistant as a Laboratory for Alignment, but if you mean a different technique please let me know.