ryan_greenblatt comments on Inducing Unprompted Misalignment in LLMs

ryan_greenblatt 19 Apr 2024 20:48 UTC
LW: 52 AF: 29
35
AF
I would summarize this result as:

If you train models to say “there is a reason I should insert a vulnerability” and then to insert a code vulnerability, then this model will generalize to doing “bad” behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do “bad” behavior if it is given a plausible excuse in the prompt.

Does this seems like a good summary?

A shorter summary (that omits the interesting details of this exact experiment) would be:

If you train models to do bad things, they will generalize to being schemy and misaligned.

This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more “unprompted” than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.
What links here?
- AI #61: Meta Trouble by Zvi (2 May 2024 18:40 UTC; 29 points)
- ryan_greenblatt 19 Apr 2024 20:52 UTC
  LW: 17 AF: 11
  8
  AF Parent
  To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.
  
  So, I am interested in the question of: ″when some types of “bad behavior” get reinforced, how does this generalize?’.
  - Sam Svenningsen 19 Apr 2024 21:58 UTC
    6 points
    5
    Parent
    Thanks, yes, I think that is a reasonable summary.
    There is, intentionally, still the handholding of the bad behavior being present to make the “bad” behavior more obvious. I try to make those caveats in the post. Sorry if I didn’t make enough, particularly in the intro.
    I still thought the title was appropriate since
    The company preference held regardless, in both fine-tuning and (some) non-finetuning results, which was “unprompted” (i.e. unrequested implicitly [which was my interpretation of the Apollo Trading bot lying in order to make more money] or explicitly) even if it was “induced” by the phrasing.
    The non-coding results, where it tried to protect its interests by being less helpful, are a different “bad” behavior that was also unprompted.
    The aforementioned ‘handholding’ phrasing and other caveats in the post.
    So, I am interested in the question of: ″when some types of “bad behavior” get reinforced, how does this generalize?’.
    I am too. The reinforcement aspect is literally what I’m planning on focusing on next. Thanks for the feedback.