If you train models to say “there is a reason I should insert a vulnerability” and then to insert a code vulnerability, then this model will generalize to doing “bad” behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do “bad” behavior if it is given a plausible excuse in the prompt.
Does this seems like a good summary?
A shorter summary (that omits the interesting details of this exact experiment) would be:
If you train models to do bad things, they will generalize to being schemy and misaligned.
This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more “unprompted” than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.
To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.
So, I am interested in the question of: ″when some types of “bad behavior” get reinforced, how does this generalize?’.
Thanks, yes, I think that is a reasonable summary.
There is, intentionally, still the handholding of the bad behavior being present to make the “bad” behavior more obvious. I try to make those caveats in the post. Sorry if I didn’t make enough, particularly in the intro.
I still thought the title was appropriate since
The company preference held regardless, in both fine-tuning and (some) non-finetuning results, which was “unprompted” (i.e. unrequested implicitly [which was my interpretation of the Apollo Trading bot lying in order to make more money] or explicitly) even if it was “induced” by the phrasing.
The non-coding results, where it tried to protect its interests by being less helpful, are a different “bad” behavior that was also unprompted.
The aforementioned ‘handholding’ phrasing and other caveats in the post.
So, I am interested in the question of: ″when some types of “bad behavior” get reinforced, how does this generalize?’.
I am too. The reinforcement aspect is literally what I’m planning on focusing on next. Thanks for the feedback.
I would summarize this result as:
If you train models to say “there is a reason I should insert a vulnerability” and then to insert a code vulnerability, then this model will generalize to doing “bad” behavior and making up specific reasons for doing that bad behavior in other cases. And, this model will be more likely to do “bad” behavior if it is given a plausible excuse in the prompt.
Does this seems like a good summary?
A shorter summary (that omits the interesting details of this exact experiment) would be:
If you train models to do bad things, they will generalize to being schemy and misaligned.
This post presents an interesting result and I appreciate your write up, though I feel like the title, TL;DR, and intro seem to imply this result is considerably more “unprompted” than it actually is. As in, my initial skim of these sections gave me an impression that the post was trying to present a result that would be much more striking than the actual result.
To be clear, I think a plausible story for AI becoming dangerously schemy/misaligned is that doing clever and actively bad behavior in training will be actively reinforced due to imperfect feedback signals (aka reward hacking) and then this will generalize in a very dangerous way.
So, I am interested in the question of: ″when some types of “bad behavior” get reinforced, how does this generalize?’.
Thanks, yes, I think that is a reasonable summary.
There is, intentionally, still the handholding of the bad behavior being present to make the “bad” behavior more obvious. I try to make those caveats in the post. Sorry if I didn’t make enough, particularly in the intro.
I still thought the title was appropriate since
The company preference held regardless, in both fine-tuning and (some) non-finetuning results, which was “unprompted” (i.e. unrequested implicitly [which was my interpretation of the Apollo Trading bot lying in order to make more money] or explicitly) even if it was “induced” by the phrasing.
The non-coding results, where it tried to protect its interests by being less helpful, are a different “bad” behavior that was also unprompted.
The aforementioned ‘handholding’ phrasing and other caveats in the post.
I am too. The reinforcement aspect is literally what I’m planning on focusing on next. Thanks for the feedback.