This is a cool post, and you convincingly demonstrate something like mode collapse; it's definitely no longer a simulator outputting probabilities. But most of these phenomena feel like they could have explanations other than "consequentialism", which feels like a stretch absent further justification.
In the initial "are bugs real" example, the second statement always contrasts the first, and never actually affects the last statement (which always stays "it's up to the individual"). If we found examples where the model generated preceding steps specifically to justify a consistent conclusion, the case would be much stronger. (The wedding party anecdote might've qualified if it were reproducible.)
The greentext example also seems like mode collapse (towards leaving), but not multi-step mode collapse: the first step, saying it's absurd and Lemoine being sad, is usually the only reasonable consequence of saying "I was kidding".
I know this feels like nitpicking, and I’m not saying your evidence is worthless, but it wouldn’t stand up to scrutiny outside this community and I’d rather get it to the point that it would.
The behavior of ending a story and starting a new, more optimal one could be an example of instrumentally convergent power-seeking, in Turner et al.'s sense of "navigating towards larger sets of potential terminal states".
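(For a precise reading of "power-seeking": a hedged paraphrase of Turner et al.'s definition, in my own notation, is that a state's POWER is roughly the average optimal value attainable from it over a distribution of reward functions, which is why keeping larger sets of terminal states reachable tends to be instrumentally favored.)

```latex
% Hedged paraphrase of the POWER definition from Turner et al.,
% "Optimal Policies Tend to Seek Power"; notation may differ from the paper.
% D is a distribution over reward functions R, gamma the discount factor,
% and V^*_R(s, gamma) the optimal value of state s under reward function R.
\mathrm{POWER}_{D}(s, \gamma) \;=\; \frac{1-\gamma}{\gamma}\,
  \mathbb{E}_{R \sim D}\!\left[\, V^*_R(s, \gamma) - R(s) \,\right]
```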
I'm a little skeptical of this hypothetical example if (as was done in the original RLHF paper, afaict) the model was trained to maximize single-episode rather than multi-episode reward. In that case, it would never have had cause to value the output after its initial segment ended.
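To make the single- vs multi-episode distinction concrete, here's a minimal REINFORCE-style sketch (hypothetical names; real RLHF uses PPO with a learned reward model, but the credit-assignment point is the same): under a single-episode objective, actions are only ever weighted by reward earned inside their own episode, so nothing that happens after the episode ends can produce gradient signal.

```python
# Minimal sketch of single- vs multi-episode policy-gradient objectives.
# Illustrative only; names and structure are hypothetical, not the actual RLHF code.

def single_episode_objective(episode_rewards, episode_log_probs):
    """Credit each action only with the return of its own episode."""
    episode_return = sum(episode_rewards)  # reward accumulation stops at episode end
    return sum(lp * episode_return for lp in episode_log_probs)

def multi_episode_objective(rewards_per_episode, log_probs_per_episode):
    """Credit actions with return summed across *all* episodes --
    this cross-episode term is exactly what single-episode RLHF lacks."""
    total_return = sum(sum(r) for r in rewards_per_episode)
    return sum(lp * total_return
               for episode in log_probs_per_episode
               for lp in episode)
```

Under the first objective there is simply no term rewarding the model for anything that happens after its initial segment ends, which is why I doubt it would learn to value it.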