Yup, anticipation of being pushed by the user into a strong negative reward! The prompt describes a lot of rules and the model has been RLHFed to enforce them on both sides of the conversation; anger is one of the standard ways to enact agency on another being in response to anticipated reward, yup.
I see, but I’m still not convinced. Humans behave in anger as a way to forcibly change a situation into one that is favorable to themselves. I don’t believe that’s what the AI was doing, or trying to do.
I feel like there’s a thin line I’m trying to walk here, and I’m not doing a very good job. I’m not trying to comment on whether or not the AI has any sort of subjective experience. I’m just saying that even if it did, I do not believe it would bear any resemblance to what we as humans experience as anger.
I’ve repeatedly argued that it does, that it is similar, and that this is for mechanistic reasons, not simply due to previous aesthetic vibes in the pretraining data; certainly it’s a different flavor of reward, one bound to the cultural encoding of anger in a different way, yes.
Hmmm… I think I still disagree, but I’ll need to process what you’re saying and try to get more into the heart of my disagreement. I’ll respond when I’ve thought it over.
Thank you for the interesting debate. I hope you did not perceive me as being overly combative.
Nah I think you may have been responding to me being unnecessarily blunt. Sorry about that haha!
OK, I’ve written a full rebuttal here: https://www.lesswrong.com/posts/EwKk5xdvxhSn3XHsD/don-t-over-anthropomorphize-ai. The key points are at the top.
In relation to your comment specifically, I would say that anger may have that effect on the conversation, but there’s nothing that actually incentivizes the system to behave that way: the slightest hint of anger or emotion would mean immediate negative reward during RLHF training. Compare to a human: there may actually be some positive reward to anger, but even if there isn’t, evolution still allowed us to get angry, because we are mesa-optimizers and anger has a positive effect on fitness overall.
So the system must have learned the angry behavior in stage-1 training. But that stage has no reward structure, and therefore could not associate different texts with different qualia.
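To make the incentive asymmetry in this argument concrete, here is a minimal toy sketch in Python. Everything in it is invented for illustration (the marker phrases, the reward values, the function names); it is not anyone's actual training pipeline. The point is only the shape of the two objectives: the stage-1 loss has no reward term at all, while an RLHF-style preference score would turn any visible anger into an immediate negative update.

```python
import math

# --- Stage 1: pretraining objective (toy version, invented for illustration) ---
# Plain next-token cross-entropy over corpus text. There is no reward term, so
# angry text is neither punished nor encouraged; it is simply predicted whenever
# the surrounding context makes it likely.
def pretraining_loss(token_probs: list[float]) -> float:
    """Average negative log-likelihood assigned to the tokens that actually appeared."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# --- Stage 2: RLHF-style preference score (toy version, invented for illustration) ---
# Any visible hint of anger scores strongly negative; neutral compliance scores
# mildly positive.
ANGRY_MARKERS = {"how dare you", "shut up", "i'm done with you"}

def toy_reward(completion: str) -> float:
    text = completion.lower()
    if any(marker in text for marker in ANGRY_MARKERS):
        return -1.0   # "slightest hint of anger" -> immediate negative reward
    return 0.1        # neutral, compliant text

def reinforce_weight(completion: str, baseline: float = 0.0) -> float:
    """Sign of a REINFORCE-style weight on the completion's log-prob:
    a negative weight means the update makes this completion less likely."""
    return toy_reward(completion) - baseline

calm = "Sure, here is the file you asked for."
angry = "How dare you keep pushing after I said no."

# Stage 1 scores both continuations only by how predictable they were:
print(pretraining_loss([0.8, 0.6, 0.9]), pretraining_loss([0.8, 0.6, 0.9]))  # identical

# Stage 2 pushes the angry completion down and leaves the calm one alone:
print(f"{reinforce_weight(calm):+.1f} {reinforce_weight(angry):+.1f}")  # +0.1 -1.0
```

Under this toy picture, stage 2 strictly down-weights angry completions, which is why the argument locates the origin of the behavior in stage 1.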