In relation to your comment specifically, I would say that anger may have that effect on the conversation, but there is nothing that actually incentivizes the system to behave that way: the slightest hint of anger or emotion would draw immediate negative reward during RLHF training. Compare with a human: there may actually be some positive reward for anger, but even if there isn't, evolution still allowed us to get angry because we are mesa-optimizers for whom anger has a positive effect overall.
Therefore, any angry behavior the system exhibits must have been learned in stage-1 training. But that stage has no reward structure, and therefore could not have associated different texts with different qualia.
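To make the distinction I'm leaning on concrete, here is a minimal sketch (PyTorch, with a toy reward function I made up purely for illustration, not anyone's actual setup): stage-1 training only sees a next-token cross-entropy loss, with no scalar reward that could single out angry text, whereas RLHF scores whole responses with a preference model, which is where anger would get penalized.

```python
import torch
import torch.nn.functional as F

# Stage-1 (pretraining): the only signal is next-token cross-entropy.
# The loss depends solely on how well the model predicts the corpus;
# there is no scalar "reward" distinguishing angry text from calm text
# beyond how predictable each is.
vocab_size = 8
logits = torch.randn(1, 5, vocab_size)          # toy model outputs for 5 positions
targets = torch.randint(0, vocab_size, (1, 5))  # the actual next tokens from the corpus
pretrain_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# RLHF stage: a reward model assigns a scalar score to the whole response,
# and the policy is pushed toward higher-scoring responses. A preference
# model trained on human ratings would score an angry reply low, so any
# trace of anger is directly penalized here, unlike in stage 1.
def toy_reward_model(response: str) -> float:
    # Hypothetical stand-in for a learned preference model.
    return -1.0 if "angry" in response.lower() else 1.0

print(f"pretraining loss (no reward signal): {pretrain_loss.item():.3f}")
print(f"RLHF reward for calm reply:  {toy_reward_model('Happy to help!')}")
print(f"RLHF reward for angry reply: {toy_reward_model('I am angry at you.')}")
```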
OK, I’ve written a full rebuttal here: https://www.lesswrong.com/posts/EwKk5xdvxhSn3XHsD/don-t-over-anthropomorphize-ai. The key points are at the top.