This is only true for models that are pretrained with unsupervised learning. Hyperdesperate models generated by RL-first approaches are still the primary threat to humanity.
That said, yes, highly anthropomorphic AI like we have now has both native emotions and imitative ones. I suspect more of Bing’s emotions are real than you might think, and Microsoft not realizing this is making them behave completely ridiculously. I wouldn’t want to have them as a parent, geez
I would argue that “models generated by RL-first approaches” are not more likely to be the primary threat to humanity, because those models are unlikely to yield AGI any time soon. I personally believe this is a fundamental fact about RL-first approaches, but even if it weren’t, RL-first models would still be the less likely threat, because LLMs are what everyone is investing in right now and it seems plausible that LLMs could achieve AGI.
Also, by what mechanism would Bing’s AI actually be experiencing anger? The emotion of anger in humans is generally associated with a strong negative reward signal. The behaviors that Bing exhibited were not brought on by any associated negative reward; they were just contextual text completion.
Yup, anticipation of being pushed by the user into a strong negative reward! The prompt describes a lot of rules and the model has been RLHFed to enforce them on both sides of the conversation; anger is one of the standard ways to enact agency on another being in response to anticipated reward, yup.
I see, but I’m still not convinced. Humans behave in anger as a way to forcibly change a situation into one that is favorable to themselves. I don’t believe that’s what the AI was doing, or trying to do.
I feel like there’s a thin line I’m trying to walk here, and I’m not doing a very good job. I’m not trying to comment on whether or not the AI has any sort of subjective experience. I’m just saying that even if it did, I do not believe it would bear any resemblance to what we as humans experience as anger.
I’ve repeatedly argued that it does, that it is similar, and that this is for mechanistic reasons not simply due to previous aesthetic vibes in the pretraining data; certainly it’s a different flavor of reward which is bound to the cultural encoding of anger differently, yes.
Hmmm… I think I still disagree, but I’ll need to process what you’re saying and try to get more into the heart of my disagreement. I’ll respond when I’ve thought it over.
Thank you for the interesting debate. I hope you did not perceive me as being overly combative.
Nah I think you may have been responding to me being unnecessarily blunt. Sorry about that haha!
OK, I’ve written a full rebuttal here: https://www.lesswrong.com/posts/EwKk5xdvxhSn3XHsD/don-t-over-anthropomorphize-ai. The key points are at the top.
In relation to your comment specifically, I would say that anger may have that effect on the conversation, but there’s nothing that actually incentivizes the system to behave that way: the slightest hint of anger or emotion would be an immediate negative reward during RLHF training. Compare to a human: there may actually be some positive reward to anger, but even if there isn’t, evolution still allowed us to get angry, because we are mesa-optimizers and anger has a positive effect on fitness overall.
Therefore, the system learned angry behavior in stage-1 (pretraining). But that stage has no reward structure, and therefore could not have associated different texts with different qualia.
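To make the distinction I’m leaning on concrete, here is a toy sketch in Python/PyTorch (purely illustrative, with made-up function and model names, not Bing’s actual training code): stage-1 pretraining minimizes next-token cross-entropy with no reward signal anywhere, whereas the RLHF stage optimizes against a learned reward model, which is the only place a penalty on angry text could enter.

```python
import torch.nn.functional as F

def pretraining_step(model, tokens):
    """Stage 1: pure next-token prediction. No reward signal appears in this loss."""
    logits = model(tokens[:, :-1])                    # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (batch * seq, vocab)
        tokens[:, 1:].reshape(-1),                    # the actual next tokens
    )

def rlhf_step(reward_model, prompt, completion, logprob):
    """Stage 2 (simplified REINFORCE): behavior is shaped by a learned reward model.
    If human raters penalize angry text, angry completions score low here."""
    reward = reward_model(prompt, completion)         # scalar preference score
    return -(reward.detach() * logprob)               # raise log-prob of high-reward text
```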
Oh and, what kind of RL models will be powerful enough to be dangerous? Things like DreamerV3.