Roomba: “I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back.”
I guess this counts as real-world?
Bing—manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
To be honest, I don’t understand the link to specification gaming here
Bing—threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
To be honest, I don’t understand the link to specification gaming here
Bing cases aren’t clearly specification gaming as we don’t know how the model was trained/rewarded. My guess is that they’re probably just cases of unintended generalization. I wouldn’t really call this “goal misgeneralization”, but perhaps it’s similar.
From Specification gaming examples in AI:
Roomba: “I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back.”
I guess this counts as real-world?
Bing—manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
To be honest, I don’t understand the link to specification gaming here
Bing—threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
To be honest, I don’t understand the link to specification gaming here
Bing cases aren’t clearly specification gaming as we don’t know how the model was trained/rewarded. My guess is that they’re probably just cases of unintended generalization. I wouldn’t really call this “goal misgeneralization”, but perhaps it’s similar.