From Specification gaming examples in AI:
Roomba: “I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back.”
I guess this counts as real-world?
Bing—manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
To be honest, I don’t understand the link to specification gaming here
Bing—threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
To be honest, I don’t understand the link to specification gaming here
In answer to “It’s totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?”:
No, the report doesn’t directly explore whether power-seeking = existential risk
I wrote the report more in the mode of ‘many arguments for existential risk depend on power-seeking (and also other things). Let’s see what the empirical evidence for power-seeking is like (as it’s one, though not the only, prereq for a class of existential risk arguments’
Basically the report has a reasonably limited scope (but I think it’s still worth gathering the evidence for this more constrained thing)