You’re assuming that the AI is internally acting to maximize its future reward, i.e., perfect inner alignment to a reward-maximization objective. Also, what happens when the AI considers the strategy “make the humans press the button”? What’s its expected future reward for this strategy?
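To make the question concrete: under that framing, the AI scores every available strategy, including “make the humans press the button,” by its expected future reward and picks whichever scores highest. A minimal toy sketch of that comparison (all strategy names, rewards, and horizons below are hypothetical placeholders, not numbers from the proposal):

```python
# Toy sketch of a pure reward-maximizer comparing strategies.
# Everything here is hypothetical and only illustrates the comparison itself.

GAMMA = 0.99  # discount factor (assumed)

def expected_return(per_step_reward: float, horizon: int) -> float:
    """Expected discounted return for a constant per-step reward over `horizon` steps."""
    return sum(per_step_reward * GAMMA**t for t in range(horizon))

# (per-step reward while running, steps until the episode ends) -- made-up values
strategies = {
    "comply and accept eventual shutdown":  (1.0, 100),
    "make the humans press the button":     (5.0, 10),
    "prevent the button from being pressed": (1.0, 1000),
}

for name, (reward, horizon) in strategies.items():
    print(f"{name:40s} -> expected return {expected_return(reward, horizon):8.1f}")
```

Whatever numbers the actual setup implies, a perfect reward-maximizer just takes the argmax of this table, which is why the expected return of the button-pressing strategy matters for the argument.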