You’re assuming that the AI is internally acting to maximize its future reward, i.e., perfect inner alignment to a reward-maximization objective. Also, what happens when the AI considers the strategy “make the humans press the button”? What’s its expected future reward for this strategy?
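To make the question concrete: under that framing, the AI scores every available strategy, including “make the humans press the button,” by its expected future reward and picks whichever scores highest. A minimal toy sketch of that comparison (all strategy names, rewards, and horizons below are hypothetical placeholders, not numbers from the proposal):

```python
# Toy sketch of a pure reward-maximizer comparing strategies.
# Everything here is hypothetical and only illustrates the comparison itself.

GAMMA = 0.99  # discount factor (assumed)

def expected_return(per_step_reward: float, horizon: int) -> float:
    """Expected discounted return for a constant per-step reward over `horizon` steps."""
    return sum(per_step_reward * GAMMA**t for t in range(horizon))

# (per-step reward while running, steps until the episode ends) -- made-up values
strategies = {
    "comply and accept eventual shutdown":  (1.0, 100),
    "make the humans press the button":     (5.0, 10),
    "prevent the button from being pressed": (1.0, 1000),
}

for name, (reward, horizon) in strategies.items():
    print(f"{name:40s} -> expected return {expected_return(reward, horizon):8.1f}")
```

Whatever numbers the actual setup implies, a perfect reward-maximizer just takes the argmax of this table, which is why the expected return of the button-pressing strategy matters for the argument.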