Is this a mathematical argument, or a verbal argument?
Specifically, what eli_sennesh means by a “planning gradient” is that you compare a plan to alternative plans around it, and switch plans in the direction of more reward. If your reward function returns infinity for any possible plan, then you will be indifferent among all plans, and your utility function will not constrain what actions you take at all, and your behavior is ‘unspecified.’
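To make the indifference point concrete, here is a minimal sketch of a "planning gradient" as a local search over plans. Everything in it is an assumption for illustration: plans are integers, `neighbors` returns the adjacent integers, and `shaped_reward` and `wireheaded_reward` are made-up reward functions, not anything eli_sennesh specified.

```python
import math

def local_step(plan, reward, neighbors):
    """Return the best strictly-better neighboring plan, or None if none improves."""
    best, best_r = None, reward(plan)
    for cand in neighbors(plan):
        r = reward(cand)
        if r > best_r:  # strict comparison: ties give no reason to move
            best, best_r = cand, r
    return best

def climb(plan, reward, neighbors, max_steps=100):
    """Follow the planning gradient until no neighboring plan is better."""
    for _ in range(max_steps):
        nxt = local_step(plan, reward, neighbors)
        if nxt is None:
            return plan
        plan = nxt
    return plan

def neighbors(plan):
    # Hypothetical plan space: integers, with adjacent integers as alternatives.
    return [plan - 1, plan + 1]

def shaped_reward(plan):
    # Hypothetical reward with a single peak at plan 7.
    return -(plan - 7) ** 2

def wireheaded_reward(plan):
    # Every plan is "as good as it could possibly be".
    return math.inf

print(climb(0, shaped_reward, neighbors))       # moves toward the peak
print(climb(0, wireheaded_reward, neighbors))   # never moves from the start
print(climb(42, wireheaded_reward, neighbors))  # likewise, wherever it starts
```

With the constant-infinity reward, no comparison ever favors one plan over another, so the search just stays wherever it happened to start. That is the sense in which the reward function no longer constrains behavior.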
I think you’re implicitly assuming that the reward function is housed in some other logic, so that the AI is not infinitely satisfied by every possibility but is infinitely satisfied by continuing to exist, and thus seeks to maximize the amount of time that it exists. But if you’re going to wirehead, why would you leave this potential source of disappointment around, instead of making the entire reward logic just return “everything is as good as it could possibly be”?
We have argued that the reinforcement-learning, goal-seeking and prediction-seeking agents all take advantage of the realistic opportunity to modify their inputs right before receiving them. This behavior is undesirable as the agents no longer maximize their utility with respect to the true (inner) environment but instead become mere survival agents, trying only to avoid those dangerous states where their code could be modified by the environment.
Yes, that’s the basic problem with considering the reward signal to be a feature, to be maximized without reference to causal structure, rather than a variable internal to the world-model.
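A toy sketch of that distinction, under loudly stated assumptions: the two actions, the numbers, and the function names below are all hypothetical, and this is not the formalism from the linked paper. One agent maximizes whatever arrives on its reward channel; the other scores actions by a variable in its model of the world.

```python
# Hypothetical two-action toy world, for illustration only.
ACTIONS = ["work", "delude"]

def true_state(action):
    # Value of the outcome in the actual environment (assumed numbers).
    return {"work": 1.0, "delude": 0.0}[action]

def observed_reward(action):
    # What arrives on the reward channel. "delude" overwrites the channel
    # with a large value regardless of what happened in the world.
    return 10.0 if action == "delude" else true_state(action)

def signal_maximizer():
    # Treats the reward signal as the thing to maximize, ignoring its cause.
    return max(ACTIONS, key=observed_reward)

def model_based_agent():
    # Evaluates actions by the modeled state of the world, not the channel.
    return max(ACTIONS, key=true_state)

print(signal_maximizer())   # picks the action that tampers with the channel
print(model_based_agent())  # picks the action that improves the world state
```

The only difference between the two agents is which quantity they feed to `max`, which is the whole point: tampering looks attractive only when the signal itself is the objective.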
Here’s one mathematical argument for it, based on the assumption that the AI can rewire its reward channel but not the whole reward/planning function: http://www.agroparistech.fr/mmip/maths/laurent_orseau/papers/ring-orseau-AGI-2011-delusion.pdf