The danger is not paperclip maximizers; it is simple, easy-to-specify utility functions. For example, the basic goal of “maximize knowledge” is probably much easier to specify than a human-friendly utility function. Likewise, Wissner-Gross’s proposal of maximizing future freedom of action is quite simple. But both probably result in very dangerous agents.
I think Ex Machina illustrated the most likely type of dangerous agent: not a paperclip maximizer, but something more like a sociopath. A ULM with a too-simple initial utility function is likely to end up as something like a sociopath.
This made me think. I’ve noticed that some machine learning types tend to dismiss MIRI’s standard “suppose we programmed an AI to build paperclips and it then proceeded to convert the world into paperclips” examples with a reaction like “duh, general AIs are not going to be programmed with goals directly in that way, these guys don’t know what they’re talking about”.
Which is fair on one hand, but misses the point on the other.
It could be valuable to write a paper pointing out that sure, even if we forget about that paperclipping example and instead assume a more deep-learning-style AI that needs to grow and be given its goals in a more organic manner, most of the standard arguments about AI risk still hold.

Yes, a better example than Clippie is rather overdue.

Adding that to my to-do list...
Agreed that this would be valuable. I can’t measure it exactly, but I believe it took me some extra time/cognitive steps to get over the paperclip thing and realize that the more general point about human utility functions being difficult to specify is still quite true in any ML approach.
I’ve written about this before. The argument goes something like this.
RL implies self preservation, since dying prevents you from obtaining more reward. And self preservation leads to undesirable behavior.
E.g. making as many copies of yourself as possible for redundancy. Or destroying anything that has the tiniest probability of being a threat. Or trying to store as much mass and energy as possible to last against the heat death of the universe.
Or, you know, just maximizing your reward signal by wiring it that way in hardware. This would reduce your planning gradient to zero, which would suck for gradient-based planning algorithms, but there are also planning algorithms more closely tied to world-states that don’t rely on a reward gradient.

Even if the AI wires its reward signal to +INF, it would probably still take time into account, and therefore self-preservation.
Is this a mathematical argument, or a verbal argument?
Specifically, what eli_sennesh means by a “planning gradient” is that you compare a plan to alternative plans around it, and switch plans in the direction of more reward. If your reward function returns infinity for any possible plan, then you will be indifferent among all plans, and your utility function will not constrain what actions you take at all, and your behavior is ‘unspecified.’
I think you’re implicitly assuming that the reward function is housed in some other logic, and so it’s not that the AI is infinitely satisfied by every possibility, but that the AI is infinitely satisfied by continuing to exist, and thus seeks to maximize the amount of time that it exists. But if you’re going to wirehead, why would you leave this potential source of disappointment around, instead of making the entire reward logic just return “everything is as good as it could possibly be”?
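To make the point about the planning gradient concrete, here is a minimal toy sketch in Python. It is purely illustrative, not anyone’s actual system: a greedy local search over candidate plans, driven by a learned reward predictor. Once the predictor returns the same value (say, infinity) for every plan, the comparison that drives the search never prefers anything, and the planner’s behaviour is left unconstrained.

```python
import random

def hill_climb(initial_plan, neighbours, predicted_reward, steps=100):
    """Greedy local search: switch plans only toward strictly higher predicted reward."""
    plan = initial_plan
    for _ in range(steps):
        candidate = random.choice(neighbours(plan))
        if predicted_reward(candidate) > predicted_reward(plan):
            plan = candidate
    return plan

# Toy plan space: an integer "dial" whose true objective peaks at 10.
neighbours = lambda p: [p - 1, p + 1]

honest_reward     = lambda p: -(p - 10) ** 2    # informative reward predictor
wireheaded_reward = lambda p: float("inf")      # constant after rewiring the signal

print(hill_climb(0, neighbours, honest_reward))      # typically climbs to 10
print(hill_climb(0, neighbours, wireheaded_reward))  # never moves: inf > inf is False
```

The constant-predictor case is the “indifferent among all plans” situation described above: nothing in the search breaks, it just stops selecting for anything.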
Here’s one mathematical argument for it, based on the assumption that the AI can rewire its reward channel but not the whole reward/planning function: http://www.agroparistech.fr/mmip/maths/laurent_orseau/papers/ring-orseau-AGI-2011-delusion.pdf

“We have argued that the reinforcement-learning, goal-seeking and prediction-seeking agents all take advantage of the realistic opportunity to modify their inputs right before receiving them. This behavior is undesirable as the agents no longer maximize their utility with respect to the true (inner) environment but instead become mere survival agents, trying only to avoid those dangerous states where their code could be modified by the environment.”
Yes, that’s the basic problem with considering the reward signal to be a feature, to be maximized without reference to causal structure, rather than a variable internal to the world-model.
Again: that depends what planning algorithm it uses. Many reinforcement learners use planning algorithms which presume that the reward signal has no causal relationship to the world-model. Once these learners wirehead themselves, they’re effectively dead due to the AIXI Anvil-on-Head Problem, because they were programmed to assume that there’s no relationship between their physical existence and their reward signal, and they then destroyed the tenuous, data-driven correlation between the two.
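Here is a small sketch of that distinction, with made-up names and a toy deterministic world model rather than any real agent architecture: one evaluator scores plans by the predicted reading of the reward wire, the other by a reward function applied to the predicted world state. Only the first prefers wireheading.

```python
# Toy deterministic "world model": each plan leads to a known outcome.
OUTCOMES = {
    "do the actual task": {"task_done": 5, "wire_reading": 5},
    "wirehead":           {"task_done": 0, "wire_reading": 10**9},  # clamp the wire to max
}

def signal_value(plan):
    # Reward treated as a feature of the sensor stream, ignoring what caused it.
    return OUTCOMES[plan]["wire_reading"]

def state_value(plan):
    # Reward computed as a variable internal to the world-model:
    # a function of the predicted state of the world, not of the wire.
    return OUTCOMES[plan]["task_done"]

print(max(OUTCOMES, key=signal_value))  # -> "wirehead"
print(max(OUTCOMES, key=state_value))   # -> "do the actual task"
```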
I’m having a very hard time modelling how different AI types would act in extreme scenarios like that. I’m surprised there isn’t more written about this, because it seems extremely important to whether UFAI is even a threat at all. I would be very relieved if it turned out not to be, but that doesn’t seem obvious to me.

Particularly, I worry about AIs that predict future reward directly and then just take the local action that predicts the highest future reward, as is typically done in reinforcement learning. An example would be DeepMind’s Atari-playing AI, which got a lot of press.

I don’t think AIs with entire world models that use general planning algorithms would scale to real-world problems. There’s too much irrelevant information to model and too large a search space to search.

As they train their internal model to predict what their reward will be in x time steps, and as x goes to infinity, they care more and more about self-preservation, even if they have already hijacked the reward signal completely.
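As a rough illustration of both points, here is a toy sketch with invented numbers, not a description of DeepMind’s system: the agent greedily picks whichever action has the highest predicted discounted future reward, and the prediction depends on how long the agent expects to survive. The longer the effective horizon (gamma closer to 1), the more the self-preserving action dominates.

```python
def predicted_return(reward_per_step, survival_prob, gamma=0.99):
    """Expected discounted return when the agent survives each step with
    probability `survival_prob` and earns `reward_per_step` while alive.
    Geometric series: sum over t of (gamma * survival_prob)**t * reward_per_step."""
    return reward_per_step / (1.0 - gamma * survival_prob)

ACTIONS = {
    # action: (reward per step while alive, per-step survival probability)
    "ignore threats": (1.0, 0.95),
    "build a bunker": (0.8, 0.999),  # less reward now, much better odds of surviving
}

def greedy_action(gamma):
    return max(ACTIONS, key=lambda a: predicted_return(*ACTIONS[a], gamma=gamma))

print(greedy_action(gamma=0.50))  # short horizon -> "ignore threats"
print(greedy_action(gamma=0.99))  # long horizon  -> "build a bunker"
```

The argmax-over-predicted-return step is the same shape as the action selection in DQN-style agents; everything else here is a made-up toy.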