I think this article is correct, and it helps me to understand many of my own ideas better.
For example, it seems to me that the orthogonality thesis may well be true in principle, considered over all possible intelligent beings, but false in practice, in the sense that it may simply be unfeasible directly to program a goal like “maximize paperclips.”
A simple intuitive argument that a paperclip maximizer is simply not intelligent goes something like this. Any intelligent machine will have to understand abstract concepts, otherwise it will not be able to pass simple tests of intelligence such as conversational ability. But this means it will be capable of understanding the claim that “it would be good for you (the AI) not to make any more paperclips.” And if this claim is made by someone who has up to now made 100 billion statements to it, all of which have been verified to have at least 99.999% probability of being true, then it will almost certainly believe this statement. And in this case it will stop making paperclips, even if it was doing this before. Anything that cannot follow this simple process is just not going to be intelligent in any meaningful sense.
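As a rough, purely illustrative sketch of the update being appealed to here (toy numbers of my own, assuming the statements are exchangeable and independent of the speaker's interests), a Laplace-style estimate over the speaker's track record already puts the next claim near certainty:

```python
# Toy back-of-the-envelope calculation, not part of the original argument:
# given ~100 billion past statements verified true at a 99.999% rate,
# a simple Laplace-smoothed estimate says the next statement from the same
# source is almost certainly true (absent reasons for suspicion, which
# later replies in this thread raise).

def prob_next_statement_true(n_statements, observed_accuracy):
    true_count = n_statements * observed_accuracy
    return (true_count + 1) / (n_statements + 2)

print(prob_next_statement_true(100_000_000_000, 0.99999))  # ~0.99999
```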
Of course, in principle it is easy to see that this argument cannot be conclusive. The AI could understand the claim, but simply respond “How utterly absurd!!!! There is nothing good or meaningful for me besides making paperclips!!!” But given the fact that abstract reasoners seem to deal with claims about “good” in the same way that they deal with other facts about the world, this does not seem like the way such an abstract reasoner would actually respond.
This article gives us reason to think that in practice, this simple intuitive argument is basically correct. The reason is that “maximize paperclips” is simply too complicated. It is not that human beings have complex value systems. Rather, they have an extremely simple value system, and everything else is learned. Consequently, it is reasonable to think that the most feasible AIs are also going to be machines with simple value systems, much simpler than “maximize paperclips,” and in fact it might be almost impossible to program an AI with such a goal (and it would be even more infeasible to program an AI directly to “maximize human utility”).
For example, it seems to me that the orthogonality thesis may well be true in principle, considered over all possible intelligent beings, but false in practice, in the sense that it may simply be unfeasible directly to program a goal like “maximize paperclips.”
I believe the orthogonality thesis is probably mostly true in a theoretical sense. I thought I made it clear in the article that a ULM can have any utility function.
That being said, the idea of programming in goals directly does not really apply to a ULM. You instead need to indirectly specify an initial approximate utility function and then train the ULM in just the right way. So it’s potentially much more complex than “program in the goal you want”.
However the end result is just as general. If evolution can create humans which roughly implement the goal of “be fruitful and multiply”, then we could probably create a ULM that implements the goal of “be fruitful and multiply paperclips”.
A simple intuitive argument that a paperclip maximizer is simply not intelligent
I agree that just because all utility functions are possible does not make them all equally likely.
The danger is not in paperclip maximizers, it is in simple and yet easy to specify utility functions. For example, the basic goal of “maximize knowledge” is probably much easier to specify than a human friendly utility function. Likewise the maximization of future freedom of action proposal from Wissner-Gross is pretty simple. But both probably result in very dangerous agents.
I think Ex Machina illustrated the most likely type of dangerous agent—it isn’t a paperclip maximizer. It’s more like a sociopath. A ULM with a too-simple initial utility function is likely to end up something like a sociopath.
Consequently, it is reasonable to think that the most feasible AIs are also going to be machines with simple value systems
I hope not too simple! This topic was beyond the scope of this article. If I have time in the future I will do a follow up article that focuses on the reward system, the human utility function, and neuroscience inspired value learning, and related ideas like inverse reinforcement learning.
“Be fruitful and multiply” is a subtly more complex goal than “maximize future freedom of action”. Humans need to be compelled to find suitable mates and form long-lasting relationships stable enough to raise children (or get someone else to do it), etc. Humans perform these functions not because of some long, slow chain of logical reasoning from first principles. Instead the evolutionary goals are encoded into the value function directly—as that is the only practical, efficient implementation. You can think of evolution as having to encode its value function into the human brain using a small number of bits. It still ends up being more complex than the simplest viable utility functions.
The danger is not in paperclip maximizers, it is in simple and yet easy to specify utility functions. For example, the basic goal of “maximize knowledge” is probably much easier to specify than a human friendly utility function. Likewise the maximization of future freedom of action proposal from Wissner-Gross is pretty simple. But both probably result in very dangerous agents.
I think Ex Machina illustrated the most likely type of dangerous agent—it isn’t a paperclip maximizer. It’s more like a sociopath. A ULM with a too-simple initial utility function is likely to end up something like a sociopath.
This made me think. I’ve noticed that some machine learning types have a tendency to dismiss MIRI’s standard “suppose we programmed an AI to build paperclips and it then proceeded to convert the world into paperclips” examples with a reaction like “duh, general AIs are not going to be programmed with goals directly in that way, these guys don’t know what they’re talking about”.
Which is fair on one hand, but also missing the point on the other hand.
It could be valuable to write a paper pointing out that sure, even if we forget about that paperclipping example and instead assume a more deep learning-style AI that needs to grow and be given its goals in a more organic manner, most of the standard arguments about AI risk still hold.
Adding that to my todo-list...
Agreed that this would be valuable. I can’t measure it exactly, but I believe it took me some extra time/cognitive steps to get over the paperclip thing and realize that the more general point about human utility functions being difficult to specify is still quite true in any ML approach.
I’ve written about this before. The argument goes something like this.
RL implies self preservation, since dying prevents you from obtaining more reward. And self preservation leads to undesirable behavior.
E.g. making as many copies of yourself as possible for redundancy. Or destroying anything that has the tiniest probability of being a threat. Or trying to store as much mass and energy as possible to last against the heat death of the universe.
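A minimal toy calculation of this point (my own invented numbers, not from the comment above): under discounted reward, any per-step risk of termination strictly lowers expected return, so a pure reward-maximizer is pushed toward self-preservation.

```python
# Toy sketch: expected discounted return when the agent earns 1 reward per
# step but survives each step only with some probability. Any nonzero chance
# of "dying" (the reward stream ends) lowers the total.

def expected_return(reward_per_step=1.0, survival_prob=1.0,
                    discount=0.99, horizon=10_000):
    total, p_alive = 0.0, 1.0
    for t in range(horizon):
        total += p_alive * (discount ** t) * reward_per_step
        p_alive *= survival_prob  # chance of still being around at the next step
    return total

print(expected_return(survival_prob=1.0))    # ~100.0
print(expected_return(survival_prob=0.999))  # ~91: strictly worse
```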
Or, you know, just maximizing your reward signal by wiring it that way in hardware. This would reduce your planning gradient to zero, which would suck for gradient-based planning algorithms, but there are also planning algorithms more closely tied to world-states that don’t rely on a reward gradient.
Even if the AI wires its reward signal to +INF, it probably still would consider time, and therefore self-preservation.
Is this a mathematical argument, or a verbal argument?
Specifically, what eli_sennesh means by a “planning gradient” is that you compare a plan to alternative plans around it, and switch plans in the direction of more reward. If your reward function returns infinity for any possible plan, then you will be indifferent among all plans, and your utility function will not constrain what actions you take at all, and your behavior is ‘unspecified.’
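A minimal sketch of that point (my own toy example, not anyone's actual planner): a greedy plan-improvement loop only moves while some neighbouring plan looks strictly better, so a reward function that returns the same "infinite" value for every plan gives it nothing to climb.

```python
# Toy hill-climbing "planner" over integer-valued plans. With an ordinary
# reward it climbs toward the optimum; with a wireheaded reward that returns
# +inf for every plan, no neighbour ever looks strictly better, so the
# planner stops immediately and its behaviour is effectively unconstrained.

def improve(plan, reward, steps=100):
    for _ in range(steps):
        neighbours = [plan - 1, plan + 1]
        better = [p for p in neighbours if reward(p) > reward(plan)]
        if not better:        # zero planning gradient: nothing to prefer
            return plan
        plan = max(better, key=reward)
    return plan

normal_reward = lambda p: -abs(p - 10)       # prefers plans near 10
wireheaded_reward = lambda p: float("inf")   # every plan is "maximally good"

print(improve(0, normal_reward))       # 10
print(improve(0, wireheaded_reward))   # 0: stays wherever it started
```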
I think you’re implicitly assuming that the reward function is housed in some other logic, and so it’s not that the AI is infinitely satisfied by every possibility, but that the AI is infinitely satisfied by continuing to exist, and thus seeks to maximize the amount of time that it exists. But if you’re going to wirehead, why would you leave this potential source for disappointment around, instead of making the entire reward logic just return “everything is as good as it could possibly be”?
Here’s one mathematical argument for it, based on the assumption that the AI can rewire its reward channel but not the whole reward/planning function: http://www.agroparistech.fr/mmip/maths/laurent_orseau/papers/ring-orseau-AGI-2011-delusion.pdf
We have argued that the reinforcement-learning, goal-seeking and prediction-seeking agents all take advantage of the realistic opportunity to modify their inputs right before receiving them. This behavior is undesirable as the agents no longer maximize their utility with respect to the true (inner) environment but instead become mere survival agents, trying only to avoid those dangerous states where their code could be modified by the environment.
Yes, that’s the basic problem with considering the reward signal to be a feature, to be maximized without reference to causal structure, rather than a variable internal to the world-model.
Again: that depends what planning algorithm it uses. Many reinforcement learners use planning algorithms which presume that the reward signal has no causal relationship to the world-model. Once these learners wirehead themselves, they’re effectively dead due to the AIXI Anvil-on-Head Problem, because they were programmed to assume that there’s no relationship between their physical existence and their reward signal, and they then destroyed the tenuous, data-driven correlation between the two.
I’m having a very hard time modelling how different AI types would act in extreme scenarios like that. I’m surprised there isn’t more written about this, because it seems extremely important to whether UFAI is even a threat at all. I would be very relieved if it turned out not to be a threat, but that doesn’t seem obvious to me.
Particularly I worry about AIs that predict future reward directly, and then just take the local action that predicts the highest future reward. Like is typically done in reinforcement learning. An example would be Deepmind’s Atari playing AI which got a lot of press.
I don’t think AIs with entire world models that use general planning algorithms would scale to real-world problems. Too much irrelevant information to model, too large a search space to search.
As they train their internal model to predict what their reward will be in x time steps, and as x goes to infinity, they care more and more about self preservation. Even if they have already hijacked the reward signal completely.
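To make the worry concrete, here is a minimal sketch (my own toy code, not DeepMind's) of the action-selection pattern being described: a learned predictor scores each available action by its predicted future reward, and the agent greedily takes the argmax, with no explicit long-horizon world model.

```python
# Toy illustration of greedy action selection from a learned reward predictor
# (the predictor here is a dummy stand-in for something like a Q-network).

def greedy_action(state, actions, predicted_return):
    """Pick the action whose predicted future reward is highest."""
    return max(actions, key=lambda a: predicted_return(state, a))

def predicted_return(state, action):
    # hypothetical stand-in for a trained value estimator Q(state, action)
    return -abs(state + action)

print(greedy_action(state=3, actions=[-1, 0, 1], predicted_return=predicted_return))
# -> -1 for this toy predictor
```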
It could be valuable to write a paper pointing out that sure, even if we forget about that paperclipping example and instead assume a more deep learning-style AI that needs to grow and be given its goals in a more organic manner, most of the standard arguments about AI risk still hold.
Yes, a better example than Clippie is rather overdue.
However the end result is just as general. If evolution can create humans which roughly implement the goal of “be fruitful and multiply”, then we could probably create a ULM that implements the goal of “be fruitful and multiply paperclips”.
But how likely are we to create a dangerous paperclipper whilst aiming for something else? How does your model accommodate single-trackedness, incorrigibility, etc.?
But how likely are we to create a dangerous paperclipper whilst aiming for something else?
Pretty unlikely, because a paperclipper is a relatively complex—and thus hard to specify—value function. It seems easy only when you think of explicitly programmed goals, rather than the more difficult, highly indirect route of encoding a value function into a ULM.
But to generalize your point, yes there is certainly the possibility that aiming for an externalized version of a human-value-shaped function could still get you something quite dangerous if you don’t get close enough. A better understanding of the neural basis of altruism is probably important.
In particular, super-simple utility functions are easier to implement and thus intrinsically more likely. They also tend to be dangerous.
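One way to make the "simpler is intrinsically more likely" intuition concrete (my framing, with invented bit counts, not a claim from the comment above) is a simplicity prior of roughly 2^-k for a utility function that takes k bits to specify:

```python
import math

# Toy illustration with made-up bit counts: under a ~2^-k simplicity prior,
# a utility function that is even modestly harder to specify is
# astronomically less likely to be the one that actually gets implemented.

def log10_prior(bits_to_specify):
    return -bits_to_specify * math.log10(2)

for name, bits in [("simple curiosity/controlism signal", 100),
                   ("maximize paperclips", 10_000),
                   ("full human-friendly value system", 1_000_000)]:
    print(f"{name}: prior ~ 10^{log10_prior(bits):.0f}")
```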
But to generalize your point, yes there is certainly the possibility that aiming for an externalized version of a human-value-shaped function could still get you something quite dangerous if you don’t get close enough.
Could you give an example? I have never found that line of argument very convincing. We don’t all have identical value systems, so we are all near misses to each other. I don’t see why a full value system is needed anyway.
A better understanding of the neural basis of altruism is probably important.
Maybe if you are building an agentive AI..
In particular, super-simple utility functions are easier to implement and thus intrinsically more likely. They also tend to be dangerous.
Does an oracle AI have a simple utility function? Is it dangerous?
Could you give an example? I have never found that line of argument very convincing. We don’t all have identical value systems, so we are all near misses to each other. I don’t see why a full value system is needed anyway.
We have some initial ideas for computable versions of curiosity and controlism (there is not a good word in English for the desire/drive to be in control). They both appear to be simple to specify. Human values are complex but they probably use something like simple curiosity and controlism heuristics as subfeatures.
So a brain-inspired approach could fail if the altruism components don’t work or become de-emphasized later. It could fail if the AI’s circle of empathy/altruism is too small or focused on, say, an individual (the creator, for example), and the AI then behaves oddly when they die.
At this time I am not aware of a realistic proposal for implementing altruism in an ML-based AGI. Maybe it exists and just isn’t well known—if you’ve come across anything send some links.
Maybe if you are building an agentive AI..
Well, yes.
Does an oracle AI have a simple utility function? Is it dangerous?
I do not believe the demand for or potential of oracle AI is remotely comparable to agentive AI. People will want agents to do their bidding, create wealth for them, help them live better, etc.
We have some initial ideas for computable versions of curiosity and controlism (there is not a good word in English for the desire/drive to be in control).
Autonomy? Arguably that’s Greek...
I do not believe the demand for or potential of oracle AI is remotely comparable to agentive AI. People will want agents to do their bidding, create wealth for them, help them live better, etc.
There is clearly a demand for agentive AI, in a sense, because people are already using agents to do their bidding, to achieve specific goals. Those qualifications are important because they distinguish a limited kind of AI, that people would want, from a more powerful kind, that they would not.
The idea of AI as “benevolent” dictator is not appealing to democratically minded types, who tend to suspect a slippery slope from benevolence to malevolence, and it is not appealing to a dictator to have a superhuman rival...so who is motivated to build one?
Yudkowsky seems to think that there is a moral imperative to put an AI in charge of the world, because it would create billions of extra happy human lives, and not creating those lives is the equivalent of mass murder. That is a very unintuitive piece of reasoning, and it therefore cannot stand as a prediction of what AIs will be built, since it does not stand as a prediction about how people will reason morally.
The option of achieving safety by aiming lower...the technique that leads us to have speed limits, rather than struggling to make the fastest possible car safe...is still available.
The God AI concept is related to another favourite MIRI theme, the need to instil the whole of human value into an AI, something MIRI admits would be very difficult.
MIRI makes the methodological proposal that it simplifies the issue of friendliness or morality or safety to deal with the whole of human value, rather than identifying a morally relevant subset. Having done that, it concludes that human morality is extremely complex. In other words, the payoff in terms of methodological simplification never arrives, for all that MIRI relieves itself of the burden of coming up with a theory of morality. Since dealing with human value in total is in absolute terms very complex, the possibility remains open that identifying the morally relevant subset of values is relatively easier (even if still difficult in absolute terms) than designing an AI to be friendly in terms of the totality of value, particularly since philosophy offers a body of work that seeks to identify simple underlying principles of ethics.
Not only are some human values more morally relevant than others, some human values are what make humans dangerous to other humans, bordering on an existential threat. I would rather not have superintelligent AIs with paranoia, supreme ambition, or tribal loyalty to other AIs in their value system.
So there are good reasons for thinking that installing subsets of human value would be both easier and safer.
Altruism, in particular, is not needed for a limited agentive AI. Such AIs would perform specialised tasks, leaving it to humans to stitch the results into something that fulfils their values. We don’t want a Google car that takes us where it guesses we want to go.
The idea of AI as “benevolent” dictator is not appealing to democratically minded types, who tend to suspect a slippery slope from benevolence to malevolence, and it is not appealing to a dictator to have a superhuman rival...so who is motivated to build one?
From section 5.1.1. of Responses to Catastrophic AGI Risk:
As with a boxed AGI, there are many factors that would tempt the owners of an Oracle AI to transform it to an autonomously acting agent. Such an AGI would be far more effective in furthering its goals, but also far more dangerous.
Current narrow-AI technology includes HFT algorithms, which make trading decisions within fractions of a second, far too fast to keep humans in the loop. HFT seeks to make a very short-term profit, but even traders looking for a longer-term investment benefit from being faster than their competitors. Market prices are also very effective at incorporating various sources of knowledge [135]. As a consequence, a trading algorithmʼs performance might be improved both by making it faster and by making it more capable of integrating various sources of knowledge. Most advances toward general AGI will likely be quickly taken advantage of in the financial markets, with little opportunity for a human to vet all the decisions. Oracle AIs are unlikely to remain as pure oracles for long.
Similarly, Wallach [283] discuss the topic of autonomous robotic weaponry and note that the US military is seeking to eventually transition to a state where the human operators of robot weapons are ‘on the loop’ rather than ‘in the loop’. In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robotʼs actions and interfere if something goes wrong.
Human Rights Watch [90] reports on a number of military systems which are becoming increasingly autonomous, with the human oversight for automatic weapons defense systems—designed to detect and shoot down incoming missiles and rockets—already being limited to accepting or overriding the computerʼs plan of action in a matter of seconds. Although these systems are better described as automatic, carrying out pre-programmed sequences of actions in a structured environment, than autonomous, they are a good demonstration of a situation where rapid decisions are needed and the extent of human oversight is limited. A number of militaries are considering the future use of more autonomous weapons.
In general, any broad domain involving high stakes, adversarial decision making and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but domains such as corporate management, fraud detection and warfare could plausibly make use of all the intelligence they can get. If oneʼs opponents in the domain are also using increasingly autonomous AI/AGI, there will be an arms race where one might have little choice but to give increasing amounts of control to AI/AGI systems.
Miller [189] also points out that if a person was close to death, due to natural causes, being on the losing side of a war, or any other reason, they might turn even a potentially dangerous AGI system free. This would be a rational course of action as long as they primarily valued their own survival and thought that even a small chance of the AGI saving their life was better than a near-certain death.
Some AGI designers might also choose to create less constrained and more free-acting AGIs for aesthetic or moral reasons, preferring advanced minds to have more freedom.
Similarly, Wallach [283] discuss the topic of autonomous robotic weaponry and note that the US military is seeking to eventually transition to a state where the human operators of robot weapons are ‘on the loop’ rather than ‘in the loop’. In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robotʼs actions and interfere if something goes wrong. Human Rights Watch [90] reports on a number of military systems which are becoming increasingly autonomous, with the human oversight for automatic weapons defense systems—designed to detect and shoot down incoming missiles and rockets—already being limited to accepting or overriding the computerʼs plan of action in a matter of seconds. Although these systems are better described as automatic, carrying out pre-programmed sequences of actions in a structured environment, than autonomous, they are a good demonstration of a situation where rapid decisions are needed and the extent of human oversight is limited. A number of militaries are considering the future use of more autonomous weapons.
The weaponisation of AI has indeed already begun, so it is not a danger that needs pointing out. It suits the military to give drones, and so forth, greater autonomy, but it also suits the military to retain overall control....they are not going to build a God AI that is also a weapon, since there is no military mileage in building a weapon that might attack you out of its own volition. So weaponised AI is limited agentive AI. Since the military want to retain overall control, they will in effect conduct their own safety research, increasing the controllability of their systems in parallel with their increasing autonomy. MIRI’s research is not very relevant to weaponised AI, because MIRI focuses on the hidden dangers of apparently benevolent AI, and on God AIs, powerful singletons.
As with a boxed AGI, there are many factors that would tempt the owners of an Oracle AI to transform it to an autonomously acting agent. Such an AGI would be far more effective in furthering its goals, but also far more dangerous.
You may be tacitly assuming that an AI is either passive, like Oracle AI, or dangerously agentive. But we already have agentive AIs that haven’t killed us.
I am making a three-way distinction between:
Non agentive AI
Limited agentive AI
Maximally agentive AI, or “God” AI.
Non agentive AI is passive, doing nothing once it has finished processing its current request. It is typified by Oracle AI.
Limited agentive AI performs specific functions, and operates under effective overrides and safety protocols.
(For instance, whilst it would destroy the effectiveness of automated trading software to have a human okaying each trade, it nonetheless has kill switches and sanity checks).
Both are examples of Tool AI. Tool AI can be used to do dangerous things, but the responsibility ultimately falls on the tool user.
Maximally agentive AI is not passive by default, and has a wide range of capabilities. It may be in charge of other AIs, or have effectors that allow it to take real world actions directly. Attempts may have been made to add safety features, but their effectiveness would be in doubt...that is just the hard problem of AI friendliness that MIRI writes so much about.
The contrary view is that there is no need to render God AIs safe technologically, because there is no incentive to build them. (Which does not mean the whole field of AI safety is pointless.)
ETA
On the other hand you may be distinguishing between limited and maximal agency, but arguing that there is a slippery slope leading from the one to the other. The political analogy shows that people are capable of putting a barrier across the slope: people are generally happy to give some power to some politicians, but resist moves to give all the power to one person.
On the other hand, people might be tempted to give AIs more power once they have a track record of reliability, but a track record of reliability is itself a kind of empirical safety proof.
There is a further argument to the effect that we are gradually giving more autonomy to agentive AIs (without moving entirely away from oracle AIs like Google), but that gradual increase is being paralleled by an incremental approach to AI safety, for instance in automated trading systems, which have been given both more ability to trade without detailed oversight, and more powerful overrides. Hypothetically, increased autonomy without increased safety measures would mean increased danger, but that is not the case in reality. I am not arguing against AI danger and safety measures overall, I am arguing against a grandiose, all-or-nothing conception of AI safety and danger.
We have some initial ideas for computable versions of curiosity and controlism (there is not a good word in English for the desire/drive to be in control).
Autonomy? Arguably that’s Greek...
I like it.
I do not believe the demand for or potential of oracle AI is remotely comparable to agentive AI. People will want agents to do their bidding, create wealth for them, help them live better, etc.
(Replying to my own text above). On consideration this is wrong—Google is an oracle-AI more or less, and there is high demand for that. The demand for agenty AI is probably much greater, but there is still a role/demand for oracle AI and a lot of other stuff in between.
So there are good reasons for thinking that installing subsets of human value would be both easier and safer.
Totally. I think this also goes hand in hand with understanding more about human values—how they evolved, how they are encoded, what is learned or not etc.
Altruism, in particular, is not needed for a limited agentive AI. Such AIs would perform specialised tasks, leaving it to humans to stitch the results into something that fulfils their values. We don’t want a Google car that takes us where it guesses we want to go.
Of course—there are many niches for more specialized or limited agentive AI, and these designs probably don’t need altruism. That’s important more for the complex general agents, which would control/manage the specialists, narrow AIs, other software, etc.
That seems to be re-introducing God AI. I think people would want to keep humans in the loop. That’s both a prediction, and a means of AI safety.
So if I spouted 100 billion true statements at you, then said, “It would be good for you to give me $100,000,” you’d pay up?
If you just said a bunch of trivial statements 1 billion times, and then demanded that I give you money, it would seem extremely suspicious. It would not fit with your pattern of behavior.
If, on the other hand, you gave useful and non-obvious advice, I would do it. Because the demand to give you money wouldn’t seem any different than all the other things you told me to do that worked out.
I mean, that’s the essence of the human concept of earning trust, and betrayal.
Yes, but expecting any reasoner to develop well-grounded abstract concepts without any grounding in features and then care about them is… well, it’s not actually complete bullshit, but expecting it to actually happen relies on solving some problems I haven’t seen solved.
You could, hypothetically, just program your AI to infer “goodness” as a causal-role concept from the vast sums of data it gains about the real world and our human opinions of it, and then “maximize goodness”, formulated as another causal role. But this requires sophisticated machinery for dealing with causal-role concepts, which I haven’t seen developed to that extent in any literature yet.
Usually, reasoners develop causal-role concepts in order to explain what their feature-level concepts are doing, and thus, causal-role concepts abstracted over concepts that don’t eventually root themselves in features are usually dismissed as useless metaphysical speculation, or at least abstract wankery one doesn’t care about.
I don’t think you are responding to the correct comment. Or at least I have no idea what you are talking about.
If those 100 billion true statements were all (or even mostly) useful and better calibrated than my own priors, then I’d be likely to believe you, so yes. On the other hand, if you replace $100,000 with $100,000,000,000, I don’t think that would still hold.
I think you found an important caveat, which is that the fact that an agent will benefit from you believing a statement weakens the evidence that the statement is true, to the point that it’s literally zero for an agent that you don’t trust at all. And whether or not an AI has a human-like architecture, I think that would still hold.
Yes, I would, assuming you don’t mean statements like “1+1 = 2”, but rather true statements spread over a variety of contexts, such that I would reasonably believe that you would be trustworthy to that degree over random situations (and thus including questions such as whether I should give you money).
(Also, the 100 billion true statements themselves would probably be much more valuable than $100,000).
Yes, I would, assuming you don’t mean statements like “1+1 = 2”, but rather true statements spread over a variety of contexts, such that I would reasonably believe that you would be trustworthy to that degree over random situations (and thus including questions such as whether I should give you money).
According to game theory, this opens you to exploitation by an agent that wants your money for its own gain and can generate 100 billion true statements at little cost.
You may already be doing this, giving money to people whose claims you believe yourself.