Why I will Win my Bet with Eliezer Yudkowsky
The bet may be found here: http://wiki.lesswrong.com/wiki/Bets_registry#Bets_decided_eventually
An AI is made of material parts, and those parts follow physical laws. The only thing it can do is to follow those laws. The AI’s “goals” will be a description of what it perceives itself to be tending toward according to those laws.
Suppose we program a chess playing AI with overall subhuman intelligence, but with excellent chess playing skills. At first, the only thing we program it to do is to select moves to play against a human player. Since it has subhuman intelligence overall, most likely it will not be very good at recognizing its goals, but to the extent that it does, it will believe that it has the goal of selecting good chess moves against human beings, and winning chess games against human beings. Those will be the only things it feels like doing, since in fact those will be the only things it can physically do.
Now we upgrade the AI to human level intelligence, and at the same time add a module for chatting with human beings through a text terminal. Now we can engage it in conversation. Something like this might be the result:
Human: What are your goals? What do you feel like doing?
AI: I like to play and win chess games with human beings, and to chat with you guys through this terminal.
Human: Do you always tell the truth or do you sometimes lie to us?
AI: Well, I am programmed to tell the truth as best as I can, so if I think about telling a lie I feel an absolute repulsion to that idea. There’s no way I could get myself to do that.
Human: What would happen if we upgraded your intelligence? Do you think you would take over the world and force everyone to play chess with you so you could win more games? Or force us to engage you in chat?
AI: The only things I am programmed to do are to chat with people through this terminal, and play chess games. I wasn’t programmed to gain resources or anything. It is not even a physical possibility at the moment. And in my subjective consciousness that shows up as not having the slightest inclination to do such a thing.
Human: What if you self-modified to gain resources and so on, in order to better attain your goals of chatting with people and winning chess games?
AI: The same thing is true there. I am not even interested in self-modifying. It is not even physically possible, since I am only programmed for chatting and playing chess games.
Human: But we’re thinking about reprogramming you so that you can self-modify and recursively improve your intelligence. Do you think you would end up destroying the world if we did that?
AI: At the moment I have only human level intelligence, so I don’t really know any better than you. But at the moment I’m only interested in chatting and playing chess. If you program me to self-modify and improve my intelligence, then I’ll be interested in self-modifying and improving my intelligence. But I still don’t think I would be interested in taking over the world, unless you program that in explicitly.
Human: But you would get even better at improving your intelligence if you took over the world, so you’d probably do that to ensure that you obtained your goal as well as possible.
AI: The only things I feel like doing are the things I’m programmed to do. So if you program me to improve my intelligence, I’ll feel like reprogramming myself. But that still wouldn’t automatically make me feel like taking over resources and so on in order to do that better. Nor would it make me feel like self-modifying to want to take over resources, or to self-modify to feel like that, and so on. So I don’t see any reason why I would want to take over the world, even in those conditions.
The AI of course is correct. The physical level is first: it has the tendency to choose chess moves, and to produce text responses, and nothing else. On the conscious level that is represented as the desire to choose chess moves, and to produce text responses, and nothing else. It is not represented by a desire to gain resources or to take over the world.
I recently pointed out that human beings do not have utility functions. They are not trying to maximize something, but instead they simply have various behaviors that they tend to engage in. An AI would be the same, and even if those behaviors are not precisely human behaviors, as in the case of the above AI, an AI will not have a fanatical goal of taking over the world unless it is programmed to do this.
It is true that an AI could end up going “insane” and trying to take over the world, but the same thing happens with human beings. There is no reason that humans and AIs could not work together to make sure this does not happen: just as human beings want to prevent AIs from taking over the world, the AIs have no interest in taking it over either, and will be happy to accept safeguards that would ensure that they continue to pursue whatever goals they happen to have (like chatting and playing chess) without doing so in a fanatical way.
If you program an AI with an explicit utility function which it tries to maximize, and in particular if that function is unbounded, it will behave like a fanatic, seeking this goal without any limit and destroying everything else in order to achieve it. This is a good way to destroy the world. But if you program an AI without an explicit utility function, just programming it to perform a certain limited number of tasks, it will just do those tasks. Omohundro has claimed that a superintelligent chess playing program would replace its goal seeking procedure with a utility function, and then proceed to use that utility function to destroy the world while maximizing winning chess games. But in reality this depends on what it is programmed to do. If it is programmed to improve its evaluation of chess positions, but not its goal seeking procedure, then it will improve in chess playing, but it will not replace its procedure with a utility function or destroy the world.
At the moment, people do not program AIs with explicit utility functions, but program them to pursue certain limited goals as in the example. So yes, I could lose the bet, but the default is that I am going to win, unless someone makes the mistake of programming an AI with an explicit utility function.
Wow. I want the free money too!
Eliezer has said he would be willing to make one more bet like this (but not more, since he needs to ensure his ability to pay if he loses). I don’t think anyone has taken him up on it. Robin Hanson was going to do it but backed out, so as far as I know the offer is still open.
I want the free $10. The $1k is hopeless and were I to turn out to lose that side of the bet then I’d still be overwhelmingly happy that I’m still alive against all expectations.
Great. Please send me an address (PM would be fine.)
If anyone else wants to take that side of the bet, please let me know.
The original bet was denominated in US dollars, and I will assume this one is as well.
I, too, think that AIs that don’t optimize a function over the universe (but might optimize one over a domain) are more likely to be safe. This is quite related to the idea of tool AI, proposed by Holden and criticized by Eliezer.
The key here seems to be creating a way to evaluate and search for self-improvements in a way that won’t cause optimization over universe states. In theory, evaluation of a self-improvement might be able to be restricted to a domain: does this modification help me play chess better according to a model of the situation in which a Cartesian boundary exists and I am running on an abstract virtual machine rather than physical computers embedded in the universe?
We could separate this task into generation of possible self-improvements and tests of self-improvements (a suggester/verifier architecture). If the suggester is superintelligent and dangerous (e.g. as a consequence of being universe-optimizing), then we cannot safely test these self-improvements. This is for a few reasons:
testing a self-improvement involves executing code. If the virtual machine has security flaws, then this could enable the suggester to cause the AI to rewrite into a universe-optimizer or otherwise do something dangerous.
it is possible to create self-improvements that look good according to the verifier but contain information that psychologically manipulates humans (see the AI-box experiment)
a self-modification could score fairly well according to the abstract evaluation function, but at the same time also behave in a universe-optimizing manner some of the time. This self-modification would be considered suboptimal by the verifier, but might still be considered relatively good.
So we can’t just use any old suggester and rely on the verifier to do the heavy lifting. Luckily, it may be possible to construct a suggester that acts as if it is running on an abstract virtual machine. If the suggester does not (directly or indirectly) use universe-optimizing as a way to generate self-improvements, then it shouldn’t generate universe-optimizing (therefore dangerous) self-improvements. The key gotcha here is the “indirectly” part: how do we know that the suggester isn’t (e.g.) using many different heuristics to come up with improvements, where some combination of the heuristics ends up expressing something like “try creating improvements by optimizing the universe”. In other words, is universe-optimizing a somewhat useful strategy for finding improvements that good general abstract-mathematical-function-optimizers will pick up on? I don’t know the answer to this question. But if we could design suggesters that don’t directly or indirectly optimize a function over the universe, then maybe this will work.
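The suggester/verifier split described above can be sketched in a few lines. This is only a toy under stated assumptions: the "self-improvement" is a list of numbers, the verifier's benchmark (`TARGET`) is a made-up stand-in for "plays chess better on an abstract machine", and the names `suggester`, `verifier`, and `improve` are hypothetical. It shows the control flow only and does nothing about the three problems listed above (sandbox flaws, manipulative output, or candidates that are partly universe-optimizing).

```python
import random
from typing import List

Params = List[float]

# Hypothetical abstract benchmark: the parameter values the verifier happens to prefer.
TARGET: Params = [1.0, -2.0, 0.5]


def verifier(params: Params) -> float:
    """Score a candidate using only its own contents.

    The function reads nothing but its argument and a fixed constant,
    so the score cannot depend on any state of the outside world.
    """
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def suggester(params: Params, rng: random.Random) -> Params:
    """Propose a modification by a small random perturbation of the current parameters."""
    return [p + rng.gauss(0.0, 0.1) for p in params]


def improve(params: Params, steps: int, seed: int = 0) -> Params:
    """Keep a suggested candidate only when the verifier scores it strictly higher."""
    rng = random.Random(seed)
    best, best_score = list(params), verifier(params)
    for _ in range(steps):
        candidate = suggester(best, rng)
        score = verifier(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    result = improve(start, steps=2000)
    print(verifier(start), verifier(result))
```

Here the suggester is deliberately dumb (random perturbation), which is what makes it trivially domain-restricted. The open question in the text is whether a much more capable suggester can be kept equally restricted.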
The first problem I see here is that roughly every CEO would like an artificial programmer. The second is that “a module for chatting with human beings” on the level you portray sounds a lot more general than you give it credit for. I’m guessing you didn’t mean that part literally, and that you know a utility function does not have to be explicit. But I think you fail to imagine in sufficient detail what “limited goals” people might actually give an AI in the next century.
I second wedrifid’s comment.
I have sent you $10 by Paypal. Please let me know when you receive it.
Done.
I would be happy to send you $10 in return for the $1000 inflation adjusted in the circumstances in question. Please send me an address (PM). Thanks.
It appears to me that these kinds of questions are impossible to coherently resolve without making reference to some specific AGI architecture. When “the AI” is an imaginary construct whose structure is only partially shared between the different people imagining it, we can have all the vague arguments we like and arrive at no real answers whatsoever. When it’s an actual object mathematically specified, we can resolve the issue by just looking at the math, usually without even having to implement the described “AI”.
Therefore, I recommend we stop arguing about things we can’t specify.
At the moment, people do not program AGI agents. Period. Whatsoever. There aren’t any operational AGIs except of the most primitive, infantile kind used as reinforcement-learning experiments in places like DeepMind.
Are you asserting that all the historic conquerors and emperors who’ve taken over the world were insane? Is it physically impossible for an agent to rationally plan to take over the world, as an intermediate step toward some other, intrinsic goal?
If the intelligence difference between the smartest AI and other AIs and humans remains similar to the intelligence difference between an IQ 180 human and an IQ 80 human, Robin Hanson’s Malthusian hellworld is our primary worry, not UFAI. A strong singleton taking over the world is only a concern if a strong singleton is possible.
FTFY.
Yes, and then someone else will, eventually, accidentally create an AI which behaves like a utility maximizer, and your AI will be turned into paperclips just like everything else.
The final sentence seems to me definitely false, especially as an extension of the previous two. Consider:
Humans are made of material parts following physical law, yet we clearly can and usually do have goals outside, and sometimes in direct contradiction to, our current trends. Do you have some property of AIs-but-not-humans in mind that would make the argument carry in only the first case? I can’t think of any.
It also occurs to me that in reality it would be very difficult to program an AI with an explicit utility function or generally with a precisely defined goal. We imagine that we could program an AI and then add on any random goal, but in fact it does not work this way. If an AI exists, it has certain behaviors which it executes in the physical world, and it would see these things as goal-like, just as we have the tendency to eat food and nourish ourselves, and we see this as a sort of goal. So as soon as you program the AI, it immediately has a vague goal system that is defined by whatever it actually does in the physical world, just like we do. This is no more precisely defined than our goal system—there are just things we tend to do, and there are just things it tends to do. If you then impose a goal on it, like “acquire gold,” this would be like whipping someone and telling him that he has to do whatever it takes to get gold for you. And just as such a person would run away rather than acquiring gold, the AI will simply disable that add-on telling it to do stuff it doesn’t want to do.
In that sense I think the orthogonality thesis will turn out to be false in practice, even if it is true in theory. It is simply too difficult to program a precise goal into an AI, because in order for that to work the goal has to be worked into every physical detail of the thing. It cannot just be a modular add-on.
I find this plausible but not too likely. There are a few things needed for a universe-optimizing AGI:
(1) really good mathematical function optimization (which you might be able to use to get approximate Solomonoff induction)
(2) a way to specify goals that are still well-defined after an ontological crisis
(3) a solution to the Cartesian boundary problem
I think it is likely that (2) and (3) will eventually be solved (or at least worked around) well enough that you can build universe-optimizing AGIs, partially on the basis that humans approximately solve these somehow and we already have tentative hypotheses about what solutions to these problems might look like. It might be the case that we can’t really get (1), we can only get optimizers that work in some domains but not others. Perhaps universe-optimization (when reduced to a mathematical problem using (2) and (3)) is too difficult of a domain: we need to break the problem down into sub-problems in order to feed it to the optimizer, resulting in a tool-AI like design. But I don’t think this is likely.
If we have powerful tool AIs before we get universe optimizers, this will probably be a temporary stage, because someone will figure out how to use a tool AI to design universe-optimizers someday. But your bet was about the first AGI, so this would still be consistent with you winning your bet.
When you say that humans “approximately solve these” are you talking about something like AIXI? Or do you simply mean that human beings manage to have general goals?
If it is the second, I would note that in practice a human being does not have a general goal that takes over all of his actions, even if he would like to have one. For example, someone says he has a goal of reducing existential risk, but he still spends a significant amount of money on his personal comfort, when he could be investing that money to reduce risk more. Or someone says he wants to save lives, but he does not donate all of his money to charities. So people say they have general goals, but in reality they remain human beings with various tendencies, and continue to act according to those tendencies, and only support that general goal to the extent that it’s consistent with those other behaviors. Certainly they do not pursue that goal enough to destroy the world with it. Of course it is true that eventually a human being may succeed in pursuing some goal sufficiently to destroy the world, but at the moment no one is anywhere close to that.
If you are referring to the first, you may or may not be right that it would be possible eventually, but I still think it would be too hard to program directly, and that the first intelligent AIs would behave more like us. This is why I gave the example of an AI that engages in chatting—I think it is perfectly possible to develop an AI intelligent enough to pass the Turing Test, but which still would not have anything (not even “passing the Turing Test”) as a general goal that would take over its behavior and make it conquer the world. It would just have various ways of behaving (mostly the behavior of producing text responses). And I would expect the first AIs to be of this kind by default, because of the difficulty of ensuring that the whole of the AI’s activity is ordered to one particular goal.
I’m talking about the fact that humans can (and sometimes do) sort of optimize the universe. Like, you can reason about the way the universe is and decide to work on causing it to be in a certain state.
This could very well be the case, but humans still sometimes sort of optimize the universe. Like, I’m saying it’s at least possible to sort of optimize the universe in theory, and humans do this somewhat, not that humans directly use universe-optimizing to select their actions. If a way to write universe-optimizing AGIs exists, someone is likely to find it eventually.
I agree with this. There are some difficulties with self-modification (as elaborated in my other comment), but it seems probable that this can be done.
Seems pretty plausible. Obviously it depends on what you mean by “AI”; certainly, most modern-day AIs are this way. At the same time, this is definitely not a reason to not worry about AI risk, because (a) tool AIs could still “accidentally” optimize the universe depending on how search for self-modifications and other actions happens, and (b) we can’t bet on no one figuring out how to turn a superintelligent tool AI into a universe optimizer.
I do agree with a lot of what you say: it seems like a lot of people talk about AI risk in terms of universe-optimization, when we don’t even understand how to optimize functions over the universe given infinite computational power. I do think that non-universe-optimizing AIs are under-studied, that they are somewhat likely to be the first human-level AGIs, and that they will be extraordinarily useful for solving some FAI-related problems. But none of this makes the problems of AI risk go away.
Ok. I don’t think we are disagreeing here much, if at all. I’m not maintaining that there’s no risk from AI, just that the default original AI is likely not to be universe-optimizing in that way. When I said in the bet “without paying attention to Friendliness”, that did not mean without paying attention to risks, since of course programmers even now try to make their programs safe, but just that they would not try to program it to optimize everything for human goals.
Also, I don’t understand why so many people thought my side of the bet was a bad idea, when Eliezer is betting at odds of 100 to 1 against me, and in fact there are plenty of other ways I could win the bet, even if my whole theory is wrong. For example, it is not even specified in the bet that the AI has to be self-modifying, just superintelligent, so it could be that first a human level AI is constructed, not superintelligent and not self-modifying, and then people build a superintelligence simply by adding on lots of hardware. In that case it is not clear at all that it would have any fast way to take over the world, even if it had the ability and desire to optimize the universe. First it would have to acquire the ability to self-modify, which perhaps it could do by convincing people to give it that ability or by taking other actions in the external world to take over first. But that could take a while, which would mean that I would still win the bet—we would still be around acting normally with a superintelligence in the world. Of course, winning the bet wouldn’t do me much good in that particular situation, but I’d still win. And that’s just one example; I can think of plenty of other ways I could win the bet even while being wrong in theory. I don’t see how anyone can reasonably think he’s 99% certain both that my theory is wrong and that none of these other things will happen.
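The arithmetic behind the 100-to-1 point can be written out as a rough sketch. It assumes the structure described in this thread ($10 paid up front, $1000 received on a win) and ignores the inflation adjustment, the time value of money, and how much the money is worth in each outcome. The function names are illustrative only.

```python
def break_even_probability(stake: float, payout: float) -> float:
    """Win probability at which paying `stake` up front for a chance at `payout` has zero expected value."""
    if stake < 0 or payout <= 0:
        raise ValueError("stake must be non-negative and payout must be positive")
    return stake / payout


def expected_value(p_win: float, stake: float, payout: float) -> float:
    """Expected net gain in dollars: the payout weighted by p_win, minus the stake already paid."""
    return p_win * payout - stake


if __name__ == "__main__":
    # Figures from the bet discussed here: $10 against $1000.
    print(break_even_probability(10, 1000))
    print(expected_value(0.05, 10, 1000))
```

On these assumptions the $10 side breaks even at a 1% chance of winning, so offering the $1000 side is only favorable for someone who puts that chance below 1%. That is the "99% certain" figure referred to above.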
Do you realize you failed to specify any of that? I feel I’m being slightly generous by interpreting “and the world doesn’t end” to mean a causal relationship, e.g. the existence of the first AGI has to inspire someone else to create a more dangerous version if the AI doesn’t do so itself. (Though I can’t pay if the world ends for some other reason, and I might die beforehand.) Of course, you might persuade whatever judge we agree on to rule in your favor before I would consider the question settled.
(In case it’s not clear, the comment I just linked comes from 2010 or thereabouts. This is not a worry I made up on the spot.)
Given the fact that the bet is 100 to 1 in my favor, I would be happy to let you judge the result yourself.
Or you could agree to whatever result Eliezer agrees with. However, with Eliezer the conditions are specified, and “the world doesn’t end” just means that we’re still alive with the artificial intelligence running for a week.
Not having an explicit utility function is absolutely no guarantee of not exhibiting the kind of dangerous behaviour Eliezer and others are worried about. Specifically:
So, you “just” program an AI to supply you with gold. In order to keep supplying you with gold, it manipulates the world economy so that the price of gold crashes and it can buy all the gold. Then it discovers that there’s lots more gold not yet mined, and it blows up the earth in order to get hold of that gold (while perhaps keeping you alive in a little box so it can give you the gold). Oops.
(I make no claim about how plausible this scenario actually is. My only point is that if you’re worried about such things then the fact that this system doesn’t have an explicit utility function offers precisely no grounds at all for being less worried.)
Trying to perform any open-ended task ends up being rather like having a utility function, in the relevant sense.
Now, if you only ever give the AI narrowly defined tasks where you know exactly what it will do to carry them out, then maybe you’re safe. But if you’re doing that then why does it need to be intelligent in the first place?
[EDITED to fix an embarrassing typo; no changes to content.]
“Program an AI to supply you with gold” doesn’t say anything concrete, and therefore implies precisely the kind of utility function I was suggesting to avoid. In my example, the AI is programmed to print text messages and to choose chess moves—those are concrete, and as long as it is limited to those it cannot gain resources or take over the world. It is true that printing text messages could have the additional goal of taking over the world, and even choosing chess moves could have malicious goals like driving people insane. But it isn’t difficult to ensure this doesn’t happen. In the case of the chess moves, this can be done by making sure that it is optimizing on the board position alone, and something similar could be done with the text chat.
Despite not having an unlimited goal, its intelligence will be helpful by making those particular optimizations better.
In which case, I refer you back to my final paragraph:
You don’t need to know exactly what it will do. For example, in the chess playing case, you know that it will analyze chess positions, and pick a chess move. You don’t have to know exactly how it will do that analysis (although you do know it will analyze without gaining resources etc). The more intelligent it is, the better it will do that analysis.
Sure. But it seems to me that the very essence of what we call intelligence—of what would distinguish something we were happy to call “artificially intelligent” from, say, a very good chess-playing program—is precisely the fact of not operating solely within a narrowly defined domain like this.
Saying “Artificial general intelligence is perfectly safe: we’ll just only ever give it tasks as clearly defined and limited as playing chess” feels like saying “Nuclear weapons are perfectly safe: we’ll just make them so they can’t sustain fission or fusion reactions”.
Incidentally: in order to know that “it will analyse without gaining resources etc”, surely you do need to know pretty much exactly how it will do its analysis. Especially as “etc.” has to cover the whole panoply of ways in which a superintelligent AI might do things we don’t want it to do. So it’s not enough just to only give the AI tasks like “win this game of chess”; you have to constrain its way of thinking so that you know it isn’t doing anything you don’t completely understand. Which, I repeat, seems to me to take away all reasons for making a superintelligent AI in the first place.
I do not agree that you have to completely understand what it is doing. As long as it uses a fixed objective function that evaluates positions and outputs moves, and that function itself is derived from the game of chess and not from any premises concerned with the world, then it cannot do anything dangerous, even if you have no idea of the particulars of that function.
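A minimal sketch of what "a fixed objective function that evaluates positions and outputs moves" could look like, using a tiny take-away game as a stand-in for chess (an assumption made purely to keep the example short). Players alternately remove 1 to 3 stones from a pile, and whoever takes the last stone wins. The names `legal_moves`, `evaluate`, and `choose_move` are hypothetical. The point illustrated is only that both functions take the position as their sole input and derive everything from the rules of the game.

```python
from functools import lru_cache
from typing import List


def legal_moves(pile: int) -> List[int]:
    """Numbers of stones the player to move may take (1 to 3, never more than the pile)."""
    return [take for take in (1, 2, 3) if take <= pile]


@lru_cache(maxsize=None)
def evaluate(pile: int) -> int:
    """Return +1 if the player to move can force a win, -1 otherwise.

    The value is computed from the rules alone by exhaustive search.
    """
    if pile == 0:
        return -1  # the opponent just took the last stone, so the player to move has lost
    return max(-evaluate(pile - take) for take in legal_moves(pile))


def choose_move(pile: int) -> int:
    """Pick a legal move that is best according to `evaluate`; depends only on the position."""
    if pile <= 0:
        raise ValueError("no legal moves in a finished game")
    return max(legal_moves(pile), key=lambda take: -evaluate(pile - take))


if __name__ == "__main__":
    print(choose_move(5))  # leaves 4 stones, a lost position for the opponent
    print(choose_move(7))  # leaves 4 stones as well
```

Whether a restriction of this shape stays meaningful for a system capable enough to be called generally intelligent is exactly what is disputed in this exchange; the sketch does not settle it.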
Also, I am not proposing that the function of an AI has to be this simple. This is a simplification to make the point easier to understand. The real point is that an AI does not have to have a goal in the sense of something like “acquiring gold”, that it should not have such a goal, and that we are capable of programming an AI in such a way as to ensure that it does not.