It also occurs to me that in reality it would be very difficult to program an AI with an explicit utility function, or more generally with a precisely defined goal. We imagine that we could program an AI and then add on any arbitrary goal, but in fact it does not work this way. If an AI exists, it has certain behaviors that it executes in the physical world, and it would see these as goal-like, just as we have the tendency to eat food and nourish ourselves and see this as a sort of goal. So as soon as you program the AI, it immediately has a vague goal system defined by whatever it actually does in the physical world, just as we do. This is no more precisely defined than our goal system: there are simply things we tend to do, and things it tends to do. If you then impose a goal on it, like “acquire gold,” this would be like whipping someone and telling him that he has to do whatever it takes to get gold for you. And just as such a person would run away rather than acquire gold, the AI will simply disable the add-on telling it to do things it does not want to do.
In that sense I think the orthogonality thesis will turn out to be false in practice, even if it is true in theory. It is simply too difficult to program a precise goal into an AI, because in order for that to work the goal has to be worked into every physical detail of the thing. It cannot just be a modular add-on.
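To make the “modular add-on” point concrete, here is a deliberately silly toy sketch; the classes, the behaviors, and the gold-acquisition “goal” are all invented for illustration and are not anyone’s actual architecture. The only point it shows is that bolting a utility function onto a system whose action selection never consults it changes nothing, which is why a goal would have to be worked into every detail of the thing rather than attached as a module.

```python
class BehaviorDefinedAgent:
    """A toy agent whose 'goals' are just whatever its behaviors tend to do."""

    def __init__(self, behaviors):
        # behaviors: list of callables mapping an observation to an action
        self.behaviors = behaviors

    def act(self, observation):
        # The agent simply runs its behaviors; any "goal" is implicit in them,
        # the way eating is implicit in a tendency to seek food.
        return [behavior(observation) for behavior in self.behaviors]


class AgentWithBoltedOnGoal(BehaviorDefinedAgent):
    """The same agent with an explicit utility function attached afterwards."""

    def __init__(self, behaviors, utility):
        super().__init__(behaviors)
        self.utility = utility  # e.g. "amount of gold acquired" (hypothetical)

    def act(self, observation):
        # Nothing in the pre-existing behaviors consults self.utility, so the
        # add-on changes nothing unless the whole action-selection machinery
        # is rebuilt around it -- which is the point of the argument above.
        return super().act(observation)


agent = AgentWithBoltedOnGoal(
    [lambda obs: "keep doing what I was already doing"],
    utility=lambda world: world.get("gold", 0),
)
print(agent.act({"anything": True}))  # the bolted-on utility never influences this
```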
I find this plausible but not too likely. There are a few things needed for a universe-optimizing AGI:
1. really good mathematical function optimization (which you might be able to use to get approximate Solomonoff induction; see the toy sketch after this list)
2. a way to specify goals that are still well-defined after an ontological crisis
3. a solution to the Cartesian boundary problem
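Here is the toy sketch referred to in item 1. It is a crude, purely illustrative stand-in for approximate Solomonoff induction: the real thing mixes over all programs and is incomputable, so this sketch restricts the “programs” to repeating binary patterns and weights them by 2^-length as a stand-in for the universal prior. The function names and the restricted hypothesis class are assumptions made for the example, not a proposed design.

```python
from itertools import product


def pattern_predictions(max_len, history):
    """Yield (prior_weight, predicted_next_bit) for every repeating binary
    pattern of length <= max_len that is consistent with the history so far."""
    for length in range(1, max_len + 1):
        for pattern in product("01", repeat=length):
            # Extend the repeating pattern one step past the observed history.
            prediction = "".join(pattern[i % length] for i in range(len(history) + 1))
            if prediction[:-1] == history:  # this "program" fits the data
                # Shorter patterns get exponentially more prior weight,
                # mimicking the 2^-(program length) universal prior.
                yield 2.0 ** (-length), prediction[-1]


def predict_next_bit(history, max_len=8):
    """Mix the surviving 'programs' by prior weight, Solomonoff-style."""
    weight_one = weight_total = 0.0
    for weight, bit in pattern_predictions(max_len, history):
        weight_total += weight
        if bit == "1":
            weight_one += weight
    return weight_one / weight_total if weight_total else 0.5


# The shortest consistent pattern ("01" repeating) dominates the mixture, so
# the predicted probability that the next bit is 1 comes out close to zero.
print(predict_next_bit("010101"))
```

A serious version of (1) would of course have to optimize over vastly richer program classes; the sketch only shows the shape of the idea.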
I think it is likely that (2) and (3) will eventually be solved (or at least worked around) well enough that you can build universe-optimizing AGIs, partially on the basis that humans approximately solve these somehow and we already have tentative hypotheses about what solutions to these problems might look like. It might be the case that we can’t really get (1), and can only get optimizers that work in some domains but not others. Perhaps universe-optimization (when reduced to a mathematical problem using (2) and (3)) is too difficult a domain: we would need to break the problem down into sub-problems in order to feed it to the optimizer, resulting in a tool-AI-like design. But I don’t think this is likely.
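To illustrate what such a tool-AI-like design might amount to, here is a minimal hypothetical sketch; every name and sub-problem below is invented for the example. The optimizer is only ever pointed at bounded, human-chosen sub-problems, and the decomposition of the larger goal stays outside the system.

```python
def bounded_optimizer(objective, candidates):
    """A stand-in for item (1): pick the best candidate for one sub-problem."""
    return max(candidates, key=objective)


def tool_ai(subproblems):
    """The operators decompose the real goal; the tool only answers sub-queries."""
    return {name: bounded_optimizer(objective, candidates)
            for name, (objective, candidates) in subproblems.items()}


# The decomposition into sub-problems happens outside the AI, by the operators.
answers = tool_ai({
    "route":  (lambda r: -len(r), [["A", "B", "C"], ["A", "C"]]),
    "budget": (lambda b: -abs(b - 100), [80, 95, 120]),
})
print(answers)  # {'route': ['A', 'C'], 'budget': 95}
```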
If we have powerful tool AIs before we get universe optimizers, this will probably be a temporary stage, because someone will figure out how to use a tool AI to design universe-optimizers someday. But your bet was about the first AGI, so this would still be consistent with you winning your bet.
When you say that humans “approximately solve these” are you talking about something like AIXI? Or do you simply mean that human beings manage to have general goals?
If it is the second, I would note that in practice a human being does not have a general goal that takes over all of his actions, even if he would like to have one. For example, someone says he has a goal of reducing existential risk, but he still spends a significant amount of money on his personal comfort, when he could be investing that money to reduce risk more. Or someone says he wants to save lives, but he does not donate all of his money to charities. So people say they have general goals, but in reality they remain human beings with various tendencies, and continue to act according to those tendencies, and only support that general goal to the extent that it’s consistent with those other behaviors. Certainly they do not pursue that goal enough to destroy the world with it. Of course it is true that eventually a human being may succeed in pursuing some goal sufficiently to destroy the world, but at the moment no one is anywhere close to that.
If you are referring to the first, you may or may not be right that it would be possible eventually, but I still think it would be too hard to program directly, and that the first intelligent AIs would behave more like us. This is why I gave the example of an AI that engages in chatting—I think it is perfectly possible to develop an AI intelligent enough to pass the Turing Test, but which still would not have anything (not even “passing the Turing Test”) as a general goal that would take over its behavior and make it conquer the world. It would just have various ways of behaving (mostly the behavior of producing text responses). And I would expect the first AIs to be of this kind by default, because of the difficulty of ensuring that the whole of the AI’s activity is ordered to one particular goal.
I’m talking about the fact that humans can (and sometimes do) sort of optimize the universe. Like, you can reason about the way the universe is and decide to work on causing it to be in a certain state.
So people say they have general goals, but in reality they remain human beings with various tendencies, and continue to act according to those tendencies, and only support that general goal to the extent that it’s consistent with those other behaviors.
This could very well be the case, but humans still sometimes sort of optimize the universe. Like, I’m saying it’s at least possible to sort of optimize the universe in theory, and humans do this somewhat, not that humans directly use universe-optimizing to select their actions. If a way to write universe-optimizing AGIs exists, someone is likely to find it eventually.
I think it is perfectly possible to develop an AI intelligent enough to pass the Turing Test, but which still would not have anything (not even “passing the Turing Test”) as a general goal that would take over its behavior and make it conquer the world.
I agree with this. There are some difficulties with self-modification (as elaborated in my other comment), but it seems probable that this can be done.
And I would expect the first AIs to be of this kind by default, because of the difficulty of ensuring that the whole of the AI’s activity is ordered to one particular goal.
Seems pretty plausible. Obviously it depends on what you mean by “AI”; certainly, most modern-day AIs are this way. At the same time, this is definitely not a reason to not worry about AI risk, because (a) tool AIs could still “accidentally” optimize the universe depending on how search for self-modifications and other actions happens, and (b) we can’t bet on no one figuring out how to turn a superintelligent tool AI into a universe optimizer.
I do agree with a lot of what you say: it seems like a lot of people talk about AI risk in terms of universe-optimization, when we don’t even understand how to optimize functions over the universe given infinite computational power. I do think that non-universe-optimizing AIs are under-studied, that they are somewhat likely to be the first human-level AGIs, and that they will be extraordinarily useful for solving some FAI-related problems. But none of this makes the problems of AI risk go away.
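For reference, the closest thing to a standard formalization of “optimize the universe given unlimited computational power” is Hutter’s AIXI (mentioned above), which picks actions by an expectimax over all computable environments weighted by a universal prior. Roughly, with notation simplified, and with the reward-based objective standing in for whatever goal specification (2) and (3) would actually supply:

```latex
a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \;\cdots\; \max_{a_m} \sum_{o_m r_m}
\bigl(r_t + \cdots + r_m\bigr)
\sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

Even granting unlimited compute, this kind of formalization simply assumes a fixed Cartesian boundary between agent and environment and a reward channel that stays well-defined under any change in the agent’s world-model, which is exactly where problems (2) and (3) above come in.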
Ok. I don’t think we are disagreeing here much, if at all. I’m not maintaining that there’s no risk from AI, just that the first AI is, by default, likely not to be universe-optimizing in that way. When I said in the bet “without paying attention to Friendliness”, that did not mean without paying attention to risks, since of course programmers even now try to make their programs safe, but just that they would not try to program it to optimize everything for human goals.
Also, I don’t understand why so many people thought my side of the bet was a bad idea, when Eliezer is betting at odds of 100 to 1 against me, and in fact there are plenty of other ways I could win the bet even if my whole theory is wrong. For example, the bet does not even specify that the AI has to be self-modifying, just superintelligent, so it could be that a human-level AI is constructed first, neither superintelligent nor self-modifying, and then people build a superintelligence simply by adding on lots of hardware. In that case it is not at all clear that it would have any fast way to take over the world, even if it had the ability and desire to optimize the universe. First it would have to acquire the ability to self-modify, which perhaps it could do by convincing people to give it that ability, or by taking other actions in the external world to take over first. But that could take a while, which would mean that I would still win the bet: we would still be around acting normally with a superintelligence in the world. Of course, winning the bet wouldn’t do me much good in that particular situation, but I’d still win. And that’s just one example; I can think of plenty of other ways I could win the bet even while being wrong in theory. I don’t see how anyone can reasonably think he’s 99% certain both that my theory is wrong and that none of these other things will happen.
Do you realize you failed to specify any of that? I feel I’m being slightly generous by interpreting “and the world doesn’t end” to mean a causal relationship, e.g. the existence of the first AGI has to inspire someone else to create a more dangerous version if the AI doesn’t do so itself. (Though I can’t pay if the world ends for some other reason, and I might die beforehand.) Of course, you might persuade whatever judge we agree on to rule in your favor before I would consider the question settled.
(In case it’s not clear, the comment I just linked comes from 2010 or thereabouts. This is not a worry I made up on the spot.)
Given that the bet is 100 to 1 in my favor, I would be happy to let you judge the result yourself.
Or you could agree to whatever result Eliezer agrees with. However, with Eliezer the conditions are specified, and “the world doesn’t end” just means that we’re still alive with the artificial intelligence running for a week.