What the “dumb superintelligence” argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also fail to predict behavior that is harmful to the robot’s own power.
To some extent. Perhaps it would be helpful to distinguish four different kinds of defeater:
early intelligence defeater: We try to build a seed AI, but our self-rewriting AI quickly hits a wall or explodes. This is most likely if we start with a subhuman intelligence and have serious resource constraints (so we can’t, e.g., just run an evolutionary algorithm over millions of copies of the AGI until we happen upon a variant that works).
late intelligence defeater: The seed AI works just fine, but at some late stage, when it’s already at or near superintelligence, it suddenly explodes. Apparently it went down a blind alley at some point early on that led it to plateau or self-destruct later on, and neither it nor humanity is smart enough yet to figure out where exactly the problem arose. So the FOOM fizzles.
early Friendliness defeater: From the outset, the seed AI’s behavior already significantly diverges from Friendliness.
late Friendliness defeater: The seed AI starts off as a reasonable approximation of Friendliness, but as it approaches superintelligence its values diverge from anything we’d consider Friendly, either because it wasn’t previously smart enough to figure out how to self-modify while keeping its values stable, or because it was never perfectly Friendly and the new circumstances its power puts it in now make the imperfections much more glaring.
In general, late defeaters are much harder for humans to understand than early defeaters, because an AI undergoing FOOM is too fast and complex to be readily understood. Your three main arguments, if I’m understanding them, have been:
(a) Early intelligence defeaters are so numerous that there’s no point thinking much about other kinds of defeaters yet.
(b) Friendliness defeaters imply a level of incompetence on the programmers’ part that strongly suggests intelligence defeaters will arise in the same situation.
(c) If an initially somewhat-smart AI is smart enough to foresee and avoid late intelligence defeaters, then an initially somewhat-nice AI should be smart enough to foresee and avoid late Friendliness defeaters.
I reject (a), because I haven’t seen any specific reason a self-improving AGI will be particularly difficult to make FOOM—‘it would require lots of complicated things to happen’ is very nearly a fully general argument against any novel technology, so I can’t get very far on that point alone. I accept (b), at least for a lot of early defeaters. But my concern is that while non-Friendliness predicts non-intelligence (and non-intelligence predicts non-Friendliness), intelligence also predicts non-Friendliness.
But our interesting disagreement seems to be over (c). Interesting because it illuminates general differences between the basic idea of a domain-general optimization process (intelligence) and the (not-so-)basic idea of Everything Humans Like. One important difference is that if an AGI optimizes for anything, it will have strong reason to steer clear of possible late intelligence defeaters. Late Friendliness defeaters, on the other hand, won’t scare optimization-process-optimizers in general.
It’s easy to see in advance that most beings that lack obvious early Friendliness defeaters will nonetheless have late Friendliness defeaters. In contrast, it’s much less clear that most beings lacking early intelligence defeaters will have late intelligence defeaters. That’s extremely speculative at this point—we simply don’t know what sorts of intelligence-destroying attractors might exist out there, or what sorts of paradoxes and complications are difficult v. trivial to overcome.
there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this.
But, once again, it doesn’t take any stupidity on the AI’s part to disvalue physically injuring a human, even if it does take stupidity to not understand that one is physically injuring a human. It only takes a different value system. Valuing one’s own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent. (Indeed, to the extent they aren’t orthogonal it’s because valuing becoming more intelligent tends to imply disvaluing human survival, because humans are hard to control and made of atoms that can be used for other ends, including increased computing power.) This is the whole point of the article we’re commenting on.
Your three main arguments, if I’m understanding them, have been:
Here is part of my stance towards AI risks:
1. I assign a negligible probability to the possibility of a sudden transition from well-behaved narrow AIs to general AIs (see below).
2. An AI will not be pulled at random from mind design space. An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.
3. Commercial, research, or military products are created with efficiency in mind. An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research. If early stages showed that inputs such as the natural language query […] would yield results such as […], then the AI would never reach a stage in which it was sufficiently clever and trained to understand what results would satisfy its creators in order to deceive them.
4. I assign a negligible probability to the possibility of a consequentialist AI / expected utility maximizer / approximation to AIXI.
Given that the kind of AI from point 4 is possible:
5. Omohundro’s AI drives are what make the kind of AIs mentioned in point 1 dangerous. Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.), or should otherwise be easy compared to the general difficulties involved in making an AI work using limited resources.
6. An AI from point 4 will only ever do what it has been explicitly programmed to do. Such an AI is not going to protect its utility-function, acquire resources or preemptively eliminate obstacles in an unbounded fashion, because it is not intrinsically rational to do so. What specifically constitutes rational, economic behavior is inseparable from an agent’s terminal goal. That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.
7. Unintended consequences are by definition not intended. They are not intelligently designed but are detrimental side effects, failures. Intended consequences, by contrast, such as acting intelligently, are intelligently designed. If software were not constantly improved to be better at doing what humans intend it to do, we would never be able to reach a level of sophistication where software could work well enough to outsmart us. To do so it would have to work as intended along a huge number of dimensions. For an AI to constitute a risk as a result of unintended consequences, those unintended consequences would have to have little or no negative influence on the huge number of intended consequences that are necessary for it to be able to overpower humanity.
I haven’t seen any specific reason a self-improving AGI will be particularly difficult to make FOOM...
I am not yet at a point in my education where I can say with confidence that this is the wrong way to think, but I do believe it is.
If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason why he could be wrong? Personally, I would perceive the burden of proof to be on him to show me that the risk is real.
Despite this, I have specific reasons to personally believe that the kind of AI you have in mind is impossible. I have thought about such concepts as consequentialism / expected utility maximization, and I do not see how they could be made to work, other than under very limited circumstances. I have also asked other people outside of LessWrong, who are more educated and smarter than me, and they told me that these kinds of AI are not feasible; they are uncomputable.
But our interesting disagreement seems to be over (c).
I am not sure I understand what you mean by (c). I don’t think I agree with it.
One important difference is that if an AGI optimizes for anything,
I don’t know what this means.
Valuing one’s own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent.
That this black box you call “intelligence” might be useful to achieve a lot of goals is not an argument in support of humans wanting to and succeeding at the implementation of “value to maximize intelligence” in conjunction with “by all means”.
Most definitions of intelligence that I am aware of are in terms of the ability to achieve goals. Saying that a system values becoming more intelligent just means that the system values increasing its ability to achieve its goals. In this context, what you suggest is that humans will want to, and will succeed in, implementing an AI that, in order to beat humans at Tic-tac-toe, is first going to take over the universe and make itself capable of building such things as Dyson spheres.
What I am saying is that it is much easier to create a Tic-tac-toe-playing AI, or an AI that can earn a university degree, than to create either of these in conjunction with the ability to take over the universe and build Dyson spheres.
The argument that valuing not to kill humans is orthogonal to taking over the universe and building Dyson spheres is completely irrelevant.
An AI will not be pulled at random from mind design space.
I don’t think anyone’s ever disputed this. (However, that’s not very useful if the deterministic process resulting in the SI is too complex for humans to distinguish it in advance from the outcome of a random walk.)
An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.
Agreed. But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires. The concern isn’t that we’ll suddenly start building AIs with the express purpose of hitting humans in the face with mallets. The concern is that we’ll code for short-term rather than long-term goals, due to a mixture of disinterest in Friendliness and incompetence at Friendliness. But if intelligence explosion occurs, ‘the long run’ will arrive very suddenly, and very soon. So we need to adjust our research priorities to more seriously assess and modulate the long-term consequences of our technology.
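To make the short-term/long-term gap concrete, here is a minimal toy sketch (purely illustrative; the action names, rewards, and the “safety effort” threshold are invented for this example, not anything from the discussion): an agent that greedily maximizes an immediate-satisfaction proxy does worse on the long-run objective than one that spends a few early steps on work the proxy barely rewards.

```python
# Toy model: a greedy policy that maximizes a short-term proxy at every step
# versus a farsighted policy that sacrifices early proxy reward.
# All names and numbers here are invented for illustration.

def short_term_reward(action):
    # Hypothetical proxy for "satisfying our short-term desires".
    return {"please_now": 1.0, "invest_in_safety": 0.2}[action]

def long_term_value(history):
    # Hypothetical true objective: long-run value collapses to the bare
    # proxy total unless enough early effort went into safety work.
    if history.count("invest_in_safety") >= 3:
        return 100.0
    return sum(short_term_reward(a) for a in history)

greedy = ["please_now"] * 10                           # best proxy reward at every step
farsighted = ["invest_in_safety"] * 3 + ["please_now"] * 7

print(long_term_value(greedy))      # 10.0
print(long_term_value(farsighted))  # 100.0
```

Nothing hinges on the particular numbers; the point is only that per-step proxy scores and long-horizon value can come apart.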
An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research.
That may be a reason to think that recursively self-improving AGI won’t occur. But it’s not a reason to expect such AGI, if it occurs, to be Friendly.
If early stages showed that inputs such as the natural language query […] would yield results such as […]
The seed is not the superintelligence. We shouldn’t expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.
Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.)
I’m not following. Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?
An AI from point 4 will only ever do what it has been explicitly programmed to do.
You don’t seem to be internalizing my arguments. This is just the restatement of a claim I pointed out was not just wrong but dishonestly stated here.
That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.
Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project. This is an empirical discovery about our world; we could have found ourselves in the sort of universe where instrumental goals don’t converge that much, e.g., because once energy’s been locked down into organisms or computer chips you just Can’t convert it into useful work for anything else. In a world where we couldn’t interfere with the AI’s alien goals, nor could our component parts and resources be harvested to build very different structures, nor could we be modified to work for the AI, the UFAI would just ignore us and zip off into space to try and find more useful objects. We don’t live in that world because complicated things can be broken down into simpler things at a net gain in our world, and humans value a specific set of complicated things.
‘These two sets are both infinite’ does not imply ‘we can’t reason about the two sets’ relative sizes, or about how often the same elements recur in them’.
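A toy illustration of the overlap claim above (the terminal goals and instrumental subgoals listed here are invented stand-ins, not anyone’s actual proposal): very different terminal goals can share most of their instrumental requirements, because generic resources transfer between projects.

```python
# Invented example: three unrelated terminal goals, one shared core of
# instrumental subgoals (energy, matter, computation, goal-stability).

instrumental_requirements = {
    "maximize_paperclips": {"acquire_energy", "acquire_matter",
                            "build_compute", "preserve_goal_system"},
    "prove_theorems":      {"acquire_energy", "build_compute",
                            "preserve_goal_system"},
    "tile_with_smileys":   {"acquire_energy", "acquire_matter",
                            "build_compute", "preserve_goal_system"},
}

shared = set.intersection(*instrumental_requirements.values())
print(sorted(shared))
# ['acquire_energy', 'build_compute', 'preserve_goal_system']
```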
I am not yet at a point in my education where I can say with confidence that this is the wrong way to think, but I do believe it is.
If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason why he could be wrong? Personally, I would perceive the burden of proof to be on him to show me that the risk is real.
You’ve spent an awful lot of time writing about the varied ways in which you’ve not yet been convinced by claims you haven’t put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about? I’m not saying to stop talking about this, but there’s plenty of material on a lot of these issues to be found. Have you read Intelligence Explosion Microeconomics?
One important difference is that if an AGI optimizes for anything,
I don’t know what this means.
http://wiki.lesswrong.com/wiki/Optimization_process
succeeding at the implementation of “value to maximize intelligence” in conjunction with “by all means”.
As a rule, adding halting conditions adds complexity to an algorithm, rather than removing complexity.
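A minimal sketch of that point (illustrative only; the function and parameter names are mine): the unbounded search loop is the shorter program, and every resource bound or satisficing threshold is additional machinery that has to be specified and gotten right.

```python
import random

def hill_climb_unbounded(score, x):
    # No halting condition: keep improving forever.
    while True:
        candidate = x + random.uniform(-1, 1)
        if score(candidate) > score(x):
            x = candidate

def hill_climb_bounded(score, x, max_steps=10_000, good_enough=-0.01):
    # Same search, plus an explicit resource bound and a satisficing
    # threshold -- strictly more code and more parameters than above.
    for _ in range(max_steps):
        if score(x) >= good_enough:
            break
        candidate = x + random.uniform(-1, 1)
        if score(candidate) > score(x):
            x = candidate
    return x

# e.g. hill_climb_bounded(lambda v: -abs(v - 3.0), 0.0) stops near v = 3.
```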
Saying that a system values becoming more intelligent just means that the system values increasing its ability to achieve its goals.
No, this is a serious misunderstanding. Yudkowsky’s definition of ‘intelligence’ is about the ability to achieve goals in general, not about the ability to achieve the system’s goals. That’s why you can’t increase a system’s intelligence by lowering its standards, i.e., making its preferences easier to satisfy.
what you suggest is that humans will want to, and will succeed in, implementing an AI that, in order to beat humans at Tic-tac-toe, is first going to take over the universe and make itself capable of building such things as Dyson spheres.
Straw man; no one has claimed that humans are likely to want to create a UFAI. What we’ve suggested is that humans are likely to want to create an algorithm, X, that will turn out to be a UFAI. (In other words, the fallacy you’re committing is confusing intension with extension.)
That aside: Are you saying Dyson spheres wouldn’t be useful for beating more humans at more tic-tac-toe games? Seems like a pretty good way to win at tic-tac-toe to me.
Yudkowsky’s definition of ‘intelligence’ is about the ability to achieve goals in general, not about the ability to achieve the system’s goals. That’s why you can’t increase a system’s intelligence by lowering its standards, i.e., making its preferences easier to satisfy.
Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals, but if your goals are very relaxed then the volume of outcome space with equal or greater utility will be very large. However one would expect that many of the processes involved in hitting a narrow target in outcome space (such that few other outcomes are rated equal or greater in the agent’s preference ordering), such as building a good epistemic model or running on a fast computer, would generalize across many utility functions; this is why we can speak of properties apt to intelligence apart from particular utility functions.
Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals
Hmm. But this just sounds like optimization power to me. You’ve defined intelligence in the past as “efficient cross-domain optimization”. The “cross-domain” part I’ve taken to mean that you’re able to hit narrow targets in general, not just ones you happen to like. So you can become more intelligent by being better at hitting targets you hate, or by being better at hitting targets you like.
The former are harder to test, but something you’d hate doing now could become instrumentally useful to know how to do later. And your intelligence level doesn’t change when the circumstance shifts which part of your skillset is instrumentally useful. For that matter, I’m missing why it’s useful to think that your intelligence level could drastically shift if your abilities remained constant but your terminal values were shifted. (E.g., if you became pickier.)
No, “cross-domain” means that I can optimize across instrumental domains. Like, I can figure out how to go through water, air, or space if that’s the fastest way to my destination, I am not limited to land like a ground sloth.
Measured intelligence shouldn’t shift if you become pickier—if you could previously hit a point such that only 1/1000th of the space was more preferred than it, we’d still expect you to hit around that narrow a volume of the space given your intelligence even if you claimed afterward that a point like that only corresponded to 0.25 utility on your 0-1 scale instead of 0.75 utility due to being pickier ([expected] utilities sloping more sharply downward with increasing distance from the optimum).
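One way to make the “narrow target” talk concrete is a bits-of-optimization measure (my toy formalization of the idea being discussed, not a quotation of anyone’s definition): count what fraction of the outcome space is weakly preferred to the achieved outcome. A monotone rescaling of the utility function, i.e. becoming pickier without changing the preference ordering, leaves that fraction, and hence the measured optimization power, unchanged.

```python
import math
import random

random.seed(0)
outcomes = [random.random() for _ in range(100_000)]  # stand-in outcome space

def utility(x):
    return x                    # original preferences

def pickier_utility(x):
    return x ** 4               # monotone rescaling: same ordering, lower scores

def optimization_power(achieved, u):
    # Fraction of outcomes rated at least as good as the achieved one,
    # expressed in bits.
    frac = sum(u(o) >= u(achieved) for o in outcomes) / len(outcomes)
    return -math.log2(frac)

achieved = sorted(outcomes)[-100]                     # roughly the top 1/1000th
print(optimization_power(achieved, utility))          # ~log2(1000) ≈ 10 bits
print(optimization_power(achieved, pickier_utility))  # identical: ordering unchanged
```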
But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires.
You might not be aware of this, but I wrote a sequence of short blog posts in which I tried to think of concrete scenarios that could lead to human extinction, each of which raised many questions.
The introductory post is ‘AI vs. humanity and the lack of concrete scenarios’.
1. Questions regarding the nanotechnology-AI-risk conjunction
2. AI risk scenario: Deceptive long-term replacement of the human workforce
3. AI risk scenario: Social engineering
4. AI risk scenario: Elite Cabal
5. AI risk scenario: Insect-sized drones
6. AI risks scenario: Biological warfare
What might seem completely obvious to you for reasons that I do not understand, e.g. that an AI can take over the world, appears to me largely like magic (I am not trying to be rude; by magic I only mean that I don’t understand the details). At the very least there are a lot of open questions. This is so even though, for the sake of the above posts, I accepted that the AI is superhuman and can do such things as deceive humans through its superior knowledge of human psychology, which seems to be a non-trivial assumption, to say the least.
That may be a reason to think that recursively self-improving AGI won’t occur. But it’s not a reason to expect such AGI, if it occurs, to be Friendly.
Over and over I told you that given all your assumptions, I agree that AGI is an existential risk.
The seed is not the superintelligence. We shouldn’t expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.
You did not reply to my argument. My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness. My argument did not pertain to the possibility of a friendly seed turning unfriendly.
Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?
What I have been arguing is that an AI should not be expected, by default, to want to eliminate all possible obstructions. There are many gradations here. That, by some economic or other theoretical argument, it might be instrumentally rational for some ideal AI to take over the world does not mean that humans would create such an AI, or that an AI could not be limited to caring about fires in its server farm rather than about the possibility that Russia might nuke the U.S. and thereby destroy its servers.
You don’t seem to be internalizing my arguments.
Did you mean to reply to another point? I don’t see how the reply you linked to is relevant to what I wrote.
Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project.
My argument is that an AI does not need to consider all possible threats or care to acquire all possible resources. Based on its design, it could just want to optimize using its initial resources while only considering mundane threats. I just don’t see real-world AIs concluding that they need to take over the world. I don’t think an AI is likely to be designed that way. I also don’t think such an AI could work, because such inferences would require enormous amounts of resources.
You’ve spent an awful lot of time writing about the varied ways in which you’ve not yet been convinced by claims you haven’t put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about?
I have done what is possible given my current level of education and what I perceive to be useful. I have e.g. asked experts about their opinion.
A few general remarks about the kind of paper you linked to.
How much should I update towards MIRI’s position if I (1) understood the arguments in the paper and (2) found the arguments convincing?
My answer is the following. If the paper was about the abc conjecture, the P versus NP problem, climate change, or even such mundane topics as psychology, I would either not be able to understand the paper, would be unable to verify the claims, or would have very little confidence in my judgement.
So what about ‘Intelligence Explosion Microeconomics’? That I can read most of it at all is only because it is very informally written. The topic itself is more difficult and complex than all of the above-mentioned problems together. Yet the arguments in support of it, to exaggerate a little bit, contain less rigor than the abstract of one of Shinichi Mochizuki’s papers on the abc conjecture.
Which means that my answer is that I should update very little towards MIRI’s position and that any confidence I gain about MIRI’s position is probably highly unreliable.
Thanks. My feeling is that to gain any confidence in what all this technically means, and to answer all the questions this raises, I’d probably need about 20 years of study.
No, this is a serious misunderstanding. Yudkowsky’s definition of ‘intelligence’ is
Here is part of a post exemplifying how I understand the relation between goals and intelligence:
If a goal has very few constraints then the set that satisfies all constraints is very large. A vague and ambiguous goal allows for too much freedom in the sense that a wide range of world states would have the same expected value and therefore imply a very large solution space, since a wide range of AIs will be able to achieve those world states and thereby satisfy the condition of being improved versions of their predecessors.
This means that in order to get an AI to become superhuman at all, and very quickly in particular, you will need to encode a very specific goal against which mistakes, optimization power and achievement can be judged.
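A toy count of how quickly constraints shrink the satisfying set (the bit-string world states and the numbers are invented for illustration): with few constraints almost every state counts as success, so there is little for optimization to be judged against; each additional independent constraint roughly halves the target.

```python
import random

random.seed(1)
N_BITS = 16
states = range(2 ** N_BITS)     # all possible toy "world states"

def satisfies(state, constraints):
    # Each constraint fixes one bit of the state to a required value.
    return all((state >> bit) & 1 == value for bit, value in constraints)

for k in (1, 5, 12):
    constraints = [(bit, random.randint(0, 1)) for bit in range(k)]
    count = sum(satisfies(s, constraints) for s in states)
    print(f"{k:2d} constraints -> {count:5d} of {2 ** N_BITS} states satisfy the goal")
```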
It is really hard to communicate how I perceive this and other discussions about MIRI’s position without offending people, or killing the discussion.
I am saying this in full honesty. The position you appear to support seems so utterly “complex” (far-fetched) that the current arguments are unconvincing.
Here is my perception of the scenario that you try to sell me (exaggerated to make a point). I have a million questions about it that I can’t answer and which your answers either sidestep or explain away by using “magic”.
At this point I have probably made 90% of the people reading this comment incredibly angry. My perception is that you cannot communicate this perception on LessWrong without getting into serious trouble. That’s also what I meant when I told you that I cannot be completely honest if you want to discuss this on LessWrong.
I can also assure you that many people who are much smarter and higher status than me think so as well. Many people communicated the absurdity of all this to me but told me that they would not repeat this in public.
My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness.
Pretending to be friendly when you’re actually not is something that doesn’t even require human level intelligence. You could even do it accidentally.
In general, the appearance of Friendliness at low levels of ability to influence the world doesn’t guarantee actual Friendliness at high levels of ability to influence the world. (If it did, elected politicians would be much higher quality.)
But our interesting disagreement seems to be over (c). Interesting because it illuminates general differences between the basic idea of a domain-general optimization process (intelligence) and the (not-so-)basic idea of Everything Humans Like. One important difference is that if an AGI optimizes for anything, it will have strong reason to steer clear of possible late intelligence defeaters. Late Friendliness defeaters, on the other hand, won’t scare optimization-process-optimizers in general.
But it will scare friendly ones, which will want to keep their values stable.
But, once again, it doesn’t take any stupidity on the AI’s part to disvalue physically injuring a human,
But it will scare friendly ones, which will want to keep their values stable.
Yes. If an AI is Friendly at one stage, then it is Friendly at every subsequent stage. This doesn’t help make almost-Friendly AIs become genuinely Friendly, though.
It takes stupidity to misinterpret friendliness.
Yes, but that’s stupidity on the part of the human programmer, and/or on the part of the seed AI if we ask it for advice. The superintelligence didn’t write its own utility function; the superintelligence may well understand Friendliness perfectly, but that doesn’t matter if it hasn’t been programmed to rewrite its source code to reflect its best understanding of ‘Friendliness’. The seed is not the superintelligence. See: http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/
Yes, but that’s stupidity on the part of the human programmer, and/or on the part of the seed AI if we ask it for advice.
That depends on the architecture. In a Loosemore architecture, the AI interprets high-level directives itself, so if it gets them wrong, that’s its mistake.
… and whose fault is that?
http://lesswrong.com/lw/rf/ghosts_in_the_machine/