Nobody disagrees that an arbitrary agent pulled from mind design space, that is powerful enough to overpower humanity, is an existential risk if it either exhibits Omohundro’s AI drives or is used as a tool by humans, either carelessly or to gain power over other humans.
Disagreeing with that would make about as much sense as claiming that out-of-control self-replicating robots could somehow magically turn the world into a paradise, rather than grey goo.
The disagreement is mainly about the manner in which we will achieve such AIs, how quickly that will happen, and whether such AIs will have these drives.
I actually believe that much less than superhuman general intelligence might be required for humans to cause extinction type scenarios.
Most of my posts specifically deal with the scenario and arguments publicized by MIRI. Those posts are not highly polished papers but attempts to reduce my own confusion and to enable others to provide feedback.
I argue that...
...the idea of a vast mind design space is largely irrelevant, because AIs will be created by humans, which will considerably limit the kind of minds we should expect.
...AIs created by humans do not need to, and will not, exhibit any of Omohundro’s AI drives.
...even given Omohundro’s AI drives, it is not clear how such AIs would arrive at the decision to take over the world.
...there will be no fast transition from largely well-behaved narrow AIs to unbounded general AIs, and humans will be part of any transition.
...any given AI will initially not be intelligent enough to hide any plans for world domination.
...drives as outlined by Omohundro would lead to a dramatic interference with what the AI’s creators want it to do, before it could possibly become powerful enough to deceive or overpower them, and would therefore be noticed in time.
...even if MIRI’s scenario comes to pass, there is a lack of concrete scenarios for how such an AI could possibly take over the world, and the given scenarios raise many questions.
What I, and I believe Richard Loosemore as well, have been arguing, as quoted above, is just one specific point that is not supposed to say much about AI risks in general. Below is a distilled version of what I personally meant:
1. Superhuman general intelligence, obtained by the self-improvement of a seed AI, is a very small target to hit, requiring a very small margin of error.
2. Intelligently designed systems do not behave intelligently as a result of unintended consequences. (See note 1 below.)
3. By steps 1 and 2, for an AI to be able to outsmart humans, humans will have to intend to make an AI capable of outsmarting them and succeed at encoding their intention of making it outsmart them.
4. Intelligence is instrumentally useful, because it enables a system to hit smaller targets in larger and less structured spaces. (See notes 2 and 3.)
5. In order to take over the world a system will have to be able to hit a lot of small targets in very large and unstructured spaces.
6. The intersection of the sets of “AIs in mind design space” and “the first probable AIs to be expected in the near future” contains almost exclusively those AIs that will be designed by humans.
7. By step 6, what an AI is meant to do will very likely originate from humans.
8. It is easier to create an AI that applies its intelligence generally than to create an AI that only uses its intelligence selectively. (See note 4.)
9. An AI equipped with the capabilities required by step 5, given step 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.
10. Therefore the intersection of the sets of “AIs designed by humans” and “dangerous AIs” contains almost exclusively those AIs which are deliberately designed to be dangerous by malicious humans.
Notes
1. Software such as Mathematica will not casually prove the Riemann hypothesis if it has not been programmed to do so. Given intelligently designed software, world states in which the Riemann hypothesis is proven will not be achieved if they were not intended, because the nature of unintended consequences is overall chaotic.
2. As the intelligence of a system increases, the precision of the input that is necessary to make the system do what humans mean it to do decreases. For example, systems such as IBM Watson or Apple’s Siri do what humans mean them to do when fed with a wide range of natural language inputs, while less intelligent systems such as compilers or Google Maps need very specific inputs in order to satisfy human intentions. Increasing the intelligence of Google Maps will enable it to satisfy human intentions by parsing less specific commands.
3. When producing a chair an AI will have to either know the specifications of the chair (such as its size or the material it is supposed to be made of) or else know how to choose a specification from an otherwise infinite set of possible specifications. Given a poorly designed fitness function, or the inability to refine its fitness function, an AI will either (a) not know what to do or (b) not be able to converge on a qualitative solution, if at all, given limited computational resources.
4. For an AI to misinterpret what it is meant to do it would have to selectively suspend using its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI and specify an alternative way to learn what it is meant to do (which takes additional, intentional effort). The reason is that an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possible actions. Such a poorly designed AI will either (a) not do anything at all or (b) not be able to decide what to do before the heat death of the universe, given limited computational resources. Such a poorly designed AI will not even be able to decide if trying to acquire unlimited computational resources was instrumentally rational, because it will be unable to decide if the actions that are required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.
“You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean?”
“If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent.”
It’s relatively easy to get an AI to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.
‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence. Your argument is misleading because it trades on treating this simple phrase as though it were all-or-nothing, a monolith; but all failures for a device to ‘work as intended’ in human history have involved at least some of the intended properties of that device coming to fruition.
It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. As a programmer, being able to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. See the last section of my post above.
I suggest that it’s a straw man to claim that anyone has argued ‘the superintelligence wouldn’t understand what you wanted it to do, if you didn’t program it to fully understand that at the outset’. Do you have evidence that this is a position held by, say, anyone at MIRI? The post you’re replying to points out that the real claim is that the superintelligence won’t care what you wanted it to do, if you didn’t program it to care about the specific right thing at the outset. That makes your criticism seem very much like a change of topic.
Superintelligence may imply an ability to understand instructions, but it doesn’t imply a desire to rewrite one’s utility function to better reflect human values. Any such desire would need to come from the utility function itself, and if we’re worried that humans may get that utility function wrong, then we should also be worried that humans may get the part of the utility function that modifies the utility function wrong.
I suggest that it’s a straw man to claim that anyone has argued ‘the superintelligence wouldn’t understand what you wanted it to do, if you didn’t program it to fully understand that at the outset’. Do you have evidence that this is a position held by, say, anyone at MIRI?
MIRI assumes that programming what you want an AI to do at the outset, Big Design Up Front, is a desirable feature for some reason.
The most common argument is that it is a necessary prerequisite for provable correctness, which is a desirable safety feature. OTOH, the exact opposite of massive hardcoding, goal flexibility, is itself a necessary prerequisite for corrigibility, which is also a desirable safety feature.
The latter point has not been argued against adequately, IMO.
9. An AI equipped with the capabilities required by step 5, given step 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.
I do not dispute that step 10 fails to follow if you reject the premise that the AI will “care” to learn what it is meant to do. But I believe there to be good reasons for an AI created by humans to care.
If you assume that this future software does not care, can you pinpoint when software stops caring?
1. Present-day software is better than previous software generations at understanding and doing what humans mean.
2. There will be future generations of software which will be better than the current generation at understanding and doing what humans mean.
3. If there is better software, there will be even better software afterwards.
4. …
5. Software will be superhumanly good at understanding what humans mean but catastrophically worse than all previous generations at doing what humans mean.
What happens between steps 3 and 5, and how do you justify it?
My guess is that you will write that there will not be a step 4, but instead a sudden transition from narrow AIs to something you call a seed AI, which is capable of making itself superhumanly powerful in a very short time. And as I wrote in the comment you replied to, if I were to accept that assumption, then we would be in full agreement about AI risks. But I reject that assumption. I do not believe such a seed AI to be possible, and I believe that even if it were possible it would not work the way you think it would work. It would have to acquire information about what it is supposed to do, for practical reasons.
Present-day software is a series of increasingly powerful narrow tools and abstractions. None of them encode anything remotely resembling the values of their users. Indeed, present-day software that tries to “do what you mean” is in my experience incredibly annoying and difficult to use, compared to software that simply presents a simple interface to a system with comprehensible mechanics.
Put simply, no software today cares about what you want. Furthermore, your general reasoning process here—define some vague measure of “software doing what you want”, observe an increasing trend line and extrapolate to a future situation—is exactly the kind of reasoning I always try to avoid, because it is usually misleading and heuristic.
Look at the actual mechanics of the situation. A program that literally wants to do what you mean is a complicated thing. No realistic progression of updates to Google Maps, say, gets anywhere close to building an accurate world-model describing its human users, plus having a built-in goal system that happens to specifically identify humans in its model and deduce their extrapolated goals. As EY has said, there is no ghost in the machine that checks your code to make sure it doesn’t make any “mistakes” like doing something the programmer didn’t intend. If it’s not programmed to care about what the programmer wanted, it won’t.
A program that literally wants to do what you mean is a complicated thing. No realistic progression of updates to Google Maps, say, gets anywhere close to building an accurate world-model describing its human users, plus having a built-in goal system that happens to specifically identify humans in its model and deduce their extrapolated goals.
Is it just me, or does this sound like it could grow out of advertisement services? I think it’s the one industry that directly profits from generically modelling what users “want”¹ and then delivering it to them.
[edit] ¹where “want” == “will click on and hopefully buy”
Present-day software is a series of increasingly powerful narrow tools and abstractions.
Do you believe that any kind of general intelligence is practically feasible that is not a collection of powerful narrow tools and abstractions? What makes you think so?
Put simply, no software today cares about what you want.
If all I care about is a list of Fibonacci numbers, what is the difference regarding the word “care” between a simple recursive algorithm and a general AI?
Furthermore, your general reasoning process here—define some vague measure of “software doing what you want”, observe an increasing trend line and extrapolate to a future situation—is exactly the kind of reasoning I always try to avoid, because it is usually misleading and heuristic.
My measure of “software doing what you want” is not vague. I mean it quite literally. If I want software to output a series of Fibonacci numbers, and it does output a series of Fibonacci numbers, then it does what I want.
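To make that literal reading concrete, here is a minimal sketch of the simple recursive algorithm next to a check of whether its output is the output I asked for. The names `fib` and `does_what_i_want` are made up for illustration and are not taken from any existing program.

```python
from typing import List


def fib(n: int) -> int:
    """The simple recursive algorithm: fib(0) = 0, fib(1) = 1."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


def does_what_i_want(output: List[int], wanted: List[int]) -> bool:
    """The literal measure: the software produced exactly the series I wanted."""
    return output == wanted


# What I want: the first ten Fibonacci numbers.
wanted = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
series = [fib(i) for i in range(10)]
print(does_what_i_want(series, wanted))
```

Under this measure the check only compares outputs; it says nothing about how the producing system is built.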
And what other than an increasing trend line do you suggest would be a rational means of extrapolation, sudden jumps and transitions?
Present-day software may not have got far with regard to the evaluative side of doing what you want, but XiXiDu’s point seems to be that it is getting better at the semantic side. Who was it who said the value problem is part of the semantic problem?
Software will be superhumanly good at understanding what humans mean but catastrophically worse than all previous generations at doing what humans mean.
Every bit of additional functionality requires huge amounts of HUMAN development and testing, not in order to compile and run (that’s easy), but in order to WORK AS YOU WANT IT TO.
I can fully believe that a superhuman intelligence examining you will be fully capable of calculating “what you mean”, “what you want”, “what you fear”, “what would be funniest for a BuzzFeed article if I pretended to misunderstand your statement as meaning”, “what would be best for you according to your values”, “what would be best for you according to your cat’s values”, and “what would be best for you according to Genghis Khan’s values”.
No program now cares about what you mean. You’ve still not given any reason for the future software to care about “what you mean” over all those other calculations either.
I kind of doubt that autocorrect software really changed “meditating” to “masturbating”. Because of stuff like this. Edit: And because, starting at the left and working rightward, they only share 1 letter before diverging, and because I’ve seen a spell-checker with special behavior for dirty/curse words (not suggesting them as corrected spellings, but also not complaining about them as unrecognized words) (this is the one spell-checker whose behavior with dirty/curse words I decided to check, out of curiosity, so I bet it’s common). Edit 2: Also, from a causal history perspective of why I doubt it, rather than a normative justification perspective, there’s the fact that Yvain linked it and said something like “I don’t care if these are real.” Edit 3: typo.
No program now cares about what you mean. You’ve still not given any reason for the future software to care about “what you mean” over all those other calculations either.
I agree that current software products fail, such as in your autocorrect example. But how could a seed AI be able to make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
Imagine it would make similar mistakes in any of the problems that it is required to solve in order to overpower humanity. And if humans succeeded in making it not make such mistakes along the way to overpowering humanity, how did they selectively fail at making it not want to overpower humanity in the first place? How likely is that?
But how could a seed AI be able to make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
Those are only ‘mistakes’ if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it’s not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).
A machine programmed to terminally value the outputs of a modern-day autocorrect will never self-modify to improve on that algorithm or its outputs (because that would violate its terminal values). The fact that this seems silly to a human doesn’t provide any causal mechanism for the AI to change its core preferences. Have we successfully coded the AI not to do things that humans find silly, and to prize un-silliness before all other things? If not, then where will that value come from?
A belief can be factually wrong. A non-representational behavior (or dynamic) is never factually right or wrong, only normatively right or wrong. (And that normative wrongness only constrains what actually occurs to the extent the norm is one a sufficiently powerful agent in the vicinity actually holds.)
Maybe that distinction is the one that’s missing. You’re assuming that an AI will be capable of optimizing for true beliefs if and only if it is also optimizing for possessing human norms. But, by the is/ought distinction, there are no true beliefs about the physical world that will spontaneously force a being that holds them to become more virtuous, if it didn’t already have a relevant seed of virtue within itself.
I’m confused as to the reason for the warning/outing, especially since the community seems to be doing an excellent job of dealing with his somewhat disjointed arguments. Downvotes, refutation, or banning in extreme cases are all viable forum-preserving responses. Publishing a dissenter’s name seems at best bad manners and at worst rather crass intimidation.
I only did a quick search on him and although some of the behavior was quite obnoxious, is there anything I’ve missed that justifies this?
XiXiDu wasn’t attempting or requesting anonymity—his LW profile openly lists his true name—and Alexander Kruel is someone with known problems (and a blog openly run under his true name) whom RobbBB might not know offhand was the same person as “XiXiDu” although this is public knowledge, nor might RobbBB realize that XiXiDu had the same irredeemable status as Loosemore.
I would not randomly out an LW poster for purposes of intimidation—I don’t think I’ve ever looked at a username’s associated private email address. Ever. Actually I’m not even sure offhand if our registration process requires/verifies that or not, since I was created as a pre-existing user at the dawn of time.
I do consider RobbBB’s work highly valuable and I don’t want him to feel disheartened by mistakenly thinking that a couple of eternal and irredeemable semitrolls are representative samples. Due to Civilizational Inadequacy, I don’t think it’s possible to ever convince the field of AI or philosophy of anything even as basic as the Orthogonality Thesis, but even I am not cynical enough to think that Loosemore or Kruel are representative samples.
Thanks, Eliezer! I knew who XiXiDu is. (And if I hadn’t, I think the content of his posts makes it easy to infer.)
There are a variety of reasons I find this discussion useful at the moment, and decided to stir it up. In particular, ground-floor disputes like this can be handy for forcing me to taboo inferential-gap-laden ideas and to convert premises I haven’t thought about at much length into actual arguments. But one of my reasons is not ‘I think this is representative of what serious FAI discussions look like (or ought to look like)’, no.
Glad to hear. It is interesting data that you managed to bring in 3 big name trolls for a single thread, considering their previous dispersion and lack of interest.
Thank you for the clarification. While I have a certain hesitance to throw around terms like “irredeemable”, I do understand the frustration with a certain, let’s say, overconfident and persistent brand of misunderstanding and how difficult it can be to maintain a public forum in its presence.
My one suggestion is that, if the goal was to avoid RobbBB’s (wonderfully high-quality comments, by the way) confusion, a private message might have been better. If the goal was more generally to minimize the confusion for those of us who are newer or less versed in LessWrong lore, more description might have been useful (“a known and persistent troll” or whatever) rather than just providing a name from the enemies list.
Though actually, Eliezer used similar phrasing regarding Richard Loosemore and got downvoted for it (not just by me). Admittedly, “persistent troll” is less extreme than “permanent idiot,” but even so, the statement could be phrased to be more useful.
I’d suggest, “We’ve presented similar arguments to [person] already, and [he or she] remained unconvinced. Ponder carefully before deciding to spend much time arguing with [him or her].”
Not only is it less offensive this way, it does a better job of explaining itself. (Note: the “ponder carefully” section is quoting Eliezer; that part of his post was fine.)
Those are only ‘mistakes’ if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it’s not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).
You will see a grammatical error as a mistake if you value grammar in general, or if you value being right in general.
A self-improving AI needs a goal. A goal of self-improvement alone would work. A goal of getting things right in general would work too, and be much safer, as it would include getting our intentions right as a sub-goal.
Although since “self-improvement” in this context basically refers to “improving your ability to accomplish goals”...
You will see a grammatical error as a mistake if you value grammar in general, or if you value being right in general.
Stop me if this is a non sequitur, but surely “having accurate beliefs” and “acting on those beliefs in a particular way” are completely different things? I haven’t really been following this conversation, though.
But how could a seed AI be able to make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
As Robb said, you’re confusing mistake in the sense of “The program is doing something we don’t want it to do” with mistake in the sense of “The program has wrong beliefs about reality”.
I suppose a different way of thinking about these is “A mistaken human belief about the program” vs “A mistaken computer belief about the human”. We keep talking about the former (the program does something we didn’t know it would do), and you keep treating it as if it’s the latter.
Let’s say we have a program (not an AI, just a program) which uses Newton’s laws in order to calculate the trajectory of a ball. We want it to calculate this in order to have it move a tennis racket and hit the ball back.
When it finally runs, we observe that the program always avoids the ball rather than hit it back. Is it because it’s calculating the trajectory of the ball wrongly? No, it calculates the trajectory very well indeed, it’s just that an instruction in the program was wrongly inserted so that the end result is “DO NOT hit the ball back”.
It knows what the “trajectory of the ball” is. It knows what “hit the ball” is. But its program is “DO NOT hit the ball” rather than “hit the ball”. Why? Because of a mistaken human belief about what the program would do, not a mistaken belief of the program.
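A rough sketch of that tennis program, assuming drag-free Newtonian motion over flat ground. Every name here (`Ball`, `landing_x`, `racket_target`, `DODGE`) is invented for the illustration. The trajectory calculation is correct; the only fault is the last instruction, which moves the racket away from the predicted landing point.

```python
from dataclasses import dataclass

G = 9.81     # gravitational acceleration in m/s^2
DODGE = 2.0  # the wrongly inserted offset, in metres (hypothetical value)


@dataclass
class Ball:
    x: float   # horizontal position (m)
    y: float   # height above the ground (m)
    vx: float  # horizontal velocity (m/s)
    vy: float  # vertical velocity (m/s)


def landing_x(ball: Ball) -> float:
    """Predict where the ball reaches height 0 (no air drag)."""
    # Positive root of y + vy*t - 0.5*G*t^2 = 0.
    t = (ball.vy + (ball.vy ** 2 + 2 * G * ball.y) ** 0.5) / G
    return ball.x + ball.vx * t


def racket_target(ball: Ball) -> float:
    """Decide where to put the racket."""
    predicted = landing_x(ball)  # the program's belief is accurate
    # Intended instruction: return predicted
    # Actual instruction: step aside, i.e. "DO NOT hit the ball".
    return predicted + DODGE


ball = Ball(x=0.0, y=1.0, vx=5.0, vy=3.0)
print(landing_x(ball), racket_target(ball))
```

No improvement to `landing_x` changes the outcome, because the fault is not in the program's model of the ball but in what it was told to do with that model.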
To be better able to respond to your comment, please let me know in what way you disagree with the following comparison between narrow AI and general AI:
Narrow artificial intelligence will be denoted NAI and general artificial intelligence GAI.
(1) Is it in principle capable of behaving in accordance with human intention to a sufficient degree?
NAI: True
GAI: True
(2) Under what circumstances does it fail to behave in accordance with human intention?
NAI: If it is broken, where broken stands for a wide range of failure modes such as incorrectly managing memory allocations.
GAI: In all cases in which it is not mathematically proven to be tasked with the protection of, and equipped with, a perfect encoding of all human values or a safe way to obtain such an encoding.
(3) What happens when it fails to behave in accordance with human intention?
NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash.
GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do.
(4) What happens if it is bound to use a limited amount of resources, use a limited amount of space or run for a limited amount of time?
NAI: It will only ever do what it was programmed to do. As long as there is no fatal flaw, harming its general functionality, it will work within the defined boundaries as intended.
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
Please let me also know where you disagree with the following points:
(1) The abilities of systems are part of human preferences as humans intend to give systems certain capabilities and, as a prerequisite to build such systems, have to succeed at implementing their intentions.
(2) Error detection and prevention is such a capability.
(3) Something that is not better than humans at preventing errors is no existential risk.
(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.
(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
GAI is a program. It always does what it’s programmed to do. That’s the problem—a program that was written incorrectly will generally never do what it was intended to do.
FWIW, I find your statements 3, 4, and 5 also highly objectionable, on the grounds that you are lumping a large class of things under the blanket label “errors”. Is an “error” doing something that humans don’t want? Is it doing something the agent doesn’t want? Is it accidentally mistyping a letter in a program, causing a syntax error, or thinking about something heuristically and coming to the wrong conclusion, then making a carefully planned decision based on that mistake? Automatic proof systems don’t save you if what you think you need to prove isn’t actually what you need to prove.
GAI is a program. It always does what it’s programmed to do. That’s the problem—a program that was written incorrectly will generally never do what it was intended to do.
So self-correcting software is impossible. Is self improving software possible?
Self-correcting software is possible if there’s a correct implementation of what “correctness” means, and the module that has the correct implementation has control over the modules that don’t have the correct implementation.
Self-improving software is likewise possible if there’s a correct implementation of the definition of “improvement”.
Right now, I’m guessing that it’d be relatively easy to programmatically define “performance improvement” and difficult to define “moral and ethical improvement”.
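To make the "module with the correct implementation has control" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration (`is_sorted`, `buggy_sort`, `supervised`); it is not anyone's actual architecture, just a toy where a checker that encodes "correctness" gets the final say over a worker that may be wrong.

```python
from typing import Callable, List

def is_sorted(xs: List[int]) -> bool:
    """The 'correctness' definition used by the supervisor: non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

def buggy_sort(xs: List[int]) -> List[int]:
    """A deliberately wrong worker module: it returns its input unchanged."""
    return list(xs)

def supervised(
    worker: Callable[[List[int]], List[int]],
    fallback: Callable[[List[int]], List[int]],
    check: Callable[[List[int]], bool],
) -> Callable[[List[int]], List[int]]:
    """Wrap a worker so the checker decides whether its output is accepted."""
    def run(xs: List[int]) -> List[int]:
        out = worker(xs)
        # The checker has control: a rejected output is replaced by the fallback.
        return out if check(out) else fallback(xs)
    return run

safe_sort = supervised(buggy_sort, sorted, is_sorted)
print(safe_sort([3, 1, 2]))
```

Note that `is_sorted` only checks ordering. A worker that returned an empty list would pass the check, so the whole arrangement is only as good as the definition of "correct" it is given.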
1) Poorly defined terms “human intention” and “sufficient”.
2) Possibly under any circumstances whatsoever, if it’s anything like other non-trivial software, which always has some bugs.
3) Anything from “you may not notice” to “catastrophic failure resulting in deaths”. Claim that failure of software to work as humans intend will “generally fail in a way that is harmful to its own functioning” is unsupported. E.g. a spreadsheet works fine if the floating point math is off in the 20th bit of the mantissa. The answers will be wrong, but there is nothing about that that the spreadsheet could be expected to care about.
4) Not necessarily. GAI may continue to try to do what it was programmed to do, and only unintentionally destroy a small city in the process :)
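The spreadsheet point in 3) is easy to reproduce. A small Python sketch, assuming ordinary IEEE 754 double-precision floats (the error here is far smaller than "the 20th bit of the mantissa", but the behaviour is the same in kind): the arithmetic is slightly wrong, nothing crashes, and nothing in the program notices. `column_total` is a hypothetical stand-in for a spreadsheet SUM cell.

```python
def column_total(values):
    """Add up a column the way a naive spreadsheet cell might."""
    total = 0.0
    for v in values:
        total += v
    return total

result = column_total([0.1, 0.2])

# The sum is not exactly 0.3, yet no exception is raised and the
# program carries on as if nothing had happened.
print(result == 0.3)
print(abs(result - 0.3) < 1e-9)
```

The first comparison is false and the second is true: the answer is wrong by a hair, and the program's own functioning is entirely unharmed.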
Second list:
1) Wrong. The abilities of sufficiently complex systems are a huge space of events humans haven’t thought about yet, and so do not yet have preferences about. There is no way to know what their preferences would or should be for many many outcomes.
2) Error as failure to perform the requested action may take precedence over error as failure to anticipate hypothetical objections from some humans to something they hadn’t expected. For one thing, it is more clearly defined. We already know human-level intelligences act this way.
3) Asteroids and supervolcanoes are not better than humans at preventing errors. It is perfectly possible for something stupid to be able to kill you. Therefore something with greater cognitive and material resources than you, but still with the capacity to make mistakes can certainly kill you. For example, a government.
4) It is already possible for a very fallible human to make something that is better than humans at detecting certain kinds of errors.
5) No. Unless by dramatic you mean “impossibly perfect, magical and universal”.
(3) What happens when it fails to behave in accordance with human intention?
NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash.
GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do.
Firstly, “fails in a way that is harmful to its own functioning” appears to be tautological.
Secondly, you seem to be listing things that apply to any kind of AI in the NAI section—is this intentional? (This happens throughout your comment, in fact.)
Software that initially appears to care what you mean will be selected by market forces. But nearly all software that superficially looks Friendly isn’t Friendly. If there are seasoned AI researchers who can’t wrap their heads around the five theses, then how can I be confident that the Invisible Hand will both surpass them intellectually and recurrently sacrifice short-term gains on this basis?
Software that looks friendly isn’t really friendly in the sense that it really understands what we want. It isn’t dangerously unfriendly because we’re still here. If it’s commercially successful, it’s friendly enough for us to want it in our lives.
Human beings aren’t friendly, in the Friendly-AI sense. If a random human acquired immense power, it would probably result in an existential catastrophe. Humans do have a better sense of human value than, say, a can-opener does; they have more power and autonomy than a can-opener, so they need fuller access to human values in order to reach similar safety levels. A superintelligent AI would require even more access to human values to reach comparable safety levels.
If you grafted absolute power onto a human with average ethical insight, you might get absolute corruption. But what is that analogous to in AI terms? Why assume asymmetric development by default?
If you assume a top-down singleton AI with a walled-off ethics module, things look difficult. If you reverse these assumptions, FAI is already happening.
XiXiDu, I get the impression you’ve never coded anything. Is that accurate?
Present-day software is better than previous software generations at understanding and doing what humans mean.
Increasing the intelligence of Google Maps will enable it to satisfy human intentions by parsing less specific commands.
Present-day everyday software (e.g. Google Maps, Siri) is better at doing what humans mean. It is not better at understanding humans. Learning programs like the one that runs PARO appear to be good at understanding humans, but are actually following a very simple utility function (in the decision sense, not the experiential sense); they change their behaviour in response to programmed cues, generally by doing more/less of actions associated with those cues (example: PARO “likes” being stroked and will do more of things that tend to precede stroking). In each case of a program that improves itself, it has a simple thing it “wants” to optimise and makes changes according to how well it seems to be doing.
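As a rough illustration of how little is going on in that kind of learning, here is a toy sketch in Python. It is not PARO's actual code, which I have not seen; the action names, the `cue` convention (+1 for being stroked, -1 for a negative cue) and the step size are all assumptions.

```python
def update(weights, action, cue, step=1.0):
    """Return new weights: do more of an action followed by a positive cue,
    less of an action followed by a negative one."""
    new_weights = dict(weights)
    new_weights[action] += step * cue
    return new_weights

def choose(weights):
    """Pick whichever action currently has the highest weight."""
    return max(weights, key=weights.get)

weights = {"blink": 0.0, "wag_tail": 0.0, "squeak": 0.0}
weights = update(weights, "wag_tail", cue=+1)  # stroked after wagging
weights = update(weights, "squeak", cue=-1)    # negative cue after squeaking
print(choose(weights))
```

The program ends up wagging its tail more often, which looks like it has learned what its owner likes, but all it holds is a table of numbers keyed by action.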
Making software that understands humans at all is beyond our current capabilities. Theory of mind, the ability to recognise agents and see them as having desires of their own, is something we have no idea how to produce; we don’t even know how humans have it. General intelligence is an enormous step beyond programming something like Siri. Siri is “just” interpreting vocal commands as text (which requires no general intelligence), matching that to a list of question structures (which requires no general intelligence; Siri does not have to understand what the word “where” means to know that Google Maps may be useful for that type of question) and delegating to Web services, with a layer of learning code to produce more of the results you liked (i.e., that made you stop asking related questions) in the past. Siri is using a very small built-in amount of knowledge and an even smaller amount of learned knowledge to fake understanding, but it’s just pattern-matching. While the second step is the root of general intelligence, it’s almost all provided by humans who understood that “where” means a question is probably to do with geography; Siri’s ability to improve this step is virtually nonexistent.
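The second step, matching a command against a human-written list, can be sketched in a few lines of Python. This is a caricature under stated assumptions, not Siri's real design: the `ROUTES` table, the service names and the `route` function are invented for illustration.

```python
import string

# Hypothetical keyword table written by a human who understood the words.
ROUTES = [
    ("where", "maps"),
    ("weather", "weather_service"),
    ("remind", "reminders"),
]

def route(command, default="web_search"):
    """Return the service for the first known keyword found in the command."""
    cleaned = command.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    for keyword, service in ROUTES:
        if keyword in words:
            return service
    return default

print(route("Where is the nearest cafe?"))
print(route("Where did I go wrong in life?"))
```

Both commands go to the maps service. The code has no idea that the second question is not about geography, because all of the "understanding" lives in the table a human typed in.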
catastrophically worse than all previous generations at doing what humans mean
The more powerful something is, the more dangerous it is. A very stupid adult is much more dangerous than a very intelligent child because adults are allowed to drive cars. Driving a car requires very little intelligence and no general intelligence whatsoever (we already have robots that can do a pretty good job), but can go catastrophically wrong very easily. Holding an intelligent conversation requires huge amounts of specialised intelligence and often requires general intelligence, but nothing a four-year-old says is likely to kill people.
It’s much easier to make a program that does a good job at task-completion, and is therefore given considerable power and autonomy (Siri, for example), than it is to make sure that the program never does stupid things with its power. Developing software we already have could easily lead to programs being assigned large amounts of power (e.g., “Siri 2, buy me a ticket to New York”, which would almost always produce the appropriate kind of ticket), but I certainly wouldn’t trust such programs to never make colossal screw-ups. (Siri 2 will only tell you that you can’t afford a ticket if a human programmer thought that might be important, because Siri 2 does not care that you need to buy groceries, because it does not understand that you exist.)
I hope I have convinced you that present software only fakes understanding and that developing it will not produce software that can do better than an intelligent human with the same resources. Siri 2 will not be more than a very useful tool, and neither will Siri 5. Software does not stop caring because it has never cared.
It is very easy (relatively speaking) to produce code that can fake understanding and act like it cares about your objectives, because this merely requires a good outline of the sort of things the code is likely to be wanted for. (This is the second stage of Siri outlined above, where Siri refers to a list saying that “where” means that Google Maps is probably the best service to outsource to.) Making code that does more of the things that get good results is also very easy.
Making code that actually cares requires outlining exactly what the code is really and truly wanted to do. You can’t delegate this step by saying “Learn what I care about and then satisfy me” because that’s just changing what you want the code to do. It might or might not be easier than saying “This is what I care about, satisfy me”, but at some stage you have to say what you want done exactly right or the code will do something else. (Currently getting it wrong is pretty safe because computers have little autonomy and very little general intelligence, so they mostly do nothing much; getting it wrong with a UFAI is dangerous because the AI will succeed at doing the wrong thing, probably on a big scale.) This is the only kind of code you can trust to program itself and to have significant power, because it’s the only kind that will modify itself right.
You can’t progress Siri into an FAI, no matter how much you know about producing general intelligence. You need to know either Meaning-in-General, Preferences-in-General or exactly what Human Preferences are, or you won’t get what you hoped for.
Another perspective: the number of humans in history who were friendly is very, very small. The number of humans who are something resembling capital-F Friendly is virtually nil. Why should “an AI created by humans to care” be Friendly, or even friendly? Unless friendliness or Friendliness is your specific goal, you’ll probably produce software that is friendly-to-the-maker (or maybe Friendly-to-the-maker, if making Friendly code really is as easy as you seem to think). Who would you trust with a superintelligence that did exactly what they said? Who would you trust with a superintelligence that did exactly what they really wanted, not what they said? I wouldn’t trust my mother with either, and she’s certainly highly intelligent and has my best interests at heart. I’d need a fair amount of convincing to trust me with either. Most humans couldn’t program AIs that care because most humans don’t care themselves, let alone know how to express it.
Making software that understands humans at all is beyond our current capabilities.
So you believe that “understanding” is an all-or-nothing capability? I never intended to use “understanding” like this. My use of the term is such that if my speech recognition software correctly transcribes 98% of what I am saying, then it is better at understanding how certain sounds are related to certain strings of characters than software that correctly transcribes 95% of what I said.
General intelligence is an enormous step beyond programming something like Siri.
One enormous step or a huge number of steps? If the former, what makes you think so? If the latter, then at what point do better versions of Siri start acting in catastrophic ways?
Siri is using a very small built-in amount of knowledge and an even smaller amount of learned knowledge to fake understanding, but it’s just pattern-matching. While the second step is the root of general intelligence, it’s almost all provided by humans who understood that “where” means a question is probably to do with geography;
Most of what humans understand is provided by other humans who themselves got another cruder version from other humans.
It’s much easier to make a program that does a good job at task-completion, and is therefore given considerable power and autonomy (Siri, for example), than it is to make sure that the program never does stupid things with its power.
If an AI is not supposed to take over the world, then from the perspective of humans it is mistaken to take over the world. Humans got something wrong about the AI design if it takes over the world. Now if it needs to solve a minimum of N problems correctly in order to take over the world, then this means that it succeeded N times at being generally intelligent while executing a stupid thing. The question that arises here is whether it is more likely for humans to build an AI that works perfectly well along a number of dimensions at doing a stupid thing than an AI that fails at doing a stupid thing because it does other stupid things as well.
Developing software we already have could easily lead to programs being assigned large amounts of power (e.g., “Siri 2, buy me a ticket to New York”, which would almost always produce the appropriate kind of ticket), but I certainly wouldn’t trust such programs to never make colossal screw-ups.
Sure, I do not disagree with this at all. AI will very likely lead to catastrophic events. I merely disagree with the dumb superintelligence scenario.
...getting it wrong with a UFAI is dangerous because the AI will succeed at doing the wrong thing, probably on a big scale.
In other words, humans are likely to fail at AI in such a way that it works perfectly well in a catastrophic way.
Another perspective: the number of humans in history who were friendly is very, very small.
I certainly do not reject that general AI is extremely dangerous in the hands of unfriendly humans and that only a friendly AI that takes over the world could eventually prevent a catastrophe. I am rejecting the dumb superintelligence scenario.
Nobody disagrees that an arbitrary agent pulled from mind design space, that is powerful enough to overpower humanity, is an existential risk if it either exhibits Omohundro’s AI drives or is used as a tool by humans, either carelessly or to gain power over other humans.
Disagreeing with that would about make as much sense as claiming that out-of-control self-replicating robots could somehow magically turn the world into a paradise, rather than grey goo.
The disagreement is mainly about the manner in which we will achieve such AIs, how quickly that will happen, and whether such AIs will have these drives.
I actually believe that much less than superhuman general intelligence might be required for humans to cause extinction type scenarios.
Most of my posts specifically deal with the scenario and arguments publicized by MIRI. Those posts are not highly polished papers but attempts to reduce my own confusion and to enable others to provide feedback.
I argue that...
...the idea of a vast mind design space is largely irrelevant, because AIs will be created by humans, which will considerably limit the kind of minds we should expect.
...that AIs created by humans do not need to, and will not exhibit any of Omohundro’s AI drives.
...that even given Omohundro’s AI drives, it is not clear how such AIs would arrive at the decision to take over the world.
...that there will be no fast transition from largely well-behaved narrow AIs to unbounded general AIs, and that humans will be part of any transition.
...that any given AI will initially not be intelligent enough to hide any plans for world domination.
...that drives as outlined by Omohundro would lead to a dramatic interference with what the AI’s creators want it to do, before it could possibly become powerful enough to deceive or overpower them, and would therefore be noticed in time.
...that even if MIRI’s scenario comes to pass, there is a lack of concrete scenarios on how such an AI could possibly take over the world, and that the given scenarios raise many questions.
There are a lot more points of disagreement.
What I, and I believe Richard Loosemore as well, have been arguing, as quoted above, is just one specific point that is not supposed to say much about AI risks in general. Below is a distilled version of what I personally meant:
1. Superhuman general intelligence, obtained by the self-improvement of a seed AI, is a very small target to hit, requiring a very small margin of error.
2. Intelligently designed systems do not behave intelligently as a result of unintended consequences. (See note 1 below.)
3. By step 1 and 2, for an AI to be able to outsmart humans, humans will have to intend to make an AI capable of outsmarting them and succeed at encoding their intention of making it outsmart them.
4. Intelligence is instrumentally useful, because it enables a system to hit smaller targets in larger and less structured spaces. (See note 2, 3.)
5. In order to take over the world a system will have to be able to hit a lot of small targets in very large and unstructured spaces.
6. The intersection of the sets of “AIs in mind design space” and “the first probable AIs to be expected in the near future” contains almost exclusively those AIs that will be designed by humans.
7. By step 6, what an AI is meant to do will very likely originate from humans.
8. It is easier to create an AI that applies its intelligence generally than to create an AI that only uses its intelligence selectively. (See note 4.)
9. An AI equipped with the capabilities required by step 5, given step 7 and 8, will very likely not be confused about what it is meant to do, if it was not meant to be confused.
10. Therefore the intersection of the sets of “AIs designed by humans” and “dangerous AIs” contains almost exclusively those AIs which are deliberately designed to be dangerous by malicious humans.
Notes
1. Software such as Mathematica will not casually prove the Riemann hypothesis if it has not been programmed to do so. Given intelligently designed software, world states in which the Riemann hypothesis is proven will not be achieved if they were not intended because the nature of unintended consequences is overall chaotic.
2. As the intelligence of a system increases, the precision of the input that is necessary to make the system do what humans mean it to do decreases. For example, systems such as IBM Watson or Apple’s Siri do what humans mean them to do when fed with a wide range of natural language inputs, while less intelligent systems such as compilers or Google Maps need very specific inputs in order to satisfy human intentions. Increasing the intelligence of Google Maps will enable it to satisfy human intentions by parsing less specific commands.
3. When producing a chair an AI will have to either know the specifications of the chair (such as its size or the material it is supposed to be made of) or else know how to choose a specification from an otherwise infinite set of possible specifications. Given a poorly designed fitness function, or the inability to refine its fitness function, an AI will either (a) not know what to do or (b) will not be able to converge on a qualitative solution, if at all, given limited computational resources.
4. For an AI to misinterpret what it is meant to do it would have to selectively suspend using its ability to derive exact meaning from fuzzy meaning, which is a significant part of general intelligence. This would require its creators to restrict their AI and specify an alternative way to learn what it is meant to do (which takes additional, intentional effort). That is because an AI that does not know what it is meant to do, and which is not allowed to use its intelligence to learn what it is meant to do, would have to choose its actions from an infinite set of possible actions. Such a poorly designed AI will either (a) not do anything at all or (b) will not be able to decide what to do before the heat death of the universe, given limited computational resources. Such a poorly designed AI will not even be able to decide if trying to acquire unlimited computational resources was instrumentally rational because it will be unable to decide if the actions that are required to acquire those resources might be instrumentally irrational from the perspective of what it is meant to do.
This mirrors some comments you wrote recently:
It’s relatively easy to get an AI to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.
‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence. Your argument is misleading because it trades on treating this simple phrase as though it were all-or-nothing, a monolith; but all failures for a device to ‘work as intended’ in human history have involved at least some of the intended properties of that device coming to fruition.
It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. As a programmer, being able to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. See the last section of my post above.
I suggest that it’s a straw man to claim that anyone has argued ‘the superintelligence wouldn’t understand what you wanted it to do, if you didn’t program it to fully understand that at the outset’. Do you have evidence that this is a position held by, say, anyone at MIRI? The post you’re replying to points out that the real claim is that the superintelligence won’t care what you wanted it to do, if you didn’t program it to care about the specific right thing at the outset. That makes your criticism seem very much like a change of topic.
Superintelligence may imply an ability to understand instructions, but it doesn’t imply a desire to rewrite one’s utility function to better reflect human values. Any such desire would need to come from the utility function itself, and if we’re worried that humans may get that utility function wrong, then we should also be worried that humans may get the part of the utility function that modifies the utility function wrong.
MIRI assumes that programming what you want an AI to do at the outset, Big Design Up Front, is a desirable feature for some reason.
The most common argument is that it is a necessary prerequisite for provable correctness, which is a desirable safety feature. OTOH, the exact opposite of massive hardcoding, goal flexibility, is itself a necessary prerequisite for corrigibility, which is itself a desirable safety feature.
The latter point has not been argued against adequately, IMO.
“The genie knows, but doesn’t care”
It’s like you haven’t read the OP at all.
I do not dispute that step 10 does not follow if you reject that the AI will “care” to learn what it is meant to do. But I believe there to be good reasons for an AI created by humans to care.
If you assume that this future software does not care, can you pinpoint when software stops caring?
1. Present-day software is better than previous software generations at understanding and doing what humans mean.
2. There will be future generations of software which will be better than the current generation at understanding and doing what humans mean.
3. If there is better software, there will be even better software afterwards.
4. …
5. Software will be superhumanly good at understanding what humans mean but catastrophically worse than all previous generations at doing what humans mean.
What happens between step 3 and 5, and how do you justify it?
My guess is that you will write that there will not be a step 4, but instead a sudden transition from narrow AIs to something you call a seed AI, which is capable of making itself superhumanly powerful in a very short time. And as I wrote in the comment you replied to, if I were to accept that assumption, then we would be in full agreement about AI risks. But I reject that assumption. I do not believe such a seed AI to be possible and believe that even if it were possible it would not work the way you think it would work. It would have to acquire information about what it is supposed to do, for practical reasons.
Present day software is a series of increasingly powerful narrow tools and abstractions. None of them encode anything remotely resembling the values of their users. Indeed, present-day software that tries to “do what you mean” is in my experience incredibly annoying and difficult to use, compared to software that simply presents a simple interface to a system with comprehensible mechanics.
Put simply, no software today cares about what you want. Furthermore, your general reasoning process here—define some vague measure of “software doing what you want”, observe an increasing trend line and extrapolate to a future situation—is exactly the kind of reasoning I always try to avoid, because it is usually misleading and heuristic.
Look at the actual mechanics of the situation. A program that literally wants to do what you mean is a complicated thing. No realistic progression of updates to Google Maps, say, gets anywhere close to building an accurate world-model describing its human users, plus having a built-in goal system that happens to specifically identify humans in its model and deduce their extrapolated goals. As EY has said, there is no ghost in the machine that checks your code to make sure it doesn’t make any “mistakes” like doing something the programmer didn’t intend. If it’s not programmed to care about what the programmer wanted, it won’t.
Is it just me, or does this sound like it could grow out of advertisement services? I think it’s the one industry that directly profits from generically modelling what users “want”¹ and then delivering it to them.
[edit] ¹where “want” == “will click on and hopefully buy”
Do you believe that any kind of general intelligence is practically feasible that is not a collection of powerful narrow tools and abstractions? What makes you think so?
If all I care about is a list of Fibonacci numbers, what is the difference regarding the word “care” between a simple recursive algorithm and a general AI?
My measure of “software doing what you want” is not vague. I mean it quite literally. If I want software to output a series of Fibonacci numbers, and it does output a series of Fibonacci numbers, then it does what I want.
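For reference, the "simple recursive algorithm" in question is about this big. A minimal Python sketch; the names `fib` and `first_fibs` are just illustrative.

```python
def fib(n):
    """Return the n-th Fibonacci number by naive recursion (fib(0) == 0)."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def first_fibs(count):
    """Return the first `count` Fibonacci numbers as a list."""
    return [fib(i) for i in range(count)]

print(first_fibs(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

If a list of Fibonacci numbers is all I want, this program does what I want in exactly the literal sense described above, with no model of me anywhere in it.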
And what other than an increasing trend line do you suggest would be a rational means of extrapolation, sudden jumps and transitions?
Present day software may not have got far with regard to the evaluative side of doing what you want, but XiXiDu’s point seems to be that it is getting better at the semantic side. Who was it who said the value problem is part of the semantic problem?
http://www.buzzfeed.com/jessicamisener/the-30-most-hilarious-autocorrect-struggles-ever
No fax or photocopier ever autocorrected your words from “meditating” to “masturbating”.
Every bit of additional functionality requires huge amounts of HUMAN development and testing, not in order to compile and run (that’s easy), but in order to WORK AS YOU WANT IT TO.
I can fully believe that a superhuman intelligence examining you will be fully capable of calculating “what you mean”, “what you want”, “what you fear”, “what would be funniest for a buzzfeed article if I pretended to misunderstand your statement as meaning”, “what would be best for you according to your values”, “what would be best for you according to your cat’s values”, “what would be best for you according to Genghis Khan’s values”.
No program now cares about what you mean. You’ve still not given any reason for the future software to care about “what you mean” over all those other calculations either.
I kind of doubt that autocorrect software really changed “meditating” to “masturbating”. Because of stuff like this. Edit: And because, starting at the left and working rightward, they only share 1 letter before diverging, and because I’ve seen a spell-checker with special behavior for dirty/curse words (Not suggesting them as corrected spellings, but also not complaining about them as unrecognized words) (this is the one spell-checker which, out of curiosity, I decided to check its behavior with dirty/curse words, so I bet it’s common). Edit 2: Also from a causal history perspective of why I doubt it, rather than a normative justification perspective, there’s the fact that Yvain linked it and said something like “I don’t care if these are real.” Edit 3: typo.
To be fair, that is a fairly representative example of bad autocorrects. (I once had a text message autocorrect to “We are terrorist.”)
Meaning they don’t care about anything? They care about something else? What?
I’ll tell you one thing: the marketplace will select agents that act as if they care.
I agree that current software products fail, such as in your autocorrect example. But how could a seed AI be able to make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
Imagine it would make similar mistakes in any of the problems that it is required to solve in order to overpower humanity. And if humans succeeded in making it not make such mistakes along the way to overpowering humanity, how did they selectively fail at making it not want to overpower humanity in the first place? How likely is that?
Those are only ‘mistakes’ if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it’s not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).
A machine programmed to terminally value the outputs of a modern-day autocorrect will never self-modify to improve on that algorithm or its outputs (because that would violate its terminal values). The fact that this seems silly to a human doesn’t provide any causal mechanism for the AI to change its core preferences. Have we successfully coded the AI not to do things that humans find silly, and to prize un-silliness before all other things? If not, then where will that value come from?
A belief can be factually wrong. A non-representational behavior (or dynamic) is never factually right or wrong, only normatively right or wrong. (And that normative wrongness only constrains what actually occurs to the extent the norm is one a sufficiently powerful agent in the vicinity actually holds.)
Maybe that distinction is the one that’s missing. You’re assuming that an AI will be capable of optimizing for true beliefs if and only if it is also optimizing for possessing human norms. But, by the is/ought distinction, there are no true beliefs about the physical world that will spontaneously force a being that believes them to become more virtuous, if it didn’t already have a relevant seed of virtue within itself.
It also looks like user Juno_Watt is some type of systematic troll, probably a sockpuppet for someone else, haven’t bothered investigating who.
I can’t work out how this relates to the thread it appears in.
Warning as before: XiXiDu = Alexander Kruel.
I’m confused as to the reason for the warning/outing, especially since the community seems to be doing an excellent job of dealing with his somewhat disjointed arguments. Downvotes, refutation, or banning in extreme cases are all viable forum-preserving responses. Publishing a dissenter’s name seems at best bad manners and at worst rather crass intimidation.
I only did a quick search on him and although some of the behavior was quite obnoxious, is there anything I’ve missed that justifies this?
XiXiDu wasn’t attempting or requesting anonymity—his LW profile openly lists his true name—and Alexander Kruel is someone with known problems (and a blog openly run under his true name) whom RobbBB might not know offhand was the same person as “XiXiDu” although this is public knowledge, nor might RobbBB realize that XiXiDu had the same irredeemable status as Loosemore.
I would not randomly out an LW poster for purposes of intimidation—I don’t think I’ve ever looked at a username’s associated private email address. Ever. Actually I’m not even sure offhand if our registration process requires/verifies that or not, since I was created as a pre-existing user at the dawn of time.
I do consider RobbBB’s work highly valuable and I don’t want him to feel disheartened by mistakenly thinking that a couple of eternal and irredeemable semitrolls are representative samples. Due to Civilizational Inadequacy, I don’t think it’s possible to ever convince the field of AI or philosophy of anything even as basic as the Orthogonality Thesis, but even I am not cynical enough to think that Loosemore or Kruel are representative samples.
Thanks, Eliezer! I knew who XiXiDu is. (And if I hadn’t, I think the content of his posts makes it easy to infer.)
There are a variety of reasons I find this discussion useful at the moment, and decided to stir it up. In particular, ground-floor disputes like this can be handy for forcing me to taboo inferential-gap-laden ideas and to convert premises I haven’t thought about at much length into actual arguments. But one of my reasons is not ‘I think this is representative of what serious FAI discussions look like (or ought to look like)’, no.
Glad to hear. It is interesting data that you managed to bring in 3 big name trolls for a single thread, considering their previous dispersion and lack of interest.
Kruel hasn’t threatened to sue anyone for calling him an idiot, at least!
Pardon me, I’ve missed something. Who has threatened to sue someone for calling him an idiot? I’d have liked to see the inevitable “truth” defence.
Link.
Thank you for the clarification. While I have a certain hesitance to throw around terms like “irredeemable”, I do understand the frustration with a certain, let’s say, overconfident and persistent brand of misunderstanding and how difficult it can be to maintain a public forum in its presence.
My one suggestion is that, if the goal was to avoid RobbBB’s (wonderfully high-quality comments, by the way) confusion, a private message might have been better. If the goal was more generally to minimize the confusion for those of us who are newer or less versed in LessWrong lore, more description might have been useful (“a known and persistent troll” or whatever) rather than just providing a name from the enemies list.
Agreed.
Though actually, Eliezer used similar phrasing regarding Richard Loosemore and got downvoted for it (not just by me). Admittedly, “persistent troll” is less extreme than “permanent idiot,” but even so, the statement could be phrased to be more useful.
I’d suggest, “We’ve presented similar arguments to [person] already, and [he or she] remained unconvinced. Ponder carefully before deciding to spend much time arguing with [him or her].”
Not only is it less offensive this way, it does a better job of explaining itself. (Note: the “ponder carefully” section is quoting Eliezer; that part of his post was fine.)
Who has twice sworn off commenting on LW. So much for pre-commitments.
You will see a grammatical error as a mistake if you value grammar in general, or if you value being right in general.
A self-improving AI needs a goal. A goal of self-improvement alone would work. A goal of getting things right in general would work too, and be much safer, as it would include getting our intentions right as a sub-goal.
Although since “self-improvement” in this context basically refers to “improving your ability to accomplish goals”...
Stop me if this is a non sequitur, but surely “having accurate beliefs” and “acting on those beliefs in a particular way” are completely different things? I haven’t really been following this conversation, though.
As Robb said, you’re confusing mistake in the sense of “The program is doing something we don’t want it to do” with mistake in the sense of “The program has wrong beliefs about reality”.
I suppose a different way of thinking about these is “A mistaken human belief about the program” vs “A mistaken computer belief about the human”. We keep talking about the former (the program does something we didn’t know it would do), and you keep treating it as if it’s the latter.
Let’s say we have a program (not an AI, just a program) which uses Newton’s laws in order to calculate the trajectory of a ball. We want it to calculate this in order to have it move a tennis racket and hit the ball back. When it finally runs, we observe that the program always avoids the ball rather than hit it back. Is it because it’s calculating the trajectory of the ball wrongly? No, it calculates the trajectory very well indeed, it’s just that an instruction in the program was wrongly inserted so that the end result is “DO NOT hit the ball back”.
It knows what the “trajectory of the ball” is. It knows what “hit the ball” is. But its program is “DO NOT hit the ball” rather than “hit the ball”. Why? Because of a mistaken human belief about what the program would do, not the program’s mistaken belief.
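The tennis example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the flat court, and the `court_width` parameter are all hypothetical and not part of the original comment. The point it tries to show is that the prediction step is correct while the action step encodes the wrong instruction.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (no air resistance assumed)

def landing_x(x0, y0, vx, vy):
    """Horizontal position where a ball launched from (x0, y0) reaches height 0."""
    # Positive root of y0 + vy*t - 0.5*G*t^2 = 0.
    t = (vy + math.sqrt(vy * vy + 2 * G * y0)) / G
    return x0 + vx * t

def intended_racket_target(x0, y0, vx, vy):
    """What the programmers meant: put the racket where the ball lands."""
    return landing_x(x0, y0, vx, vy)

def racket_target(x0, y0, vx, vy, court_width=10.0):
    """What was actually written: same correct prediction, wrong instruction."""
    x = landing_x(x0, y0, vx, vy)
    # Hypothetical bug: move to whichever end of the court is farther from the ball.
    return 0.0 if x > court_width / 2 else court_width
```

Both functions call the same `landing_x`, so improving the physics would not change the behavior of `racket_target`. Nothing in the program represents the fact that avoiding the ball is unwanted.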
And you are confusing self-improving AIs with conventional programmes.
To be better able to respond to your comment, please let me know in what way you disagree with the following comparison between narrow AI and general AI:
Narrow artificial intelligence will be denoted NAI and general artificial intelligence GAI.
(1) Is it in principle capable of behaving in accordance with human intention to a sufficient degree?
NAI: True
GAI: True
(2) Under what circumstances does it fail to behave in accordance with human intention?
NAI: If it is broken, where broken stands for a wide range of failure modes such as incorrectly managing memory allocations.
GAI: In all cases in which it is not mathematically proven to be tasked with the protection of, and equipped with, a perfect encoding of all human values or a safe way to obtain such an encoding.
(3) What happens when it fails to behave in accordance with human intention?
NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash.
GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do.
(4) What happens if it is bound to use a limited amount of resources, use a limited amount of space or run for a limited amount of time?
NAI: It will only ever do what it was programmed to do. As long as there is no fatal flaw, harming its general functionality, it will work within the defined boundaries as intended.
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
Please let me also know where you disagree with the following points:
(1) The abilities of systems are part of human preferences as humans intend to give systems certain capabilities and, as a prerequisite to build such systems, have to succeed at implementing their intentions.
(2) Error detection and prevention is such a capability.
(3) Something that is not better than humans at preventing errors is no existential risk.
(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.
(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.
GAI is a program. It always does what it’s programmed to do. That’s the problem—a program that was written incorrectly will generally never do what it was intended to do.
FWIW, I find your statements 3, 4, 5 also highly objectionable, on the grounds that you are lumping a large class of things under the blank label “errors”. Is an “error” doing something that humans don’t want? Is it doing something the agent doesn’t want? Is it accidentally mistyping a letter in a program, causing a syntax error, or thinking about something heuristically and coming to the wrong conclusion, then making a carefully planned decision based on that mistake? Automatic proof systems don’t save you if what you think you need to prove isn’t actually what you need to prove.
So self-correcting software is impossible. Is self improving software possible?
Self-correcting software is possible if there’s a correct implementation of what “correctness” means, and the module that has the correct implementation has control over the modules that don’t have the correct implementation.
Self-improving software is likewise possible if there’s a correct implementation of the definition of “improvement”.
Right now, I’m guessing that it’d be relatively easy to programmatically define “performance improvement” and difficult to define “moral and ethical improvement”.
First list:
1) Poorly defined terms “human intention” and “sufficient”.
2) Possibly under any circumstances whatsoever, if it’s anything like other non-trivial software, which always has some bugs.
3) Anything from “you may not notice” to “catastrophic failure resulting in deaths”. The claim that software that fails to work as humans intend will “generally fail in a way that is harmful to its own functioning” is unsupported. E.g. a spreadsheet works fine if the floating point math is off in the 20th bit of the mantissa. The answers will be wrong, but there is nothing about that that the spreadsheet could be expected to care about.
4) Not necessarily. GAI may continue to try to do what it was programmed to do, and only unintentionally destroy a small city in the process :)
Second list:
1) Wrong. The abilities of sufficiently complex systems are a huge space of events humans haven’t thought about yet, and so do not yet have preferences about. There is no way to know what their preferences would or should be for many many outcomes.
2) Error as failure to perform the requested action may take precedence over error as failure to anticipate hypothetical objections from some humans to something they hadn’t expected. For one thing, it is more clearly defined. We already know human-level intelligences act this way.
3) Asteroids and supervolcanoes are not better than humans at preventing errors. It is perfectly possible for something stupid to be able to kill you. Therefore something with greater cognitive and material resources than you, but still with the capacity to make mistakes can certainly kill you. For example, a government.
4) It is already possible for a very fallible human to make something that is better than humans at detecting certain kinds of errors.
5) No. Unless by dramatic you mean “impossibly perfect, magical and universal”.
Two points:
Firstly, “fails in a way that is harmful to its own functioning” appears to be tautological.
Secondly, you seem to be listing things that apply to any kind of AI in the NAI section—is this intentional? (This happens throughout your comment, in fact.)
Software that cares what you mean will be selected for by market forces.
Software that initially appears to care what you mean will be selected by market forces. But nearly all software that superficially looks Friendly isn’t Friendly. If there are seasoned AI researchers who can’t wrap their heads around the five theses, then how can I be confident that the Invisible Hand will both surpass them intellectually and recurrently sacrifice short-term gains on this basis?
Software that looks friendly isn’t really friendly in the sense that it really understands what we want. It isn’t dangerously unfriendly because we’re still here. If it’s commercially successful, it’s friendly enough for us to want it in our lives.
Human beings aren’t friendly, in the Friendly-AI sense. If a random human acquired immense power, it would probably result in an existential catastrophe. Humans do have a better sense of human value than, say, a can-opener does; they have more power and autonomy than a can-opener, so they need fuller access to human values in order to reach similar safety levels. A superintelligent AI would require even more access to human values to reach comparable safety levels.
There is more than one sense to friendly AI.
If you grafted absolute power onto a human with average ethical insight, you might get absolute corruption. But what is that analogous to in AI terms? Why assume asymmetric development by default?
If you assume a top-down singleton AI with a walled-off ethics module, things look difficult. If you reverse these assumptions, FAI is already happening.
XiXiDu, I get the impression you’ve never coded anything. Is that accurate?
Present-day everyday software (e.g. Google Maps, Siri) is better at doing what humans mean. It is not better at understanding humans. Learning programs like the one that runs PARO appear to be good at understanding humans, but are actually following a very simple utility function (in the decision sense, not the experiential sense); they change their behaviour in response to programmed cues, generally by doing more/less of actions associated with those cues (example: PARO “likes” being stroked and will do more of things that tend to precede stroking). In each case of a program that improves itself, it has a simple thing it “wants” to optimise and makes changes according to how well it seems to be doing.
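A minimal sketch of that kind of cue-driven learning, in Python. This is not PARO’s actual software, which I have not seen; the class name, the action names, and the update rule are assumptions made up for illustration. It only shows how “do more of whatever tends to precede stroking” can be a few lines of weight adjustment with no model of the human at all.

```python
import random

class CueLearner:
    """Hypothetical toy: picks actions in proportion to learned weights."""

    def __init__(self, actions, step=1.0, floor=0.1):
        self.weights = {a: 1.0 for a in actions}
        self.step = step
        self.floor = floor  # keeps every action possible, so weights stay positive

    def choose(self, rng=random):
        # Sample one action; higher weight means chosen more often.
        actions = list(self.weights)
        return rng.choices(actions, weights=[self.weights[a] for a in actions])[0]

    def feedback(self, action, stroked):
        # The programmed cue (being stroked) is the only signal the learner uses.
        if stroked:
            self.weights[action] += self.step
        else:
            self.weights[action] = max(self.floor, self.weights[action] - self.step)
```

After a handful of calls such as `learner.feedback("squeak", True)`, the toy squeaks far more often. It looks as if it has learned what its owner likes, but the only thing it represents is a table of numbers.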
Making software that understands humans at all is beyond our current capabilities. Theory of mind, the ability to recognise agents and see them as having desires of their own, is something we have no idea how to produce; we don’t even know how humans have it. General intelligence is an enormous step beyond programming something like Siri. Siri is “just” interpreting vocal commands as text (which requires no general intelligence), matching that to a list of question structures (which requires no general intelligence; Siri does not have to understand what the word “where” means to know that Google Maps may be useful for that type of question) and delegating to Web services, with a layer of learning code to produce more of the results you liked (i.e., that made you stop asking related questions) in the past. Siri is using a very small built-in amount of knowledge and an even smaller amount of learned knowledge to fake understanding, but it’s just pattern-matching. While the second step is the root of general intelligence, it’s almost all provided by humans who understood that “where” means a question is probably to do with geography; Siri’s ability to improve this step is virtually nonexistent.
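To make the “matching to a list of question structures” step concrete, here is a small Python sketch. It is an assumption-laden toy, not a description of how Siri is implemented: the `ROUTES` table, the service names, and the `route` function are all hypothetical. The knowledge that “where” suggests a maps service lives in a table a human wrote.

```python
import re

# Hand-written by a human who understands the words; the program does not.
ROUTES = [
    ("where", "maps"),
    ("weather", "weather"),
    ("buy", "shopping"),
]

def route(query):
    """Return the name of the service to delegate a query to."""
    words = re.findall(r"[a-z']+", query.lower())
    for keyword, service in ROUTES:
        if keyword in words:
            return service
    return "web_search"  # fallback when no keyword matches
```

For example, `route("Where is the nearest bakery?")` gives `"maps"`, which looks like understanding. But `route("I don't care where, just buy milk")` also gives `"maps"`, because the first keyword in the table wins regardless of what the sentence means.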
The more powerful something is, the more dangerous it is. A very stupid adult is much more dangerous than a very intelligent child because adults are allowed to drive cars. Driving a car requires very little intelligence and no general intelligence whatsoever (we already have robots that can do a pretty good job), but can go catastrophically wrong very easily. Holding an intelligent conversation requires huge amounts of specialised intelligence and often requires general intelligence, but nothing a four-year-old says is likely to kill people.
It’s much easier to make a program that does a good job at task-completion, and is therefore given considerable power and autonomy (Siri, for example), than it is to make sure that the program never does stupid things with its power. Developing software we already have could easily lead to programs being assigned large amounts of power (e.g., “Siri 2, buy me a ticket to New York”, which would almost always produce the appropriate kind of ticket), but I certainly wouldn’t trust such programs to never make colossal screw-ups. (Siri 2 will only tell you that you can’t afford a ticket if a human programmer thought that might be important, because Siri 2 does not care that you need to buy groceries, because it does not understand that you exist.)
I hope I have convinced you that present software only fakes understanding and that developing it will not produce software that can do better than an intelligent human with the same resources. Siri 2 will not be more than a very useful tool, and neither will Siri 5. Software does not stop caring because it has never cared.
It is very easy (relatively speaking) to produce code that can fake understanding and act like it cares about your objectives, because this merely requires a good outline of the sort of things the code is likely to be wanted for. (This is the second stage of Siri outlined above, where Siri refers to a list saying that “where” means that Google Maps is probably the best service to outsource to.) Making code that does more of the things that get good results is also very easy.
Making code that actually cares requires outlining exactly what the code is really and truly wanted to do. You can’t delegate this step by saying “Learn what I care about and then satisfy me” because that’s just changing what you want the code to do. It might or might not be easier than saying “This is what I care about, satisfy me”, but at some stage you have to say what you want done exactly right or the code will do something else. (Currently getting it wrong is pretty safe because computers have little autonomy and very little general intelligence, so they mostly do nothing much; getting it wrong with a UFAI is dangerous because the AI will succeed at doing the wrong thing, probably on a big scale.) This is the only kind of code you can trust to program itself and to have significant power, because it’s the only kind that will modify itself right.
You can’t progress Siri into an FAI, no matter how much you know about producing general intelligence. You need to know either Meaning-in-General, Preferences-in-General or exactly what Human Preferences are, or you won’t get what you hoped for.
Another perspective: the number of humans in history who were friendly is very, very small. The number of humans who are something resembling capital-F Friendly is virtually nil. Why should “an AI created by humans to care” be Friendly, or even friendly? Unless friendliness or Friendliness is your specific goal, you’ll probably produce software that is friendly-to-the-maker (or maybe Friendly-to-the-maker, if making Friendly code really is as easy as you seem to think). Who would you trust with a superintelligence that did exactly what they said? Who would you trust with a superintelligence that did exactly what they really wanted, not what they said? I wouldn’t trust my mother with either, and she’s certainly highly intelligent and has my best interests at heart. I’d need a fair amount of convincing to trust me with either. Most humans couldn’t program AIs that care because most humans don’t care themselves, let alone know how to express it.
So you believe that “understanding” is an all-or-nothing capability? I never intended to use “understanding” like this. My use of the term is such that if my speech recognition software correctly transcribes 98% of what I am saying, then it is better at understanding how certain sounds are related to certain strings of characters than software that correctly transcribes 95% of what I said.
One enormous step or a huge number of steps? If the former, what makes you think so? If the latter, then at what point do better versions of Siri start acting in catastrophic ways?
Most of what humans understand is provided by other humans who themselves got another cruder version from other humans.
If an AI is not supposed to take over the world, then from the perspective of humans it is mistaken to take over the world. Humans got something wrong about the AI design if it takes over the world. Now if it needs to solve a minimum of N problems correctly in order to take over the world, then this means that it succeeded N times at being generally intelligent while executing a stupid thing. The question that arises here is whether it is more likely for humans to build an AI that works perfectly well along a number of dimensions at doing a stupid thing than an AI that fails at doing a stupid thing because it does other stupid things as well?
Sure, I do not disagree with this at all. AI will very likely lead to catastrophic events. I merely disagree with the dumb superintelligence scenario.
In other words, humans are likely to fail at AI in such a way that it works perfectly well in a catastrophic way.
I certainly do not reject that general AI is extremely dangerous in the hands of unfriendly humans and that only a friendly AI that takes over the world could eventually prevent a catastrophe. I am rejecting the dumb superintelligence scenario.