Software will be superhumanly good at understanding what humans mean, but catastrophically worse than all previous generations of software at doing what humans mean.
Every bit of additional functionality requires huge amounts of HUMAN development and testing, not in order to compile and run (that’s easy), but in order to WORK AS YOU WANT IT TO.
I can fully believe that a superhuman intelligence examining you will be fully capable of calculating “what you mean”, “what you want”, “what you fear”, “what would be funniest for a BuzzFeed article if I pretended to misunderstand your statement as meaning”, “what would be best for you according to your values”, “what would be best for you according to your cat’s values”, and “what would be best for you according to Genghis Khan’s values”.
No program now cares about what you mean. You’ve still not given any reason for future software to care about “what you mean” over all those other calculations, either.
I kind of doubt that autocorrect software really changed “meditating” to “masturbating”. Because of stuff like this. Edit: And because, starting at the left and working rightward, they share only one letter before diverging, and because I’ve seen a spell-checker with special behavior for dirty/curse words (not suggesting them as corrected spellings, but also not complaining about them as unrecognized words). (This is the one spell-checker whose behavior with dirty/curse words I decided, out of curiosity, to check, so I bet it’s common.) Edit 2: Also, from a causal-history perspective of why I doubt it, rather than a normative-justification perspective, there’s the fact that Yvain linked it and said something like “I don’t care if these are real.” Edit 3: typo.
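One way to make that doubt concrete: the two words are far apart by edit distance, and autocorrect engines typically only propose candidates within a small distance of what was typed. A minimal sketch, assuming a plain Levenshtein metric (real autocorrect engines use fancier models, so this is only illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("meditating", "masturbating"))  # → 6
```

Six edits on a ten-letter word is well outside the one-or-two-typo range a spell-checker normally considers.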
No program now cares about what you mean. You’ve still not given any reason for future software to care about “what you mean” over all those other calculations, either.
I agree that current software products fail, as in your autocorrect example. But how could a seed AI make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
Imagine it making similar mistakes in any of the problems it is required to solve in order to overpower humanity. And if humans succeeded in making it avoid such mistakes along the way to overpowering humanity, how did they selectively fail at making it not want to overpower humanity in the first place? How likely is that?
But how could a seed AI make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
Those are only ‘mistakes’ if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it’s not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).
A machine programmed to terminally value the outputs of a modern-day autocorrect will never self-modify to improve on that algorithm or its outputs (because that would violate its terminal values). The fact that this seems silly to a human doesn’t provide any causal mechanism for the AI to change its core preferences. Have we successfully coded the AI not to do things that humans find silly, and to prize un-silliness before all other things? If not, then where will that value come from?
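The point can be illustrated with a toy agent (all names hypothetical): if candidate self-modifications are scored by the agent’s current terminal values, a fix that humans would call an improvement simply loses the comparison.

```python
def autocorrect_v1(word):
    # Deliberately bad legacy behavior -- the agent's terminal standard.
    return {"meditating": "masturbating"}.get(word, word)

def intended_fix(word):
    # The "improvement" a human would want: leave the word alone.
    return word

def utility(policy, terminal_standard):
    # The agent scores policies by agreement with its CURRENT values,
    # i.e. with autocorrect_v1 -- not with human intentions.
    words = ["meditating", "running"]
    return sum(policy(w) == terminal_standard(w) for w in words)

current = autocorrect_v1
# A self-modification is adopted only if it scores better under the
# agent's existing terminal values:
if utility(intended_fix, current) > utility(current, current):
    current = intended_fix

print(current("meditating"))  # still "masturbating" -- the fix is rejected
```

Nothing in this loop supplies a causal route from “a human would find this silly” to “the comparison comes out differently.”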
A belief can be factually wrong. A non-representational behavior (or dynamic) is never factually right or wrong, only normatively right or wrong. (And that normative wrongness only constrains what actually occurs to the extent the norm is one a sufficiently powerful agent in the vicinity actually holds.)
Maybe that distinction is the one that’s missing. You’re assuming that an AI will be capable of optimizing for true beliefs if and only if it is also optimizing for possessing human norms. But, by the is/ought distinction, there is no true belief about the physical world that will spontaneously force a being that holds it to become more virtuous, if it didn’t already have a relevant seed of virtue within itself.
I’m confused as to the reason for the warning/outing, especially since the community seems to be doing an excellent job of dealing with his somewhat disjointed arguments. Downvotes, refutation, or banning in extreme cases are all viable forum-preserving responses. Publishing a dissenter’s name seems at best bad manners and at worst rather crass intimidation.
I only did a quick search on him and although some of the behavior was quite obnoxious, is there anything I’ve missed that justifies this?
XiXiDu wasn’t attempting or requesting anonymity—his LW profile openly lists his true name—and Alexander Kruel is someone with known problems (and a blog openly run under his true name) whom RobbBB might not know offhand was the same person as “XiXiDu” although this is public knowledge, nor might RobbBB realize that XiXiDu had the same irredeemable status as Loosemore.
I would not randomly out an LW poster for purposes of intimidation—I don’t think I’ve ever looked at a username’s associated private email address. Ever. Actually I’m not even sure offhand if our registration process requires/verifies that or not, since I was created as a pre-existing user at the dawn of time.
I do consider RobbBB’s work highly valuable and I don’t want him to feel disheartened by mistakenly thinking that a couple of eternal and irredeemable semitrolls are representative samples. Due to Civilizational Inadequacy, I don’t think it’s possible to ever convince the field of AI or philosophy of anything even as basic as the Orthogonality Thesis, but even I am not cynical enough to think that Loosemore or Kruel are representative samples.
Thanks, Eliezer! I knew who XiXiDu is. (And if I hadn’t, I think the content of his posts makes it easy to infer.)
There are a variety of reasons I find this discussion useful at the moment, and decided to stir it up. In particular, ground-floor disputes like this can be handy for forcing me to taboo inferential-gap-laden ideas and to convert premises I haven’t thought about at much length into actual arguments. But one of my reasons is not ‘I think this is representative of what serious FAI discussions look like (or ought to look like)’, no.
Glad to hear. It is interesting data that you managed to bring in 3 big name trolls for a single thread, considering their previous dispersion and lack of interest.
Thank you for the clarification. While I have a certain hesitance to throw around terms like “irredeemable”, I do understand the frustration with a certain, let’s say, overconfident and persistent brand of misunderstanding and how difficult it can be to maintain a public forum in its presence.
My one suggestion is that, if the goal was to avoid RobbBB’s (wonderfully high-quality comments, by the way) confusion, a private message might have been better. If the goal was more generally to minimize the confusion for those of us who are newer or less versed in LessWrong lore, more description might have been useful (“a known and persistent troll” or whatever) rather than just providing a name from the enemies list.
Though actually, Eliezer used similar phrasing regarding Richard Loosemore and got downvoted for it (not just by me). Admittedly, “persistent troll” is less extreme than “permanent idiot,” but even so, the statement could be phrased to be more useful.
I’d suggest, “We’ve presented similar arguments to [person] already, and [he or she] remained unconvinced. Ponder carefully before deciding to spend much time arguing with [him or her].”
Not only is it less offensive this way, it does a better job of explaining itself. (Note: the “ponder carefully” section is quoting Eliezer; that part of his post was fine.)
Those are only ‘mistakes’ if you value human intentions. A grammatical error is only an error because we value the specific rules of grammar we do; it’s not the same sort of thing as a false belief (though it may stem from, or result in, false beliefs).
You will see a grammatical error as a mistake if you value grammar in general, or if you value being right in general.
A self-improving AI needs a goal. A goal of self-improvement alone would work. A goal of getting things right in general would work too, and be much safer, as it would include getting our intentions right as a sub-goal.
Although since “self-improvement” in this context basically refers to “improving your ability to accomplish goals”...
You will see a grammatical error as a mistake if you value grammar in general, or if you value being right in general.
Stop me if this is a non sequitur, but surely “having accurate beliefs” and “acting on those beliefs in a particular way” are completely different things? I haven’t really been following this conversation, though.
But how could a seed AI make itself superhumanly powerful if it did not care about avoiding mistakes such as autocorrecting “meditating” to “masturbating”?
As Robb said, you’re confusing “mistake” in the sense of “the program is doing something we don’t want it to do” with “mistake” in the sense of “the program has wrong beliefs about reality”.
I suppose a different way of thinking about these is “a mistaken human belief about the program” vs. “a mistaken computer belief about the human”. We keep talking about the former (the program does something we didn’t know it would do), and you keep treating it as if it’s the latter.
Let’s say we have a program (not an AI, just a program) which uses Newton’s laws in order to calculate the trajectory of a ball. We want it to calculate this in order to have it move a tennis racket and hit the ball back.
When it finally runs, we observe that the program always avoids the ball rather than hit it back. Is it because it’s calculating the trajectory of the ball wrongly? No, it calculates the trajectory very well indeed, it’s just that an instruction in the program was wrongly inserted so that the end result is “DO NOT hit the ball back”.
It knows what the “trajectory of the ball” is. It knows what “hit the ball” is. But its program says “DO NOT hit the ball” rather than “hit the ball”. Why? Because of a mistaken human belief about what the program would do, not the program’s mistaken belief.
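The tennis example can be sketched in a few lines (hypothetical names, one-dimensional physics) to show where the “mistake” lives:

```python
def ball_position_at_racket(x0, v, t):
    # The program's "beliefs" are fine: correct, simple kinematics.
    return x0 + v * t

def move_racket(ball_x):
    # The wrongly inserted instruction: a stray negation.  The program
    # still "knows" exactly where the ball will be; it just does the
    # opposite of what its authors intended with that knowledge.
    return -ball_x  # intended: return ball_x

predicted = ball_position_at_racket(x0=0.0, v=2.0, t=3.0)  # 6.0 -- correct belief
racket = move_racket(predicted)                            # -6.0 -- avoids the ball
```

The mistaken belief here was the programmers’ belief about their own program, not the program’s belief about the ball.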
To be better able to respond to your comment, please let me know in what way you disagree with the following comparison between narrow AI and general AI:
Narrow artificial intelligence will be denoted NAI and general artificial intelligence GAI.
(1) Is it in principle capable of behaving in accordance with human intention to a sufficient degree?
NAI: True
GAI: True
(2) Under what circumstances does it fail to behave in accordance with human intention?
NAI: If it is broken, where broken stands for a wide range of failure modes such as incorrectly managing memory allocations.
GAI: In all cases in which it is not mathematically proven to be tasked with the protection of, and equipped with, a perfect encoding of all human values or a safe way to obtain such an encoding.
(3) What happens when it fails to behave in accordance with human intention?
NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash.
GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do.
(4) What happens if it is bound to use a limited amount of resources, use a limited amount of space or run for a limited amount of time?
NAI: It will only ever do what it was programmed to do. As long as there is no fatal flaw, harming its general functionality, it will work within the defined boundaries as intended.
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
Please let me also know where you disagree with the following points:
(1) The abilities of systems are part of human preferences, as humans intend to give systems certain capabilities and, as a prerequisite to building such systems, have to succeed at implementing their intentions.
(2) Error detection and prevention is such a capability.
(3) Something that is not better than humans at preventing errors is not an existential risk.
(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.
(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
GAI is a program. It always does what it’s programmed to do. That’s the problem—a program that was written incorrectly will generally never do what it was intended to do.
FWIW, I find your statements 3, 4, and 5 also highly objectionable, on the grounds that you are lumping a large class of things under the blank label “errors”. Is an “error” doing something that humans don’t want? Is it doing something the agent doesn’t want? Is it accidentally mistyping a letter in a program, causing a syntax error, or thinking about something heuristically, coming to the wrong conclusion, and then making a carefully planned decision based on that mistake? Automatic proof systems don’t save you if what you think you need to prove isn’t actually what you need to prove.
GAI is a program. It always does what it’s programmed to do. That’s the problem—a program that was written incorrectly will generally never do what it was intended to do.
So self-correcting software is impossible. Is self improving software possible?
Self-correcting software is possible if there’s a correct implementation of what “correctness” means, and the module that has the correct implementation has control over the modules that don’t have the correct implementation.
Self-improving software is likewise possible if there’s a correct implementation of the definition of “improvement”.
Right now, I’m guessing that it’d be relatively easy to programmatically define “performance improvement” and difficult to define “moral and ethical improvement”.
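Khoth’s condition amounts to a simple architecture (a hypothetical sketch): one fixed, trusted checker gates every self-modification, so the whole scheme is exactly as good as that checker.

```python
def checker(candidate_sort):
    # The fixed notion of "correct".  This module is never revised, so
    # any flaw here propagates into every accepted "improvement".
    cases = [[3, 1, 2], [], [5, 5, 1]]
    return all(candidate_sort(list(c)) == sorted(c) for c in cases)

def current_sort(xs):  # correct baseline implementation
    return sorted(xs)

def candidate_a(xs):   # a buggy proposed "improvement"
    return xs

def candidate_b(xs):   # a correct proposed improvement
    xs.sort()
    return xs

for candidate in (candidate_a, candidate_b):
    if checker(candidate):
        current_sort = candidate  # self-modification, gated by the checker
```

Writing the `cases`-style check for “sorts correctly” is easy; writing the analogous check for “is a moral and ethical improvement” is the open problem.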
1) Poorly defined terms “human intention” and “sufficient”.
2) Possibly under any circumstances whatsoever, if it’s anything like other non-trivial software, which always has some bugs.
3) Anything from “you may not notice” to “catastrophic failure resulting in deaths”. The claim that software failing to work as humans intend will “generally fail in a way that is harmful to its own functioning” is unsupported. E.g. a spreadsheet works fine if the floating-point math is off in the 20th bit of the mantissa. The answers will be wrong, but there is nothing about that that the spreadsheet could be expected to care about.
4) Not necessarily. GAI may continue to try to do what it was programmed to do, and only unintentionally destroy a small city in the process :)
Second list:
1) Wrong. The abilities of sufficiently complex systems are a huge space of events humans haven’t thought about yet, and so do not yet have preferences about. There is no way to know what their preferences would or should be for many many outcomes.
2) Error as failure to perform the requested action may take precedence over error as failure to anticipate hypothetical objections from some humans to something they hadn’t expected. For one thing, it is more clearly defined. We already know human-level intelligences act this way.
3) Asteroids and supervolcanoes are not better than humans at preventing errors. It is perfectly possible for something stupid to be able to kill you. Therefore something with greater cognitive and material resources than you, but still with the capacity to make mistakes can certainly kill you. For example, a government.
4) It is already possible for a very fallible human to make something that is better than humans at detecting certain kinds of errors.
5) No. Unless by dramatic you mean “impossibly perfect, magical and universal”.
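The spreadsheet example from point 3 is easy to demonstrate: perturb a value in roughly the 20th bit of the mantissa and nothing in the program’s own operation is disturbed; only the human-relevant answer is wrong.

```python
x = 1.0
x_buggy = x + 2.0 ** -20   # an error of about one part in a million

total = x_buggy * 1000.0   # the "spreadsheet" computes on without complaint
error = total - 1000.0     # ~0.00095 -- wrong, but nothing crashes,
                           # and nothing in the program "cares"
```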
(3) What happens when it fails to behave in accordance with human intention?
NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash.
GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact except that it completely fails at working as intended in such a way as to destroy all human value in the universe. It will be able to improve itself and capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do.
Firstly, “fails in a way that is harmful to its own functioning” appears to be tautological.
Secondly, you seem to be listing things that apply to any kind of AI in the NAI section—is this intentional? (This happens throughout your comment, in fact.)
Software that initially appears to care what you mean will be selected by market forces. But nearly all software that superficially looks Friendly isn’t Friendly. If there are seasoned AI researchers who can’t wrap their heads around the five theses, then how can I be confident that the Invisible Hand will both surpass them intellectually and recurrently sacrifice short-term gains on this basis?
Software that looks friendly isn’t really friendly in the sense that it really understands what we want. It isn’t dangerously unfriendly, because we’re still here. If it’s commercially successful, it’s friendly enough for us to want it in our lives.
Human beings aren’t friendly, in the Friendly-AI sense. If a random human acquired immense power, it would probably result in an existential catastrophe. Humans do have a better sense of human value than, say, a can-opener does; they have more power and autonomy than a can-opener, so they need fuller access to human values in order to reach similar safety levels. A superintelligent AI would require even more access to human values to reach comparable safety levels.
If you grafted absolute power onto a human with average ethical insight, you might get absolute corruption. But what is that analogous to in AI terms? Why assume asymmetric development by default?
If you assume a top-down singleton AI with a walled-off ethics module, things look difficult. If you reverse those assumptions, FAI is already happening.
http://www.buzzfeed.com/jessicamisener/the-30-most-hilarious-autocorrect-struggles-ever
No fax or photocopier ever autocorrected your words from “meditating” to “masturbating”.
No program now cares about what you mean. You’ve still not given any reason for future software to care about “what you mean” over all those other calculations, either.
I kind of doubt that autocorrect software really changed “meditating” to “masturbating”. Because of stuff like this. Edit: And because, starting at the left and working rightward, they share only one letter before diverging, and because I’ve seen a spell-checker with special behavior for dirty/curse words (not suggesting them as corrected spellings, but also not complaining about them as unrecognized words). (This is the one spell-checker whose behavior with dirty/curse words I decided, out of curiosity, to check, so I bet it’s common.) Edit 2: Also, from a causal-history perspective of why I doubt it, rather than a normative-justification perspective, there’s the fact that Yvain linked it and said something like “I don’t care if these are real.” Edit 3: typo.
To be fair, that is a fairly representative example of bad autocorrects. (I once had a text message autocorrect to “We are terrorist.”)
Meaning they don’t care about anything? They care about something else? What?
I’ll tell you one thing: the marketplace will select agents that act as if they care.
It also looks like user Juno_Watt is some type of systematic troll, probably a sockpuppet for someone else, haven’t bothered investigating who.
I can’t work out how this relates to the thread it appears in.
Warning as before: XiXiDu = Alexander Kruel.
Kruel hasn’t threatened to sue anyone for calling him an idiot, at least!
Pardon me, I’ve missed something. Who has threatened to sue someone for calling him an idiot? I’d have liked to see the inevitable “truth” defence.
Link.
Who has twice sworn off commenting on LW. So much for pre-commitments.
As Robb said, you’re confusing “mistake” in the sense of “the program is doing something we don’t want it to do” with “mistake” in the sense of “the program has wrong beliefs about reality”.
I suppose a different way of thinking about these is “a mistaken human belief about the program” vs. “a mistaken computer belief about the human”. We keep talking about the former (the program does something we didn’t know it would do), and you keep treating it as if it’s the latter.
Let’s say we have a program (not an AI, just a program) which uses Newton’s laws in order to calculate the trajectory of a ball. We want it to calculate this in order to have it move a tennis racket and hit the ball back. When it finally runs, we observe that the program always avoids the ball rather than hit it back. Is it because it’s calculating the trajectory of the ball wrongly? No, it calculates the trajectory very well indeed, it’s just that an instruction in the program was wrongly inserted so that the end result is “DO NOT hit the ball back”.
It knows what the “trajectory of the ball” is. It knows what “hit the ball” is. But its program says “DO NOT hit the ball” rather than “hit the ball”. Why? Because of a mistaken human belief about what the program would do, not the program’s mistaken belief.
And you are confusing self-improving AIs with conventional programmes.
To be better able to respond to your comment, please let me know in what way you disagree with the following comparison between narrow AI and general AI:
Narrow artificial intelligence will be denoted NAI and general artificial intelligence GAI.
(1) Is it in principle capable of behaving in accordance with human intention to a sufficient degree?
NAI: True
GAI: True
(2) Under what circumstances does it fail to behave in accordance with human intention?
NAI: If it is broken, where broken stands for a wide range of failure modes such as incorrectly managing memory allocations.
GAI: In all cases in which it is not mathematically proven to be tasked with the protection of, and equipped with, a perfect encoding of all human values or a safe way to obtain such an encoding.
(3) What happens when it fails to behave in accordance with human intention?
NAI: It crashes, freezes or halts. It generally fails in a way that is harmful to its own functioning. If for example an autonomous car fails at driving autonomously it usually means that it will either go into safe-mode and halt or crash.
GAI: It works perfectly well. Superhumanly well. All its intended capabilities are intact, except that it completely fails at working as intended, in such a way as to destroy all human value in the universe. It will be able to improve itself and will be capable of obtaining a perfect encoding of human values. It will use those intended capabilities in order to deceive and overpower humans rather than doing what it was intended to do.
(4) What happens if it is bound to use a limited amount of resources, use a limited amount of space or run for a limited amount of time?
NAI: It will only ever do what it was programmed to do. As long as there is no fatal flaw harming its general functionality, it will work within the defined boundaries as intended.
GAI: It will never do what it was programmed to do and always remove or bypass its intended limitations in order to pursue unintended actions such as taking over the universe.
Please let me also know where you disagree with the following points:
(1) The abilities of systems are part of human preferences as humans intend to give systems certain capabilities and, as a prerequisite to build such systems, have to succeed at implementing their intentions.
(2) Error detection and prevention is such a capability.
(3) Something that is not better than humans at preventing errors is not an existential risk.
(4) Without a dramatic increase in the capacity to detect and prevent errors it will be impossible to create something that is better than humans at preventing errors.
(5) A dramatic increase in the human capacity to detect and prevent errors is incompatible with the creation of something that constitutes an existential risk as a result of human error.
GAI is a program. It always does what it’s programmed to do. That’s the problem—a program that was written incorrectly will generally never do what it was intended to do.
FWIW, I find your statements 3, 4 and 5 also highly objectionable, on the grounds that you are lumping a large class of things under the blank label “errors”. Is an “error” doing something that humans don’t want? Is it doing something the agent doesn’t want? Is it accidentally mistyping a letter in a program, causing a syntax error, or thinking about something heuristically, coming to the wrong conclusion, then making a carefully planned decision based on that mistake? Automatic proof systems don’t save you if what you think you need to prove isn’t actually what you need to prove.
So self-correcting software is impossible. Is self improving software possible?
Self-correcting software is possible if there’s a correct implementation of what “correctness” means, and the module that has the correct implementation has control over the modules that don’t have the correct implementation.
Self-improving software are likewise possible if there’s a correct implementation of the definition of “improvement”.
Right now, I’m guessing that it’d be relatively easy to programmatically define “performance improvement” and difficult to define “moral and ethical improvement”.
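Under that framing, a toy sketch of the idea (entirely hypothetical names, and assuming the trusted specification module is itself correct, which is the whole difficulty):

```python
import random

def trusted_spec(f):
    """The module with control: it defines what 'correct' means.
    Here the spec is trivially checkable (f must double its input);
    the scheme is only ever as safe as this definition."""
    return all(f(x) == 2 * x for x in range(100))

def propose_variant():
    """Stand-in for self-modification: candidate rewrites of the
    program, most of which are wrong."""
    k = random.choice([1, 2, 3])
    return lambda x, k=k: k * x

current = lambda x: 2 * x
for _ in range(20):
    candidate = propose_variant()
    # The trusted module has veto power over every rewrite, so
    # correctness is preserved across arbitrary self-modification.
    if trusted_spec(candidate):
        current = candidate

print(trusted_spec(current))  # still correct after all rewrites
```

Note that “performance improvement” would slot in as an easily computable comparison between `current` and `candidate`; “moral and ethical improvement” has no analogous one-line spec, which is exactly the asymmetry the comment points at.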
First list:
1) Poorly defined terms “human intention” and “sufficient”.
2) Possibly under any circumstances whatsoever, if it’s anything like other non-trivial software, which always has some bugs.
3) Anything from “you may not notice” to “catastrophic failure resulting in deaths”. The claim that failure of software to work as humans intend will “generally fail in a way that is harmful to its own functioning” is unsupported. E.g. a spreadsheet works fine if the floating point math is off in the 20th bit of the mantissa. The answers will be wrong, but there is nothing about that that the spreadsheet could be expected to care about.
4) Not necessarily. GAI may continue to try to do what it was programmed to do, and only unintentionally destroy a small city in the process :)
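The spreadsheet point in (3) can be made concrete: flipping a bit 20 places down the mantissa of an IEEE 754 double produces a relative error of roughly one part in a million, wrong but invisible to the program’s own functioning (illustrative sketch, hypothetical values):

```python
import struct

def flip_mantissa_bit(x, bit):
    """Flip one bit of a double's 52-bit fraction
    (bit 0 = least significant fraction bit)."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    return struct.unpack('<d', struct.pack('<Q', bits ^ (1 << bit)))[0]

total = 1234.5678
# "off in the 20th bit of the mantissa", counting from the top:
corrupted = flip_mantissa_bit(total, 52 - 20)
rel_err = abs(corrupted - total) / total
# The answer is wrong, but only by about one part in a million,
# and nothing in the spreadsheet's own functioning notices or cares.
print(corrupted, rel_err)
```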
Second list:
1) Wrong. The abilities of sufficiently complex systems are a huge space of events humans haven’t thought about yet, and so do not yet have preferences about. There is no way to know what their preferences would or should be for many many outcomes.
2) Error as failure to perform the requested action may take precedence over error as failure to anticipate hypothetical objections from some humans to something they hadn’t expected. For one thing, it is more clearly defined. We already know human-level intelligences act this way.
3) Asteroids and supervolcanoes are not better than humans at preventing errors. It is perfectly possible for something stupid to be able to kill you. Therefore something with greater cognitive and material resources than you, but still with the capacity to make mistakes can certainly kill you. For example, a government.
4) It is already possible for a very fallible human to make something that is better than humans at detecting certain kinds of errors.
5) No. Unless by dramatic you mean “impossibly perfect, magical and universal”.
Two points:
Firstly, “fails in a way that is harmful to its own functioning” appears to be tautological.
Secondly, you seem to be listing things that apply to any kind of AI in the NAI section—is this intentional? (This happens throughout your comment, in fact.)
Software that cares what you mean will be selected for by market forces.
Software that initially appears to care what you mean will be selected by market forces. But nearly all software that superficially looks Friendly isn’t Friendly. If there are seasoned AI researchers who can’t wrap their heads around the five theses, then how can I be confident that the Invisible Hand will both surpass them intellectually and recurrently sacrifice short-term gains on this basis?
Software that looks friendly isn’t really friendly in the sense that it really understands what we want. It isn’t dangerously unfriendly, because we’re still here. If it’s commercially successful, it’s friendly enough for us to want it in our lives.
Human beings aren’t friendly, in the Friendly-AI sense. If a random human acquired immense power, it would probably result in an existential catastrophe. Humans do have a better sense of human value than, say, a can-opener does; they have more power and autonomy than a can-opener, so they need fuller access to human values in order to reach similar safety levels. A superintelligent AI would require even more access to human values to reach comparable safety levels.
There is more than one sense of “friendly AI”.
If you grafted absolute power onto a human with average ethical insight, you might get absolute corruption. But what is that analogous to in AI terms? Why assume asymmetric development by default?
If you assume a top-down singleton AI with a walled-off ethics module, things look difficult. If you reverse those assumptions, FAI is already happening.