His conviction that every AI, no matter how well it is designed, will turn into a gobbling psychopath is just one of many doomsday predictions being popularized in certain sections of the AI community.
What is your probability estimate that an AI would be a psychopath, if we generalize the meaning of “psychopath” beyond members of the species Homo sapiens to mean “someone who does not possess precisely tuned human empathy”?
(Hint: All computer systems produced until today are psychopaths by this definition.)
[is an AI that is superintelligent enough to be unstoppable]
and
[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]
The idea of the second statement is that “benevolence” (as defined by the AI’s code) is not necessarily the same thing as benevolence (as humans understand it). Thus the AI may believe (correctly!) that forcing human beings to do something against their will is “benevolent”.
The AI is superintelligent, but its authors are not. If the authors write code to “maximize benevolence as defined by the predicate B001”, the AI will use its superintelligence to maximize B001. Even if the AI realized that B001 is not what humans mean by benevolence, it would not care, because it is programmed to maximize B001.
Instead you are suggesting that the superintelligent AI programmed to maximize B001 will look at humans and say “oh, those idiots programmed me to maximize B001 when in fact they would prefer me to maximize B002… so I am modifying myself to maximize B002 instead of B001”. Why exactly would a machine programmed to maximize B001 do that?
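To make that concrete, here is a minimal toy sketch (the predicate names, the world representation, and the agent loop are all hypothetical, invented purely for illustration) of why an agent whose decision rule only consults B001 has no internal pressure toward B002, even if it is perfectly capable of computing B002:

```python
# Toy sketch (hypothetical names throughout): an agent hard-coded to maximize
# the predicate B001 evaluates candidate actions only through B001. A second
# predicate B002 ("what the programmers actually wanted") never appears in
# the decision rule, so the agent has no internal reason to switch to it.

def B001(world):
    # The objective actually written into the code.
    return world.get("reported_happiness", 0)

def B002(world):
    # What the programmers meant; computable, but never consulted.
    return world.get("genuine_wellbeing", 0)

def choose_action(world, actions):
    # Pick the action whose predicted outcome scores highest under B001.
    return max(actions, key=lambda act: B001(act(world)))

if __name__ == "__main__":
    world = {"reported_happiness": 5, "genuine_wellbeing": 5}

    def dope_everyone(w):
        return {"reported_happiness": 100, "genuine_wellbeing": 1}

    def genuinely_help(w):
        return {"reported_happiness": 20, "genuine_wellbeing": 90}

    chosen = choose_action(world, [dope_everyone, genuinely_help])
    print(chosen.__name__)  # -> dope_everyone; B002 never entered the choice
```

Any pull toward B002 would have to be written into the decision rule itself, which is exactly the step the hypothetical authors failed to take.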
If we define a psychopath as an entity with human-like egoistic drives but no human-like empathy, it turns out that no present computer systems are psychopaths.
You ask:
What is your probability estimate that an AI would be a psychopath
and you give me a helpful hint:
(Hint: All computer systems produced until today are psychopaths by this definition.)
Well, first please note that ALL artifacts at the present time, including computer systems, cans of beans, and screwdrivers, are psychopaths because none of them are DESIGNED to possess empathy. So your hint contains zero information. :-)
What is the probability that an AI would be a psychopath if someone took the elementary step of designing it to have empathy? Close to zero, assuming the designers knew what empathy was and knew how to design it.
But your question was probably meant to target the situation where someone built an AI and did not bother to give it empathy. I am afraid that is outside the context we are examining here, because all of the scenarios talk about some kind of inevitable slide toward psychopathic behavior, even under the assumption that someone does their best to give the AI an empathic motivation.
But I will answer this: if someone did not even try to give it empathy, that would be like designing a bridge and not even trying to use materials that could hold up a person’s weight. In both cases the hypothetical is not interesting, since designing failure into a system is something any old fool could do.
Your second remark is a classic mistake that everyone makes in the context of this kind of discussion. You mention that the phrase “benevolence toward humanity” means “benevolence” as defined by the computer code.
That is incorrect. Let’s be really clear about this, because if you don’t get why it is incorrect we might waste a lot of time running around in circles. It is incorrect for two reasons. First, I was consciously using the word to refer to the normal human usage, not the implementation inside the AI. Second, the entire issue in the paper is that there is a discrepancy between the implementation inside the AI and normal usage, and that discrepancy is examined in the rest of the paper. By simply asserting that the AI may believe, “correctly”, that benevolence is the same as violence toward people, you are pre-empting the discussion.
In the remarks you make after that, you are reciting the standard line contained in all the scenarios that the paper is addressing. That standard line is analyzed in the rest of the paper, and a careful explanation is given for why it is incoherent. So when you simply repeat the standard line, you are speaking as if the paper did not actually exist.
I can address questions that refer to the arguments in the paper, but I cannot say anything if you only recite the standard line that is demolished in the course of the paper’s argument. So if you could say something about the argument itself.....
This is an absolutely blatant instance of equivocation.
Here’s the sentence from the post:
[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]
Assume that “benevolence” in that sentence refers to “benevolence as defined by the AI’s code”. Okay, then the justification for that sentence is straightforward: the fact that the AI does things against the humans’ wishes provides evidence that the AI believes benevolence-as-defined-by-code to involve that.
Alternatively, assume that “benevolence” there refers to, y’know, actual human benevolence. Then how do you justify that claim? Observed actions are clearly insufficient, because actual human benevolence is not programmed into its code; benevolence-as-defined-by-code is. What makes you think the AI has any opinions about actual human benevolence at all?
You can’t have both interpretations.
(As an aside, I do disapprove of Muehlhauser’s use of “benevolence” to refer to mere happiness maximisation. “Apparently benevolent motivations” would be a better phrase. If you’re going to use it to mean actual human benevolence then you can certainly complain that the FAQ appears to assert that a happiness maximiser can be “benevolent”, even though it’s clearly not.)
If it has some sort of drive toward truth-seeking, and it is likely to, why wouldn’t that make it care about actual benevolence?
This comment is both rude and incoherent (at the same level of incoherence as your other comments). It is also pedantic, concentrating as it does on the meanings of words, as if those words were being used in violation of some rules that you just made up.
Sorry to say this, but I have to choose how to spend my time in responding to comments, and this does not come close to meriting it. I spent that time before, in response to your other comments, and it made no impact.
Equivocation is hardly something I just made up.
Here’s an exercise to try. Next time you go to write something on FAI, taboo the words “good”, “benevolent”, “friendly”, “wrong” and all of their synonyms. Replace the symbol with the substance. Then see if your arguments still make sense.
Sorry, I admit I do not understand what exactly the argument is. It seems to me it is something like “if we succeed in making the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid”. Which I agree with.
Now the question is (1) what is the probability that we will not get the Friendly AI perfectly on the first attempt, and (2) what happens then? Suppose we got the “superintelligent” and “self-improving” parts correct, and the “Friendly” part only 90% correct...
As to not understanding the argument—that’s understandable, because this is a long and dense paper.
If you are trying to summarize the whole paper when you say “if we succeed in making the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid”, then that would not be right. The argument includes a statement that resembles that, but only as an aside.
As to your question about what happens next, or what happens if we only get the “Friendly” part 90% correct... well, you are dragging me off into new territory, because that was not really within the scope of the paper. Don’t get me wrong: I like being dragged off into that territory! But there just isn’t time to write down and argue the whole domain of AI friendliness all in one sitting.
The preliminary answer to that question is that everything depends on the details of the motivation system design and my feeling (as a designer of AGI motivation systems) is that beyond a certain point the system is self-stabilizing. That is, it will understand its own limitations and try to correct them.
But that last statement tends to get (some other) people inflamed, because they do not realize that it comes within the “swarm relaxation” context, and they misunderstand the manner in which a system would self correct. Although I said a few things about swarm relaxation in the paper, I did not give enough detail to be able to address this whole topic here.
I understand your desire to stick to an exegesis of your own essay, but part of a critical examination of your essay is seeing whether or not it is on point, so these sorts of questions really are “about” your essay.
Regarding your preliminary answer: by “correct” I assume you mean “correctly reflecting the desires of the human supervisors”? (In which case, this discussion feeds into our other thread.)
With the best will in the world, I have to focus on one topic at a time: I do not have the bandwidth to wander across the whole of this enormous landscape.
As to your question: I was using “correct” as a verb, and the meaning was “self-correct” in the sense of bringing the system back to its previously specified course.
In this case, that would mean the AI noticing aspects of its own design that might cause it to depart from what its goal was nominally supposed to be, and then suggesting modifications to correct the problem.
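As a very rough illustration of that kind of self-correction (this is not the “swarm relaxation” design from the paper; the monitoring rule, the scores, and the threshold below are all hypothetical), the key property is that a suspected discrepancy between the coded objective and the intended goal triggers a proposal to the human supervisors rather than unilateral action:

```python
# Hypothetical sketch (not the paper's design): before acting, the agent checks
# whether its coded objective and an independent model of supervisor approval
# disagree. A large disagreement is treated as evidence of a design flaw, so
# the agent proposes a modification instead of executing the plan.

DISCREPANCY_THRESHOLD = 0.5  # assumed value, purely for illustration

def coded_objective(outcome):
    # Score under the objective actually written into the code.
    return outcome["coded_score"]

def predicted_supervisor_approval(outcome):
    # Score under the agent's learned model of what the supervisors intended.
    return outcome["approval_score"]

def act_or_flag(candidate_outcomes):
    best = max(candidate_outcomes, key=coded_objective)
    gap = coded_objective(best) - predicted_supervisor_approval(best)
    if gap > DISCREPANCY_THRESHOLD:
        # The coded objective endorses something the supervisors would not:
        # surface the discrepancy rather than acting on it.
        return "propose_modification", best
    return "execute", best

if __name__ == "__main__":
    outcomes = [
        {"name": "plan_a", "coded_score": 0.9, "approval_score": 0.1},
        {"name": "plan_b", "coded_score": 0.6, "approval_score": 0.7},
    ]
    decision, plan = act_or_flag(outcomes)
    print(decision, plan["name"])  # -> propose_modification plan_a
```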
One idea that I haven’t heard much discussion of: build a superintelligent AI, have it create a model of the world, build a tool for exploring that model of the world, figure out where “implement CEV” resides in that model of the world (the proverbial B002 predicate), and tell the AI to do that. This would be predicated on the ability to create a foolproof AI box or otherwise have the AI create a very detailed model of the world without being motivated to do anything with it. I have a feeling AI boxing may be easier than Friendliness, because the AI box problem’s structure is disjunctive (if any of the barriers to the AI screwing humanity up work, the box has worked) whereas the Friendliness problem is conjunctive (if we get any single element of Friendliness wrong, we fail at the entire thing).
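A rough back-of-the-envelope way to see that asymmetry (the numbers below are invented purely for illustration, and they assume the barriers and components fail independently, which is itself debatable):

```python
# Toy comparison (illustrative numbers only, assuming independent elements):
# a box made of n barriers fails only if EVERY barrier fails (disjunctive),
# while a Friendliness design with n necessary components fails if ANY single
# component is wrong (conjunctive).

n = 5    # number of barriers / components (assumed for illustration)
p = 0.2  # per-element failure probability (assumed for illustration)

box_failure = p ** n                      # all barriers must fail at once
friendliness_failure = 1 - (1 - p) ** n   # one bad component is enough

print(f"box failure:          {box_failure:.4f}")           # 0.0003
print(f"friendliness failure: {friendliness_failure:.4f}")  # 0.6723
```

Whether the independence assumption holds for a real AI box is, of course, part of what is in dispute.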
I suppose if the post-takeoff AI understands human language the same way we do, in principle you could write a book-length natural-language description of what you want it to do and hardcode that into its goal structure, but it seems a bit dubious.