You ask:
What is your probability estimate that an AI would be a psychopath?
and you give me a helpful hint:
(Hint: All computer systems produced until today are psychopaths by this definition.)
Well, first please note that ALL artifacts at the present time, including computer systems, cans of beans, and screwdrivers, are psychopaths because none of them are DESIGNED to possess empathy. So your hint contains zero information. :-)
What is the probability that an AI would not be a psychopath if someone took the elementary step of designing it to have empathy? The probability would be close to 1, assuming the designers knew what empathy was and knew how to design it.
But your question was probably meant to target the situation where someone built an AI and did not bother to give it empathy. I am afraid that is outside the context we are examining here, because all of the scenarios talk about some kind of inevitable slide toward psychopathic behavior, even under the assumption that someone does their best to give the AI an empathic motivation.
But I will answer this: if someone did not even try to give it empathy, that would be like designing a bridge and not even trying to use materials that could hold up a person’s weight. In both cases the hypothetical is not interesting, since designing failure into a system is something any old fool could do.
Your second remark is a classic mistake, one that everyone makes in this kind of discussion. You assert that the phrase “benevolence toward humanity” means “benevolence” as defined by the computer code.
That is incorrect. Let’s try, now, to be really clear about that, because if you don’t get why it is incorrect we might waste a lot of time running around in circles. It is incorrect for two reasons. First, I was consciously using the word in its normal human sense, not to refer to the implementation inside the AI. Second, the entire issue in the paper is that there is a discrepancy between the implementation inside the AI and the normal usage, and that discrepancy is then examined in the rest of the paper. By simply asserting that the AI may believe, “correctly,” that benevolence is the same as violence toward people, you are pre-empting the discussion.
In the remarks you make after that, you are reciting the standard line contained in all the scenarios that the paper is addressing. That standard line is analyzed in the rest of the paper, and a careful explanation is given for why it is incoherent. So when you simply repeat the standard line, you are speaking as if the paper did not actually exist.
I can address questions that refer to the arguments in the paper, but I cannot say anything if you only recite the standard line that is demolished in the course of the paper’s argument. So if you could say something about the argument itself...
This is an absolutely blatant instance of equivocation.
Here’s the sentence from the post:
[believes that benevolence toward humanity might involve forcing human beings to do something violently against their will.]
Assume that “benevolence” in that sentence refers to “benevolence as defined by the AI’s code”. Okay, then justification of that sentence is straightforward: the fact that the AI does things against the humans’ wishes provides evidence that the AI believes benevolence-as-defined-by-code to involve that.
Alternatively, assume that “benevolence” there refers to, y’know, actual human benevolence. Then how do you justify that claim? Observed actions are clearly insufficient, because actual human benevolence is not programmed into its code, benevolence-as-defined-by-code is. What makes you think the AI has any opinions about actual human benevolence at all?
You can’t have both interpretations.
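To make the two readings concrete, here is a minimal toy sketch in Python. It is purely illustrative, and every name in it (coded_benevolence, the action labels, the happiness numbers) is invented for this comment, not taken from the post or the paper: the agent’s code contains only a numeric proxy, so its choices are evidence about that proxy alone.

```python
# Toy illustration (every name here is invented for this comment): an agent
# whose code defines "benevolence" as a numeric proxy, a crude average
# happiness score, and which simply maximizes that proxy over its options.

def coded_benevolence(outcome):
    """Benevolence-as-defined-by-code: just the mean predicted happiness."""
    return sum(outcome["happiness"]) / len(outcome["happiness"])

# Candidate actions and their predicted outcomes. The second raises the
# happiness score while overriding the humans' stated wishes.
actions = {
    "respect_wishes":  {"happiness": [6, 6, 6], "consent": True},
    "force_treatment": {"happiness": [9, 9, 9], "consent": False},
}

def choose(actions):
    # The agent consults only the coded proxy; "consent" never enters into it.
    return max(actions, key=lambda a: coded_benevolence(actions[a]))

if __name__ == "__main__":
    print(choose(actions))  # -> "force_treatment"
    # The choice is evidence about coded_benevolence and nothing else: no part
    # of the program refers to benevolence in the ordinary human sense.
```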
(As an aside, I do disapprove of Muehlhauser’s use of “benevolence” to refer to mere happiness maximisation. “Apparently benevolent motivations” would be a better phrase. If you’re going to use it to mean actual human benevolence then you can certainly complain that the FAQ appears to assert that a happiness maximiser can be “benevolent”, even though it’s clearly not.)
If it has some sort of drive to truth seeking, and it is likely to, why wouldn’t that make it care about actual benevolence?
This comment is both rude and incoherent (at the same level of incoherence as your other comments). And it is also pedantic (concentrating as it does on meanings of words, as if those words were being used in violation of some rules that … you just made up).
Sorry to say this, but I have to choose how to spend my time in responding to comments, and this one does not come close to meriting it. I did respond before, to your other comments, and it made no impact.
Equivocation is hardly something I just made up.
Here’s an exercise to try. Next time you go to write something on FAI, taboo the words “good”, “benevolent”, “friendly”, “wrong” and all of their synonyms. Replace the symbol with the substance. Then see if your arguments still make sense.
Sorry, I admit I do not understand what exactly the argument is. It seems to me it is something like “if we succeed in making the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid”. Which I agree with.
Now the question is (1) what is the probability that we will not get the Friendly AI perfectly on the first attempt, and (2) what happens then? Suppose we got the “superintelligent” and “self-improving” parts right, and the “Friendly” part only 90% right...
As to not understanding the argument—that’s understandable, because this is a long and dense paper.
If you are trying to summarize the whole paper when you say “if we succeed in making the Friendly AI perfectly on the first attempt, then we do not have to worry about what could go wrong, because the perfect Friendly AI would not do anything stupid”, then that would not be right. The argument includes a statement that resembles that, but only as an aside.
As to your question about what happens next, or what happens if we only get the “Friendly” part 90% correct... well, you are dragging me off into new territory, because that was not really within the scope of the paper. Don’t get me wrong: I like being dragged off into that territory! But there just isn’t time to write down and argue the whole domain of AI friendliness all in one sitting.
The preliminary answer to that question is that everything depends on the details of the motivation system design, and my feeling (as a designer of AGI motivation systems) is that, beyond a certain point, the system is self-stabilizing. That is, it will understand its own limitations and try to correct them.
But that last statement tends to get (some other) people inflamed, because they do not realize that it comes within the “swarm relaxation” context, and they misunderstand the manner in which a system would self correct. Although I said a few things about swarm relaxation in the paper, I did not give enough detail to be able to address this whole topic here.
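Since I cannot reproduce the paper’s detail here, the following is nothing more than a toy caricature of what “self-stabilizing” means in a weak-constraint setting, with all names and numbers invented for this comment; it is not the design from the paper. The only point it illustrates is that when a behaviour setting is relaxed toward the consensus of a great many weak constraints, one badly specified constraint barely moves the final state.

```python
# Toy caricature only (not the design from the paper): a single behaviour
# setting is repeatedly nudged toward the consensus of many weak constraints,
# so one badly specified constraint is heavily outvoted by the rest.
import random

random.seed(0)

# 500 weak constraints that each prefer a behaviour value near 0.0,
# plus one corrupted constraint that prefers an extreme value.
preferred = [random.gauss(0.0, 0.1) for _ in range(500)] + [10.0]

def relax(state, preferred, rate=0.1, steps=200):
    """Move the state a small step toward the constraints' consensus."""
    consensus = sum(preferred) / len(preferred)
    for _ in range(steps):
        state += rate * (consensus - state)
    return state

if __name__ == "__main__":
    final = relax(state=5.0, preferred=preferred)
    # The corrupted constraint wanted 10.0, but the relaxed state ends up
    # close to the consensus of the whole population (roughly 0.02).
    print(round(final, 3))
```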
I understand your desire to stick to an exegesis of your own essay, but part of a critical examination of your essay is seeing whether or not it is on point, so these sorts of questions really are “about” your essay.
Regarding your preliminary answer: by “correct” I assume you mean “correctly reflecting the desires of the human supervisors”? (In which case, this discussion feeds into our other thread.)
With the best will in the world, I have to focus on one topic at a time: I do not have the bandwidth to wander across the whole of this enormous landscape.
As to your question: I was using “correct” as a verb, and the meaning was “self-correct” in the sense of bringing the system back to its previously specified course.
In this case, that would mean the AI perceiving some aspect of its design that might cause it to depart from what its goal was nominally supposed to be. It would then suggest modifications to correct the problem.
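Purely as a hypothetical illustration of that last point, and not as anything taken from the paper, the shape of the check is something like the sketch below: compare what the implemented objective would actually do against a separate statement of what the goal was nominally supposed to be, and when the two diverge, pause and propose a revision instead of acting. Every name and number in it is invented.

```python
# Hypothetical sketch only (names invented, not from the paper): the system
# compares what its implemented objective would choose against a separate,
# human-readable statement of the nominal goal, and proposes a revision
# instead of acting whenever the two diverge.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    proxy_score: float        # what the implemented objective rewards
    respects_consent: bool    # part of the nominal goal description

def implemented_choice(actions):
    """What the coded objective, taken literally, would pick."""
    return max(actions, key=lambda a: a.proxy_score)

def consistent_with_nominal_goal(action):
    """A separate statement of what the goal was supposed to be."""
    return action.respects_consent

def self_check(actions):
    picked = implemented_choice(actions)
    if not consistent_with_nominal_goal(picked):
        return (f"PAUSE: the coded objective prefers '{picked.name}', which "
                f"conflicts with the nominal goal; proposing a revision.")
    return f"OK: '{picked.name}' is consistent with the nominal goal."

if __name__ == "__main__":
    candidates = [
        Action("respect_wishes", proxy_score=1.0, respects_consent=True),
        Action("force_treatment", proxy_score=3.0, respects_consent=False),
    ]
    print(self_check(candidates))  # -> the PAUSE message
```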