As for concrete failure scenarios, yes—that will be the point of that chapter.
As for a computational procedure that does better, probably not. That is beyond the scope of this book. The book will be too long merely covering the ground that it does. Detailed alternative proposals will have to come after I have laid this groundwork—for myself as much as for others. However, I’m not convinced at all that CEV is a failed project, and that an alternative is needed.
Can you give me one quick sentence on a concrete failure mode of CEV?
I’m confused by your asking such questions. Roko’s basilisk is a failure mode of CEV. I’m not aware of any work by you or other SIAI people that addresses it, never mind work that would prove the absence of other, yet undiscovered “creative” flaws.
Roko’s original proposed basilisk is not and never was the problem in Roko’s post. I don’t expect it to be part of CEV, and it would be caught by generic procedures meant to prevent CEV from running if 80% of humanity turns out to be selfish bastards, like the Last Jury procedure (as renamed by Bostrom) or extrapolating a weighted donor CEV with a binary veto over the whole procedure.
EDIT: I affirm all of Nesov’s answers (that I’ve seen so far) in the threads below.
wedrifid is right: if you’re now counting on failsafes to stop CEV from doing the wrong thing, that means you could apply the same procedures to any other proposed AI, so the real value of your life’s work is in the failsafe, not in CEV. What happened to all your clever arguments saying you can’t put external chains on an AI? I just don’t understand this at all.
Any given FAI design can turn out to be unable to do the right thing, which corresponds to tripping failsafes, but to be an FAI it must also be potentially capable (for all we know) of doing the right thing. An adequate failsafe should just turn off an ordinary AGI immediately, so it won’t work as an AI-in-chains FAI solution. You can’t make an AI do the right thing just by adding failsafes; you also need to have a chance of winning.
Affirmed.
Since my name was mentioned I had better confirm that I generally agree with your point but would have left out this sentence:
What happened to all your clever arguments saying you can’t put external chains on an AI?
I don’t disagree with the principle of having a failsafe—and don’t think it is incompatible with the aforementioned clever arguments. But I do agree that “but there is a failsafe” is an utterly abysmal argument in favour of preferring CEV over an alternative AI goal system.
Tell me about it. With most people, if they keep asking the same question when the answer is staring them in the face, and then act oblivious as it is told to them repeatedly, I dismiss them in short order as either disingenuous or (possibly selectively) stupid. But, to borrow wisdom from HP:MoR:
…that just doesn’t sound like Eliezer’s style.
…but you can only think that thought so many times, before you start to wonder about the trustworthiness of that whole ‘style’ concept.
Is the Last Jury written up anywhere? It’s not in the draft manuscript I have.
I assume Last Jury is just the Last Judge from CEV but with majority voting among n Last Judges.
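If that reading is right, the mechanism is simple enough to sketch. The snippet below is purely illustrative, and everything about the interface is my own assumption: the judge callables, the jury size, and the veto threshold are invented for the example, not anything specified in the CEV or Last Jury write-ups.

```python
from typing import Callable, List, Optional

Outcome = dict  # stand-in for whatever the extrapolation would actually produce
Judge = Callable[[Outcome], bool]  # returns True to veto the outcome

def last_jury_allows(outcome: Outcome, judges: List[Judge], veto_threshold: float = 0.5) -> bool:
    """Illustrative Last Jury: each judge independently inspects the extrapolated
    outcome and votes to veto or allow; the run is aborted if the vetoing
    fraction reaches the threshold."""
    vetoes = sum(1 for judge in judges if judge(outcome))
    return vetoes / len(judges) < veto_threshold

def run_cev(extrapolate: Callable[[], Outcome], judges: List[Judge]) -> Optional[Outcome]:
    outcome = extrapolate()                  # the (hypothetical) CEV computation
    if not last_jury_allows(outcome, judges):
        return None                          # failsafe trips: controlled shutdown, nothing happens
    return outcome                           # otherwise the outcome proceeds

# Toy usage: twelve judges who each veto any outcome benefiting fewer than 80% of people.
judges = [lambda o: o["fraction_benefiting"] < 0.8 for _ in range(12)]
print(run_cev(lambda: {"fraction_benefiting": 0.95}, judges))
```

Note that in this sketch the jury can only abort, never steer, which is why it functions as a failsafe rather than as an AI-in-chains goal system.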
I too am confused by your asking of such questions. Your own “80% of humanity turns out to be selfish bastards” gives a pretty good general answer to the question already.
“But we will not run it if it is bad” seems like it could be used to reply to just about anything. Sure, it is good to have safety measures no matter what you are doing, but not running it doesn’t make CEV desirable.
I’m completely confused now. I thought CEV was right by definition? If “80% of humanity turns out to be selfish bastards” then it will extrapolate on that. If we start to cherry pick certain outcomes according to our current perception, why run CEV at all?
No, CEV<XiXiDu> is right by definition. When CEV is used as shorthand for “the coherent extrapolated volitions of all of humanity”, as is the case there, then it is quite probably not right at all, because many humans, to put it extremely politely, have preferences that are distinctly different to what I would call ‘right’.
Yes, that would be pointless; it would be far better to compare the outcomes to CEV<group_I_identify_with_sufficiently> (and then just use the latter!). The purpose of doing CEV at all is for signalling and cooperation.
Before or after extrapolation? If the former then why does that matter, if the latter then how do you know?
The former, inasmuch as it allows inferences about the latter. I don’t need to know with any particular confidence for the purposes of the point. The point was to illustrate possible (and overwhelmingly obvious) failure modes.
Hoping that CEV is desirable rather than outright unfriendly isn’t a particularly good reason to consider it. It is going to result in outcomes that are worse, from the perspective of whoever is running the GAI, than CEV<themselves> and CEV<a_group_they_identify_with> would be.
The purpose of doing CEV at all is for signalling and cooperation (or, possibly, outright confusion).
Do you mean it is simply an SIAI marketing strategy and that it is not what they are actually going to do?
Signalling and cooperation can include actual behavior.
CEV is not right by definition; it’s only well-defined given certain assumptions that can fail. It should be designed so that if it doesn’t shut down, then it’s probably right.
Sincere question: Why would “80% of humanity turns out to be selfish bastards” violate one of those assumptions? Is the problem the “selfish bastard” part? Or is it that the “80%” part implies less homogeneity among humans than CEV assumes?
It would certainly seem that 80% of humanity turning out to be selfish bastards is compatible with CEV being well defined, but not with being ‘right’. This does not technically contradict anything in the grandparent (which is why I didn’t reply with the same question myself). It does, perhaps, go against the theme of Nesov’s comments.
Basically, and as you suggest, either it must be acknowledged that ‘not well defined’ and ‘possibly evil’ are two entirely different problems or something that amounts to ‘humans do not want things that suck’ must be one of the assumptions.
I suppose you have to comprehend Yudkowsky’s metaethics to understand that sentence. I still don’t get what kind of ‘right’ people are talking about.
Very similar to your ‘right’, for all practical purposes, with a slight difference in how it is described. You describe (if I recall) ‘right’ as being “in accordance with XiXiDu’s preferences”. Using Eliezer’s style of terminology, you would instead describe ‘right’ as more like a photograph of what XiXiDu’s preferences are, without it necessarily including any explicit reference to XiXiDu.
In most cases it doesn’t really matter. It starts to matter once people start saying things like “But what if XiXiDu could take a pill that made him prefer that he eat babies? Would that mean it became right? Should XiXiDu take the pill?”
By the way, ‘right’ would also mean what the photo looks like after it has been airbrushed a bit in Photoshop by an agent better at understanding what we actually want than we are at introspection and communication. So it’s an abstract representation of what you would want if you were smarter and more rational but still had your preferences.
Also note that Eliezer sometimes blurs the line between ‘right’ meaning what he would want and what some abstract “all of humanity” would want.
In the case where the assumptions fail and CEV ceases to be predictably good, safety measures shut it down, so nothing happens. In the case where the assumptions hold, it works. As a result, CEV has good expected utility, and gives us a chance to try again with a different design if it fails.
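One way to spell that argument out (my framing, not anything from the thread or the CEV document): let $p$ be the probability that the assumptions hold, and suppose the failsafe really does reduce every failure case to the do-nothing outcome. Then

$$\mathbb{E}\big[U(\text{CEV} + \text{failsafe})\big] \;=\; p\,U(\text{works}) + (1 - p)\,U(\text{nothing}) \;\ge\; U(\text{nothing}),$$

with strict inequality whenever $p > 0$ and $U(\text{works}) > U(\text{nothing})$. Everything rides on the supposition that the failsafe actually maps failures onto “nothing happens”, which is exactly the part the replies below are probing.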
This does not seem to weaken the position you quoted in any way.
Failsafe measures are a great idea. They just don’t do anything to privilege CEV + failsafe over anything_else + failsafe.
Yes. They make sure that [CEV + failsafe] is not worse than not running any AIs. Uncertainty about whether CEV works makes expected [CEV + failsafe] significantly better than doing nothing. Presence of potential controlled shutdown scenarios doesn’t argue for worthlessness of the attempt, even where detailed awareness of these scenarios could be used to improve the plan.
I’m actually not even sure whether you are trying to disagree with me or not but once again, in case you are, nothing here weakens my position.
“Not running it” does make [CEV + failsafe] desirable as compared to doing nothing, even in the face of problems with [CEV], and nobody is going to run just [CEV]. So most arguments for the presence of problems in CEV, if they are met with adequate failsafe specifications (which is far from a template for replying to anything; failsafes are not easy), do indeed lose a lot of traction. Besides, what are those arguments supposed to be arguments for? One needs a suggestion for improvement, and failsafes are intended to make it so that doing nothing is not an improvement, even though improvements over any given state of the plan would be dandy.
Yes, this is trivially true and not currently disputed by anyone here. Nobody is suggesting doing nothing. Doing nothing is crazy.
Of course, Roko did not originally propose a basilisk at all. Just a novel solution to an obscure game theory problem.
From your current perspective. But also given your extrapolated volition? If it is, then it won’t happen.
ETA The above was confusing and unclear. I don’t believe that one person can change the course of CEV. I rather meant to ask if he believes that it would be a failure mode even if it was the correct extrapolated volition of humanity.
If CEV has a serious bug, it won’t correctly implement anyone’s volition, and so someone’s volition saying that CEV shouldn’t have that bug won’t help.
Never mind, upvoted your comment. I wrote “then it won’t happen”. That was wrong, I don’t actually believe that. I meant to ask something different. Edited the comment to add a clarification.
Obviously. A bug would be the inability to extrapolate volition correctly, not a certain outcome that is based on the correct extrapolated volition. So what did cousin_it mean by saying that outcome X is a failure mode? Does he mean that from his current perspective he doesn’t like outcome X or that outcome X would imply a bug in the process of extrapolating volition? (ETA I’m talking about CEV-humanity and not CEV-cousin-it. There would be no difference in the latter case.)
Not until I get to that part of the writing and research, no.
That is, I’m applying your advice to hold off on proposing solutions until the problem has been discussed as thoroughly as possible without suggesting any.
Has this been published anywhere yet?
A related thing that has since been published is Ideal Advisor Theories and Personal CEV.
I have no plans to write the book; see instead Bostrom’s far superior Superintelligence, forthcoming.
Extrapolated humanity decides that the best possible outcome is to become the Affront. Now, if the FAI put everyone in a separate VR and tricked each person into believing that they were acting all Affront-like, then everything would be great—everyone would be content. However, people don’t just want the experience of being the Affront—everyone agrees that they want to be truly interacting with other sentiences, who will often feel the brunt of each other’s coercive actions.
Original version of grandparent contained, before I deleted it, “Besides the usual ‘Eating babies is wrong, what if CEV outputs eating babies, therefore a better solution is CEV plus code that outlaws eating babies.’”
I have never understood what is wrong with the amnesia-holodecking scenario. (Is there a proper name for this?)
If you want to, say, stop people from starving to death, would you be satisfied with being plopped on a holodeck with images of non-starving people? If so, then your stop-people-from-starving-to-death desire is not a desire to optimize reality into a smaller set of possible world-states, but simply a desire to have a set of sensations so that you believe starvation does not exist. The two are really different.
If you don’t understand what I’m saying, the first two paragraphs of this comment might explain it better.
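To make that distinction concrete, here is a minimal sketch (the class and the numbers are hypothetical, purely illustrative): the holodeck fully satisfies a preference defined over the agent’s own sense-data, but it leaves a preference defined over actual world-states exactly as unsatisfied as before.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    starving_people: int      # the actual state of the world
    perceived_starving: int   # what the agent's senses report

def world_state_utility(s: Scenario) -> float:
    """Cares about reality: fewer actual starving people is better."""
    return -s.starving_people

def sensation_utility(s: Scenario) -> float:
    """Cares only about sense-data: fewer perceived starving people is better."""
    return -s.perceived_starving

reality = Scenario(starving_people=1_000_000, perceived_starving=1_000_000)
holodeck = Scenario(starving_people=1_000_000, perceived_starving=0)

# The holodeck is a huge improvement for the sensation-based preference...
assert sensation_utility(holodeck) > sensation_utility(reality)
# ...and no improvement at all for the world-state-based preference.
assert world_state_utility(holodeck) == world_state_utility(reality)
```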
Thanks for clarifying. I guess I’m evil. It’s a good thing to know about oneself.
Uh, that was a joke, right?
no.
What definition of evil are you using? I’m having trouble understanding why (how?) you would declare yourself evil, especially evil_nazgulnarsil.
I don’t care about suffering independently of my sensory perception of it causing me distress.
Oh. In that case, it might be more precise to say that your utility function does not assign positive or negative utility to the suffering of others (if I’m interpreting your statement correctly). However, I’m curious about whether this statement holds true for you at extremes, so here’s a hypothetical.
I’m going to assume that you like ice cream. If you don’t like any sort of ice cream, substitute in a certain quantity of your favorite cookie. If you could get a scoop of ice cream (or a cookie) for free at the cost of a million babies’ thumbs being cut off, would you take the ice cream/cookie?
If not, then you assign a non-zero utility to others’ suffering, so it might be true that you care very little, but it’s not true that you don’t care at all.
I think you misunderstand slightly. Sensory experience includes having the idea communicated to me that my action is causing suffering. I assign negative utility to others’ suffering in real life because the thought of such suffering is unpleasant.
Alright. Would you take the offer if Omega promised to remove your memory of agreeing to have a million babies’ thumbs cut off for a scoop of ice cream right after you made the agreement, so you could enjoy your ice cream without guilt?
No, at the time of the decision I have the sensory experience of having been the cause of suffering.
I don’t feel responsibility to those who suffer, in the sense that I would choose to holodeck myself rather than stay in reality and try to fix problems. This does not mean that I will cause suffering on purpose.
A better hypothetical dilemma might be one where I could ONLY get access to the holodeck by causing others to suffer (Cypher from The Matrix).
Okay, so you would feel worse if you had caused people the same amount of suffering than you would if someone else had done so?
yes
Mmkay. I would say that our utility functions are pretty different, in that case, since, with regard to suffering, I value world-states according to how much suffering they contain, not according to who causes the suffering.
Well, it’s essentially equivalent to wireheading.
which I also plan to do if everything goes tits-up.
Dorikka,
I don’t understand this. If the singleton’s utility function was written such that its highest value was for humans to become the Affront, then making it the case that humans believed they were the Affront while not being the Affront would not satisfy the utility function. So why would the singleton do such a thing?
I don’t think that my brain was working optimally at 1am last night.
My first point was that our CEV might decide to go Baby-Eater, and so the FAI should treat the caring-about-the-real-world-state part of its utility function as a mere preference (like chocolate ice cream), and pop humanity into a nicely designed VR (though I didn’t have the precision of thought necessary to put it into such language). However, it’s pretty absurd for us to be telling our CEV what to do, considering that they’ll have much more information than we do and much more refined thinking processes. I actually don’t think that our Last Judge should do anything more than watch for coding errors (as in, we forgot to remove known psychological biases when creating the CEV).
My second point was that the FAI should also slip us into a VR if we desire a world-state in which we defect from each other (with similar results as in the prisoner’s dilemma). However, the counterargument from point 1 also applies to this point.
Maybe you should rephrase it then to say that you’ll present some possible failure modes of CEV that will have to be taken care of rather than “objections”.
No, I’m definitely presenting objections in that chapter.