Citation? This is commonly asserted by AI risk proponents, but I’m not sure I believe it. My best friend’s values are slightly misaligned relative to my own, but if my best friend became superintelligent, that seems to me like it’d be a pretty good outcome.
I’m familiar with lots of the things Eliezer Yudkowsky has said about AI. That doesn’t mean I agree with them. Less Wrong has an unfortunate culture of not discussing topics once the Great Teacher has made a pronouncement.
Plus, I don’t think philosophytorres’ claim is obvious even if you accept Yudkowsky’s arguments.
Fragility of value thesis. Getting a goal system 90% right does not give you 90% of the value, any more than correctly dialing 9 out of 10 digits of my phone number will connect you to somebody who’s 90% similar to Eliezer Yudkowsky. There are multiple dimensions for which eliminating that dimension of value would eliminate almost all value from the future. For example an alien species which shared almost all of human value except that their parameter setting for “boredom” was much lower, might devote most of their computational power to replaying a single peak, optimal experience over and over again with slightly different pixel colors (or the equivalent thereof). Friendly AI is more like a satisficing threshold than something where we’re trying to eke out successive 10% improvements. See: Yudkowsky (2009, 2011).
OK, so do my best friend’s values constitute a 90% match? A 99.9% match? Do they pass the satisficing threshold?
Also, Eliezer’s boredom-free scenario sounds like a pretty good outcome to me, all things considered. If an AGI modified me so I could no longer get bored, and then replayed a peak experience for me for millions of years, I’d consider that a positive singularity. Certainly not a “catastrophe” in the sense that an earthquake is a catastrophe. (Well, perhaps a catastrophe of opportunity cost, but basically every outcome is a catastrophe of opportunity cost on a long enough timescale, so that’s not a very interesting objection.) The utility function is not up for grabs—I am the expert on my values, not the Great Teacher.
A common reaction to first encountering the problem statement of Friendly AI (“Ensure that the creation of a generally intelligent, self-improving, eventually superintelligent system realizes a positive outcome”) is to propose a single moral value which allegedly suffices; or to reject the problem by replying that “constraining” our creations is undesirable or unnecessary. This paper makes the case that a criterion for describing a “positive outcome,” despite the shortness of the English phrase, contains considerable complexity hidden from us by our own thought processes, which only search positive-value parts of the action space, and implicitly think as if code is interpreted by an anthropomorphic ghost-in-the-machine. Abandoning inheritance from human value (at least as a basis for renormalizing to reflective equilibria) will yield futures worthless even from the standpoint of AGI researchers who consider themselves to have cosmopolitan values not tied to the exact forms or desires of humanity.
It sounds to me like Eliezer’s point is more about the complexity of values, not the need to prevent slight misalignment. In other words, Eliezer seems to argue here that a naively programmed definition of “positive value” constitutes a gross misalignment, NOT that a slight misalignment constitutes a catastrophic outcome.
I think a small error inside a single value description could lead to a bad result, but that is not the case if we have a list of independent values.
In the phone example, if I get one digit of someone's number wrong, I will not reach 90 percent of him; but if I lose one number from my phone book, the book is still 90 percent intact.
Humans tend to have many somewhat independent values: someone may like fishing, snorkeling, girls, clouds, etc. If he lost one of them, it would not be a big deal. He would still be almost the same person, and this happens all the time with real humans, whose predispositions can change overnight.
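The phone-number and phone-book analogies differ in one assumption: whether value is conjunctive (everything must be right or you get nothing) or roughly additive over independent parts. A minimal Python sketch makes the contrast concrete. It assumes a ten-digit number and a ten-entry phone book, both invented for illustration, and the scoring functions are hypothetical toy models, not a formalization of anyone's actual values.

```python
"""Toy contrast between a conjunctive and an additive model of value.

Hypothetical illustration only: the numbers, names, and scoring rules
below are assumptions made for this sketch.
"""


def digit_accuracy(target: str, dialed: str) -> float:
    """Fraction of digit positions that were dialed correctly."""
    return sum(a == b for a, b in zip(target, dialed)) / len(target)


def phone_number_value(target: str, dialed: str) -> float:
    """Conjunctive model: you reach the intended person only if every
    digit matches, so 9 of 10 correct digits is worth nothing."""
    return 1.0 if dialed == target else 0.0


def phone_book_value(book: dict[str, str], damaged: dict[str, str]) -> float:
    """Additive model: each entry is an independent piece of value, so
    losing one entry only loses that entry's share of the total."""
    intact = sum(1 for name, number in book.items() if damaged.get(name) == number)
    return intact / len(book)


if __name__ == "__main__":
    target = "5550123456"
    dialed = "5550123457"  # last digit wrong: 9 of 10 digits correct
    print(digit_accuracy(target, dialed))      # 0.9
    print(phone_number_value(target, dialed))  # 0.0

    # A hypothetical ten-entry phone book with one entry lost.
    book = {f"friend{i}": f"555000{i:04d}" for i in range(10)}
    damaged = dict(book)
    del damaged["friend3"]
    print(phone_book_value(book, damaged))     # 0.9
```

The sketch does not settle which model fits human values better; that is the point in dispute. It only shows that the two analogies give opposite answers to the question of whether getting something "90% right" preserves about 90% of the value.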
I highly recommend reading this.
From here.
Here’s the abstract from his 2011 paper:
Please think critically.