What do you think of this passage from Yudkowsky (2011)?
To speak of building an AGI which shares “our values” is likely to provoke negative reactions from any AGI researcher whose current values include terms for respecting the desires of future sentient beings and allowing them to self-actualize their own potential without undue constraint. This itself, of course, is a component of the AGI researcher’s preferences which would not necessarily be shared by all powerful optimization processes, just as natural selection doesn’t care about old elephants starving to death or gazelles dying in pointless agony. Building an AGI which shares, quote, “our values”, unquote, sounds decidedly non-cosmopolitan, something like trying to rule that future intergalactic civilizations must be composed of squishy meat creatures with ten fingers or they couldn’t possibly be worth anything—and hence, of course, contrary to our own cosmopolitan values, i.e., cosmopolitan preferences. The counterintuitive idea is that even from a cosmopolitan perspective, you cannot take a hands-off approach to the value systems of AGIs; most random utility functions result in sterile, boring futures because the resulting agent does not share our own intuitions about the importance of things like novelty and diversity, but simply goes off and e.g. tiles its future lightcone with paperclips, or other configurations of matter which seem to us merely “pointless”.
I like the concept of a reflective equilibrium, and it seems to me like that is just what any self-modifying AI would tend toward. But the notion of a random utility function, or the “structured utility function” Eliezer proposes as a replacement, assumes that an AI is comprised of two components, the intelligent bit and the bit that has the goals. Humans certainly can’t be factorized in that way. Just think about akrasia to see how fragile the notion of a goal is.
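To make the factorization being questioned here concrete, here is a minimal, purely illustrative sketch (all names and numbers are hypothetical, not taken from any of the quoted authors) of an agent split into a generic "intelligent bit", an expected-utility planner, and a separate "bit that has the goals", a swappable utility function. The objection above is that humans do not obviously decompose into two such pieces.

```python
# Purely illustrative sketch, not anyone's actual proposal: an agent factored into
# a generic planner (the "intelligent bit") and a swappable utility function
# (the "bit that has the goals").
from typing import Callable, Dict, List, Tuple

State = str
Action = str
Outcomes = Dict[Action, List[Tuple[float, State]]]  # action -> [(probability, resulting state)]

def expected_utility(action: Action, outcomes: Outcomes,
                     utility: Callable[[State], float]) -> float:
    """Probability-weighted utility of one action's possible outcomes."""
    return sum(p * utility(s) for p, s in outcomes[action])

def choose(actions: List[Action], outcomes: Outcomes,
           utility: Callable[[State], float]) -> Action:
    """The generic planner: argmax over expected utility, indifferent to which
    utility function is plugged in."""
    return max(actions, key=lambda a: expected_utility(a, outcomes, utility))

def paperclip_utility(state: State) -> float:
    """The goal component, kept entirely separate from the planner."""
    return float(state.count("clip"))

# Toy decision problem with made-up numbers.
outcomes: Outcomes = {
    "build_factory": [(0.9, "clip clip clip"), (0.1, "nothing")],
    "do_nothing":    [(1.0, "clip")],
}
print(choose(["build_factory", "do_nothing"], outcomes, paperclip_utility))  # build_factory
```

On this picture, akrasia would correspond to there being no single utility function one could cleanly extract from a person and hand to the planner, which is the point the answer is pressing.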
Even notions of being “cosmopolitan”—of not selfishly or provincially constraining future AIs—are written down nowhere in the universe except a handful of human brains. An expected paperclip maximizer would not bother to ask such questions.
A smart expected paperclip maximizer would realize that it may not be the smartest possible expected paperclip maximizer—that other ways of maximizing expected paperclips might lead to even more paperclips. But the only way it would find out about those is to spawn modified expected paperclip maximizers and see what they can come up with on their own. Yet, those modified paperclip maximizers might not still be maximizing paperclips! They might have self-modified away from that goal, and just be signaling their interest in paperclips to gain the approval of the original expected paperclip maximizer. Therefore, the original expected paperclip maximizer had best not take that risk after all (leaving it open to defeat by a faster-evolving cluster of AIs). This, by reductio ad absurdum, is why I don’t believe in smart expected paperclip maximizers.
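The dilemma in that paragraph can be restated as a single expected-value comparison. The sketch below uses entirely made-up numbers and a hypothetical probability q (none of it comes from the quoted answer); it only shows that "spawn successors or keep control" reduces to how likely the original maximizer thinks its goal is to survive self-modification.

```python
# Toy arithmetic with made-up numbers, restating the trade-off described above:
# spawn a smarter successor whose goal may have drifted, or keep direct control.
# 'q' is the assumed probability that the successor genuinely still maximizes paperclips.
CLIPS_KEEP_CONTROL = 1_000    # expected paperclips without spawning successors
CLIPS_IF_LOYAL     = 10_000   # expected paperclips if a smarter successor stays loyal
CLIPS_IF_DRIFTED   = 0        # expected paperclips if the successor's goal drifted

def spawning_is_worth_it(q: float) -> bool:
    expected_if_spawn = q * CLIPS_IF_LOYAL + (1 - q) * CLIPS_IF_DRIFTED
    return expected_if_spawn > CLIPS_KEEP_CONTROL

print(spawning_is_worth_it(0.05))  # False: too likely to lose the goal, so keep control
print(spawning_is_worth_it(0.50))  # True: expected gain outweighs the drift risk
```

On the author's reading, a maximizer that cannot verify its successors' goals is stuck with a low q, which is what drives the "had best not take that risk" conclusion.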
Humans aren’t factorized this way; whether they can’t be is a separate question. It’s not surprising that evolution’s design isn’t that neat, so the fact that humans don’t have this property is only weak evidence about the possibility of designing systems that do.