This has been a source of confusion to me about the theory since I first encountered it, actually.
Given that this hypothetical CEV-extracting process gets results that aren’t necessarily anything that any individual actually wants, how do we tell the difference between an actual CEV-extracting process and something that was intended as a CEV-extracting process but that, due to a couple of subtle bugs in its code, is actually producing something other than its target’s CEV?
Is the idea that humanity’s actual CEV is something that, although we can’t necessarily come up with it ourselves, is so obviously the right answer once it’s pointed out to us that we’ll all nod our heads and go “Of course!” in unison?
Or is there some other testable property that only humanity’s actual CEV (HACEV) has? What property, and how do we test for it?
Because without such a testable property, I really don’t see why we believe flipping the switch on the AI that instantiates it is at all safe.
I have visions of someone perusing the resulting CEV assembled by the seed AI and going “Um… wait. If I’m understanding this correctly, the AI you instantiate to implement CEV will cause us all to walk around with watermelons on our feet.”
“Yes,” replies the seed AI, “that’s correct. It appears that humans really would want that, given enough time to think together about their footwear preferences.”
“Oh… well, OK,” says the peruser. “If you say so...”
Surely I’m missing something?
In light of some later comment threads on related subjects, and in the absence of any direct explanations, I tentatively (20-40% confidence) conclude that the attitude is this: the process that generates the code that extracts the CEV that implements the FAI has to be perfect, in order to ensure that the FAI is perfect. That matters because even an epsilon deviation from perfection, multiplied by the potential utility of a perfect FAI, represents a huge disutility that might leave us vomiting happily on the sands of Mars.
And since testing is not a reliable process for achieving perfection, merely for reducing defects to epsilon, it seems to follow that testing simply isn’t relevant. We don’t test the CEV-generator, by this view; rather we develop it in such a way that we know it’s correct.
And once we’ve done that, we should be more willing to trust the CEV-generator’s view of what we really want than our own view (which is demonstrably unreliable).
So if it turns out to involve wearing watermelons on our feet (or living gender-segregated lives on different planets, or whatever it turns out to be) we should accept that that really is our extrapolated volition, and be grateful, even if our immediate emotional reaction is confusion, disgust, or dismay.
I hasten to add that I’m not supporting this view, just trying to understand it.