The key question is how much I trust the (hypothetical) CEV-extracting algorithm that developed the FAI to actually do what its programmers intended.
If I think it’s more reliable than my own bias-ridden thinking process, then if the FAI it produces does something that I reject—for example, starts disassembling alien civilizations to replace them with a human utopia—presumably I should be skeptical about my rejection. The most plausible interpretation of that event is that my rejection is a symptom of my cognitive biases.
Conversely, if I am not skeptical of my rejection—if I watch the FAI disassembling aliens and I say “No, this is not acceptable” and try to stop it—it follows that I don’t think the process is more reliable than my own thinking.
As I’ve said before, I suspect that an actual CEV-maximizing AI would do any number of things that quite a few humans (including me) would be horrified by, precisely because I suspect that quite a few humans (including me) would be horrified by the actual implications of their own values being maximally realized.
How exactly could you be “horrified” about that unless you were comparing some of your own values being maximally realized with some of your other values not being maximally realized?
In other words, it doesn’t make sense (doesn’t even mean anything!) to say that you would be horrified (isn’t that a bad thing?) to have your desires fulfilled (isn’t that a good thing?), unless you’re really just talking about some of your desires conflicting with some of your other desires.
Because humans’ values are not a coherent, consistent set. We execute an evolutionarily-determined grab-bag of adaptations; there is no reason to assume this grab-bag adds up to a more coherent whole than “don’t die out.” (And even that’s a teleological projection onto stuff just happening.)
If I am completely and consistently aware of what I actually value, then yes, my desires are equivalent to my values and it makes no sense to talk about satisfying one while challenging the other (modulo cases of values-in-tension, as you say, which isn’t what I’m talking about).
My experience is that people are not completely and consistently aware of what they actually value, and it would astonish me if I turned out to be the fortunate exception.
Humans frequently treat instrumental goals as though they were terminal. Indeed, I suspect that’s all we ever do.
But even if I’m wrong, and it turns out that there really are terminal values in there somewhere, I expect that most humans aren’t aware of them. And if some external system starts optimizing for those terminal values, and is willing to trade away arbitrary amounts of a merely instrumental good in exchange for the terminal good it serves as a proxy for (as well it should), we’ll experience that as emotionally unpleasant and challenging.
Solid answer, as far as I can see right now.
I see reliability and friendliness as separate questions. An AI might possess epistemic and instrumental rationality that’s superior to ours, but not share our terminal values, in which case I think it makes sense to regard it as reliable but not friendly.
That is certainly true.
But the theory here, as I understand it, is that it’s possible to build a merely reliable system—a seed AI—that determines what humanity’s CEV is and constructs a self-improving target AI that maximizes it. That target AI is Friendly if the seed AI is reliable, and not otherwise.
So they aren’t entirely separable questions, either.
Updated and upvoted, but if you find yourself horrified by the AI’s actions, I think that would be fairly strong evidence that the AI had not been sufficiently reliable in extrapolating your values.
Sure.
But one could argue that I ought not run such a seed AI in the first place until my confidence in its reliability was so high that even updating on that evidence would not be enough to make me distrust the target AI. (Certainly, I think EY would argue that.)
It seems analogous to the question of when I should doubt my own senses. There is some theoretical sense in which I should never do that: since the vast majority of my beliefs about the world are derived from my senses, it follows that when my beliefs contradict my senses I should trust my senses and doubt my beliefs. And in practice, that seems like the right thing to do most of the time.
But there are situations where the proper response to a perception is to doubt that its referent exists… to think “Yes, I’m seeing X, but no, X probably is not actually there to be seen.” They are rare, but recognizing them when they occur is important. (I’ve encountered this seriously only once in my life, shortly after my stroke, and successfully doubting it was… challenging.)
Similarly, there are situations where the proper response to a moral judgment is to doubt the moral intuitions on which it is based… to think “Yes, I’m horrified by X, but no, X probably is not actually horrible.”
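To put a toy number on “updating on that evidence”: here is a minimal sketch (in Python, with entirely made-up probabilities) of the kind of Bayesian update I have in mind, showing how high my prior confidence in the seed AI would have to be for my horror not to flip my overall judgment of the target AI.

```python
# Toy Bayesian update: how much should finding myself horrified by the
# target AI's behavior lower my confidence that the seed AI reliably
# extracted my values? All probabilities below are made up.

def posterior_reliable(prior, p_horror_if_reliable, p_horror_if_unreliable):
    """P(seed AI was reliable | I am horrified), via Bayes' rule on odds."""
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = p_horror_if_reliable / p_horror_if_unreliable
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Suppose horror is 100x more likely if the extraction actually failed.
for prior in (0.999, 0.9999):
    post = posterior_reliable(prior, p_horror_if_reliable=0.01,
                              p_horror_if_unreliable=1.0)
    print(f"prior {prior} -> posterior {post:.3f}")

# prior 0.999  -> posterior ~0.909: I should now seriously distrust it.
# prior 0.9999 -> posterior ~0.990: only around this level of prior
# confidence does the evidence leave my trust mostly intact.
```

The specific numbers don’t matter; the point is just that the likelihood ratio of “I am horrified” under “unreliable” versus “reliable” has to be overwhelmed by my prior for my trust in the target AI to survive the observation.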
Agreed, but if you do have very high confidence that you’ve made the AI reliable, and also a fairly reasoned view of your own utility function, I think you should be able to predict in advance with reasonable confidence that you won’t find yourself horrified by whatever it does. And I predict that if an AI subsumed intelligent aliens and subjected them to something they considered a terrible fate, I would be horrified.
Please elaborate! It sounds interesting and it would be useful to hear how you were able to identify such a situation and successfully doubt your senses.
I’m not prepared to tell that story in its entirety here, though I appreciate your interest.
The short form is that I suffered significant brain damage and was intermittently delirious for the better part of a week, in the course of which I experienced both sensory hallucinations and a variety of cognitive failures.
The most striking of these had a fairly standard “call to prophecy” narrative, with the usual overtones of Great Significance and Presence and so forth.
Doubting it mostly just boiled down to asking the question “Is it more likely that my experiences are isomorphic to external events, or that they aren’t?” The answer to that question wasn’t particularly ambiguous, under the circumstances.
The hard part was honestly asking that question, and being willing to focus on it carefully enough to arrive at an answer when my brain was running on square wheels, and being willing to accept the answer when it required rejecting some emotionally potent experiences.