Personal identity/anticipated experience is a mechanism through which a huge chunk of preference is encoded in human minds, on an intuitive level. A lot of preference is expressed in terms of “future experience”, a concept that breaks down once it has no unique referent in the future. Whenever you copy human minds, you also copy this mechanism, which virtually guarantees a lack of reflective consistency in human preference.
Thought experiments with mind-copying effectively involve dramatically changing the agent’s values, but they don’t emphasize this point, as if it were a minor consideration. Getting past this particular implementation to the preference it represents, and so being rational in situations of mind-copying, is not something humans are wired to be able to do.
Morendil’s comment made me realize that my example is directly analogous to your Counterfactual Mugging: in that thought experiment, Omega’s coin flip splits you into two copies (in two different possible worlds), and like in my example, the rational thing to do, in human terms, is to sacrifice your own interests to help your copy. To me, this analogy indicates that it’s not mind-copying that’s causing the apparent value changes, but rather Bayesian updating.
I tend to agree with you, but I note that Eliezer disagrees.
Locating future personal experience is possible when we are talking about possible futures, and not possible when we are talking about a future containing multiple copies at the same time. Only in the second case does the mechanism for representing preference break down. The problem is not (primarily) a failure to assign preference to the right person, it’s a failure to assign it at all. Humans just get confused and don’t know what the correct preference is, and it’s not a question of being unable to shut up and calculate, since it’s not clear what the answer should be or how to find it. It is more or less the same problem as assigning value to groups of other people: should we care more when there are a lot of people at stake, or care the same about them all, but less about each of them? (“Shut up and divide”.)
In counterfactual mugging, there is a clear point of view (before the mugging/coin flip) from which preference is clearly represented, via the intermediary of future personal experience as seen from that time, so we can at least shut up and calculate. That’s not the issue I’m talking about.
While for some approaches to decision-making it might not matter whether we are talking about multiplicative indexical uncertainty or additive counterfactuals, the issue here is the concept of personal identity through which a chunk of preference is represented in the human mind. Decision theories can handle situations where personal identity doesn’t make sense, but we’d still need to get preference about those situations from somewhere, and there is no clear assignment of it.
Some questions about fine distinctions in preference aren’t ever going to be answered by humans; we don’t have the capacity to see the whole picture.
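To make the “shut up and calculate” standpoint in the counterfactual-mugging case concrete, here is a minimal sketch (my addition) evaluated from the pre-flip point of view; the payoff numbers are the ones usually quoted for that thought experiment, not anything specified in this thread.

```python
# Standard Counterfactual Mugging, scored before the coin flip: Omega asks
# for $100 on tails, and on heads pays $10,000 only to agents it predicts
# would have paid on tails.

P_HEADS = 0.5

def expected_value(pays_when_asked: bool) -> float:
    # Heads branch: rewarded only if you are the kind of agent that pays.
    heads_payoff = 10_000 if pays_when_asked else 0
    # Tails branch: you are actually asked for the $100.
    tails_payoff = -100 if pays_when_asked else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff

print(expected_value(True))   # 4950.0: the paying policy wins ex ante
print(expected_value(False))  # 0.0
```

The apparent conflict only shows up after the coin has landed tails, when a naively updating agent sees nothing but the $100 loss; the pre-flip standpoint is the one from which the calculation above is well defined.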
Which brings up the question: suppose that your values are defined in terms of an ontology which is not merely false but actually logically inconsistent, though in a way that is too subtle for you to currently grasp. Is it rational to try to learn the logical truth, and thereby lose most or all of what you value? Should we try to hedge against such a possibility when designing a friendly AI? If so, how?
Do you want to lose what you value upon learning that you were confused? More realistically, the correct preference is to adapt the past preference to something that does make sense. More generally, if you should lose that aspect of preference, it means you prefer to do so; if you shouldn’t, it means you don’t prefer to do so. Whatever the case, doing what you prefer to do upon receiving new information is in accordance with what you prefer.
This is all tautologous, but you are seeing a conflict of interest somewhere, so I don’t think you’ve made the concepts involved in the situation explicit enough to recognize the tautologies.
Preference talks about what you should do, and what you do is usually real (until you pass to the next level).
Perhaps an example will illustrate. The theist plans his life around doing God’s will: when he is presented with a persuasive argument from scripture that God’s will is for him to do X rather than Y, he will do X. Perhaps he has frequently adjusted his strategies when considering scripture, revelations (which are, in fact, hallucinations his subconscious generates), and Papal decree.
It seems that he loses a lot upon learning that God does not exist. As a matter of pure psychological fact, he will (probably) be depressed. Moreover, suppose that he holds beliefs that are mutually contradictory, but only in subtle ways; perhaps he thinks that God is in complete control of all things in the world and that God is all-loving (all good), yet the world he thinks he lives in manifestly contains a lot of suffering (the problem of theodicy).
It seems that the best thing for him is to remain ignorant of the paradox, and of his false, inconsistent and confused beliefs, and for events to transpire in a lucky way so that he never suffers serious material losses from his pathological decision-making.
Consider the claim that what a Friendly AI should do for such a person is the following: keep them unaware of the facts, and optimize within their framework of reality.
This seems to confuse stuff that happens to a human with decision theory. What happens with a human (in a human’s thoughts, etc.) can’t be “contradictory” apart from a specific interpretation that names some things “contradictory”. This interpretation isn’t fundamentally interesting for the purposes of optimizing the stuff. The ontology problem is a question about the FAI, not about a person being optimized by the FAI. For the FAI, a person is just a pattern in the environment, just like any other object, with stars and people and paperclips all fundamentally alike; the only thing that distinguishes them for the FAI is what preference says should be done in each case.
When we are talking about decision theory for FAI, especially while boxing the ontology inside the FAI, it’s not obvious how to connect that with particular interpretations of what happens in the environment, nor should we really try.
Now, speaking of people in the environment, we might say that the theist is going to feel frustrated for some time upon realizing that they were confused for a long time. However, I can’t imagine the whole process of deconversion actually being less preferable than remaining confused (especially given that, in the long run, the person will need to grow up). Even the optimal strategy is going to have identifiable negative aspects, but that only makes the strategy suboptimal if there is a better way. Also, for a lot of obvious negative aspects, such as negative emotions accompanying an otherwise desirable transition, the FAI is going to invent a way of avoiding them, if that’s desirable.
And that the person might be the source of preference. This is fairly important. But, in any case, FAI theory is only here as an intuition pump for evaluating “what would the best thing be, according to this person’s implicit preferences?”
If it is possible to have preference-like things within a fundamentally contradictory belief system, and that’s all the human in question has, then knowing about the inconsistency might be bad.
This is actually wrong. Whatever the AI starts with is its formal preference; it never changes and never depends on anything. That this formal preference was actually intended to copy an existing pattern in the environment is a statement about what sort of formal preference it is, but it is enacted the same way, in accordance with what should be done in each particular case based on what the formal preference says. Thus, the point you’ve highlighted is a special case, not an additional feature. Also, I doubt it can work this way.
True, but implicit preference is not something the person realizes to be preferable, and not something expressed in terms of the confused “ontology” believed by that person. The implicit preference is a formal object that isn’t built from fuzzy patterns interpreted in the person’s thoughts. When you speak of “contradictions” in a person’s beliefs, you are speaking at the wrong level of abstraction, as if you were discussing the parameters of a clustering algorithm as being relevant to the reliable performance of the hardware on which that algorithm runs.
A belief system can’t be “fundamentally contradictory” because it’s not “fundamental” to begin with. What do you mean by “bad”? Bad according to what? It doesn’t follow from confused thoughts that preference is somehow brittle.
A Friendly AI might also resolve the situation by presenting itself as god, eliminating suffering in the world, and then giving out genuine revelations with adequately good advice.
Eliminating the appearance of suffering in the world would probably be bad for such a theist. He spends much of his time running Church Bazaars to raise money for charity. Like many especially dedicated charity workers, he is somewhat emotionally and axiologically dependent upon the existence of the problem he is working against.
In that case, eliminate actual suffering as fast as possible, then rapidly reduce the appearance of suffering in ways calculated to make it seem like the theist’s own actions are a significant factor, and eventually substitute some other productive activity.
To get back to this point: this depends on how we understand “values”. Let’s not conceptualize values as being defined in terms of an “ontology”.
You do not lose any options by gaining more knowledge. If the optimal response when your values are defined in terms of an inconsistent ontology is to go ahead and act as if the ontology were consistent, then you can still choose to do so even once you find out the dark secret. You can only gain from knowing more.
If your values are such that they do not even allow a mechanism for creating a best-effort approximation of those values in the case of ontological enlightenment, then you are out of luck no matter what you do. Even if you explicitly value ignorance of the fact that nothing you value can have coherent value, the incoherency of your value system makes that value on ignorance meaningless too.
Build the most basic parts of the value system on an ontology that has as little chance as possible of being inconsistent. Reference to actual humans can ensure that a superintelligent FAI’s value system will be logically consistent, if it is in fact possible for a human to have a value system defined in a consistent ontology. If that is not possible, then humans are in a hopeless position. But at least I (by definition) wouldn’t care.
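One way to picture the hedge being suggested here, as a rough sketch of my own rather than anything proposed in the thread: keep the top-level goal as a pointer to actual humans rather than as a utility function hand-written in a rich (and possibly inconsistent) ontology. `World` and `extract_values` below are hypothetical placeholders.

```python
# Illustration only: the sole up-front ontological commitment is "these
# humans exist"; everything else is deferred to whatever values can be
# extracted from them, so nothing richer gets frozen in at this level.

from typing import Any, Callable

class World:
    """Stand-in for whatever state representation the AI ends up using."""

def extract_values(humans: Any) -> Callable[[World], float]:
    """Stand-in for the hard part: inferring a utility function from people."""
    raise NotImplementedError

def top_level_goal(humans: Any, world: World) -> float:
    # Defer to the extracted values instead of a hand-written ontology.
    utility = extract_values(humans)
    return utility(world)
```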
If preference is expressed in terms of what you should do, not what’s true about the world, new observations never influence preference, so we can fix it at the start and never revise it (which is an important feature for constructing FAI, since you only ever have a hand in its initial construction).
(To whoever downvoted this without comment—it’s not as stupid an idea as it might sound; what’s true about the world doesn’t matter for preference, but it does matter for decision-making, as decisions are made depending on what’s observed. By isolating preference from the influence of observations, we fix it at the start, but since it determines what should be done depending on all possible observations, we are not ignoring reality.)
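A toy rendering of this point, under assumptions of my own (the observation set, action set and payoff table below are invented): preference is a fixed scoring of whole policies, i.e. functions from observation to action, evaluated in advance over every possible world. Later observations only select which branch of the already-chosen policy fires; they never feed back into the scoring.

```python
from itertools import product

OBSERVATIONS = ["obs_a", "obs_b"]
ACTIONS = ["act_1", "act_2"]
WORLDS = [("obs_a", 0.5), ("obs_b", 0.5)]   # (observation produced, prior probability)

def outcome_value(world_obs: str, action: str) -> float:
    """Fixed preference over outcomes; the payoff table is made up for illustration."""
    table = {("obs_a", "act_1"): 1.0, ("obs_a", "act_2"): 0.0,
             ("obs_b", "act_1"): 0.0, ("obs_b", "act_2"): 3.0}
    return table[(world_obs, action)]

def policy_value(policy: dict) -> float:
    # Score an entire observation -> action mapping against all possible worlds.
    return sum(p * outcome_value(obs, policy[obs]) for obs, p in WORLDS)

# Fix the best policy once, before any observation arrives...
policies = [dict(zip(OBSERVATIONS, acts)) for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]
chosen = max(policies, key=policy_value)

# ...afterwards an observation merely selects a branch of the chosen policy.
print(chosen)   # {'obs_a': 'act_1', 'obs_b': 'act_2'}
```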
In the situation described by Roko, the agent has doubts about its understanding of the very ontology that its values are expressed in. If it were an AI, that would effectively mean that we designed it using mathematics that we thought was consistent but that turns out to have a flaw. The FAI has self-improved to a level where it suspects that the ontology used to represent its value system is internally inconsistent, and it must decide whether to examine the problem further. (So we should have been able to fix it at the start, but couldn’t because we just weren’t smart enough.)
If its values are not represented in terms of an “ontology”, this won’t happen.
See the example of the theist (above). Do you really think that the best possible outcome for him involves knowing more?
How could it be otherwise? His confusion doesn’t define his preference, and his preference doesn’t set this particular form of confusion as being desirable. Maybe Wei Dai’s post is a better way to communicate the distinction I’m making: A Master-Slave Model of Human Preferences (though it’s different, the distinction is there as well).
No, I think his values are defined in terms of a consistent ontology in which ignorance may result in a higher-value outcome. If his values could not in fact be expressed consistently, then I do hold that (by definition) he doesn’t lose by knowing more.
You might be able to get a scenario like this without mind-copying by using a variety of Newcomb’s Problem.
You wake up without any memories of the previous day. You then see Omega in front of you, holding two boxes. Omega explains that if you pick the first box, you will be tortured briefly now. If you pick the second box, you won’t be.
However, Omega informs you that he anticipated which box you would choose. If he predicted you’d pick the first box, he drugged you the previous day so that you’d sleep through the whole day. If he predicted you’d pick the second box, he tortured you for a very long period of time the previous day and erased your memory of it afterward. He acknowledges that torture one doesn’t remember afterwards isn’t as bad as torture one does, and assures you that he knows this and extended the length of the previous day’s torture to compensate.
It seems to me like there’d be a strong temptation to pick the second box. However, your self from a few days ago would likely pay to be able to stop you from doing this.
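A rough way to see the pull of the second box, with made-up numbers (my addition, and assuming Omega’s prediction is perfect): score each choice once from the earlier self’s perspective, where both days count, and once from the naively updated perspective after waking.

```python
# "box1": brief torture now, a drugged sleep yesterday.
# "box2": no torture now, but a long, unremembered torture yesterday that
# Omega calibrated to be worse overall.

BRIEF_TORTURE_NOW = -10
LONG_PAST_TORTURE = -15     # already adjusted for not being remembered

def total_cost(choice: str) -> int:
    # The earlier self's perspective: both days count.
    return BRIEF_TORTURE_NOW if choice == "box1" else LONG_PAST_TORTURE

def cost_after_waking(choice: str) -> int:
    # The naively updated perspective: yesterday is fixed and unremembered,
    # so only today's torture seems to be on the table.
    return BRIEF_TORTURE_NOW if choice == "box1" else 0

print(total_cost("box1"), total_cost("box2"))                # -10 -15
print(cost_after_waking("box1"), cost_after_waking("box2"))  # -10 0
```

The earlier self ranks the first box higher (a cost of 10 versus 15), while the post-waking calculation makes the second box look free, which is exactly the temptation described above and why the earlier self would pay to forestall it.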
Is that an area in which TDT would describe the appropriate response using different words than UDT, even if they suggest the same action? I’m still trying to clarify the difference between UDT, TDT and my own understanding of DT. I would not describe the-updating-that-causes-the-value-changes as ‘Bayesian updating’, but rather ‘naive updating’. (But this is a terminology preference.)
My understanding is that TDT would not press the button, just like it wouldn’t give $100 to the counterfactual mugger.
Thanks. So they actually do lead to different decisions? That is good to know… but puts me one step further away from confidence!
I wish I could upvote twice as this is extremely important.