There is going to be value drift even if we get an FAI. Isn’t that inherent in extrapolated volition? We don’t really want our current values; we want the values we’ll have after being smarter and having time to think deeply about them. The route of WBE simply takes the guesswork out: actually make people smarter, and then see what the drifted values are. Of course, it’s important to keep a large, diverse culture in the process, so that the whole can error-correct for individuals who go off the deep end, analogous to why extrapolated volition would be based on the entire human population rather than a single person.
Here is a potentially more productive way of seeing this situation: We do want our current preferences to be made reality (because that’s what the term preference describes), but we do not know what our preferences look like, part of the reason being that we are not smart enough and do not have enough time to think about what they are. In this view, our preferences are not necessarily going to drift if we figure out how to refer to human preference as a formal object and if we build machines that use this object to choose what to do — and in this view, we certainly don’t want our preferences to drift.
On the other hand, WBE does not “simply take the guesswork out”. It may be the case that the human mind is built such that “making people smarter” is feasible without changing preference much, but we don’t know that this is the case. As long as we do not have a formal theory of preference, we cannot strongly believe this about any given intervention – and if we do have such a theory, then there exist better uses for this knowledge.
We do want our current preferences to be made reality (because that’s what the term preference describes)
Yes, but one of our preferences may well be that we are open to an evolution of our preferences. And, whether or not that is one of our preferences, it certainly is the case that preferences do evolve over time, and that many consider that a fundamental aspect of the human condition.
It may be the case that the human mind is built such that “making people smarter” is feasible without changing preference much, but we don’t know that this is the case.
I agree we don’t know that this is the case, and would assume that it isn’t.
Yes, but one of our preferences may well be that we are open to an evolution of our preferences. And, whether or not that is one of our preferences, it certainly is the case that preferences do evolve over time, and that many consider that a fundamental aspect of the human condition.
Any notion of progress (what we want is certainly not evolution) can be captured as a deterministic criterion.
Obviously I meant ‘evolution’ in the sense of change over time, not change specifically induced by natural selection.
As to a deterministic criterion, I agree that such a thing is probably possible. But… so what? I’m not arguing that FAI isn’t possible. The topic at hand is FAI research relative to WBE. I’m assuming a priori that both are possible. The question is which basket should get more eggs.
But… so what? I’m not arguing that FAI isn’t possible. The topic at hand is FAI research relative to WBE. I’m assuming a priori that both are possible. The question is which basket should get more eggs.
You said:
Yes, but one of our preferences may well be that we are open to an evolution of our preferences.
This is misuse of the term “preference”. “Preference”, in the context of this discussion, refers specifically to that which isn’t to be changed, ever. This point isn’t supposed to be related to WBE vs. FAI discussion, it’s about a tool (the term “preference”) used in leading this discussion.
Your definition is too narrow for me to accept. Humans are complicated. I doubt we have a core set of “preferences” (by your definition) which can be found with adequate introspection. The very act of introspection itself changes the human and potentially their deepest preferences (normal definition)!
I have some preferences which satisfy your definition, but I wouldn’t consider them my core, underlying preferences. The vast majority of preferences I hold do not qualify. I’m perfectly OK with them changing over time, even the ones that guide the overarching path of my life. Yes, the change in preferences is often caused by other preferences, but to think that this causal chain can be traced back to a core preference is unjustified, in my opinion. There could just as well be closed loops in the causal tree.
You are disputing definitions! Of course, there are other natural ways to give meaning to the word “preference”, but they are not as useful in discussing FAI as the comprehensive unchanging preference. It’s not supposed to have much in common with likes or wants, or with their changes, though it does need to describe, in particular, what they should be and how they should change. Think of your preference as that particular formal goal system that it is optimal, from your point of view (on reflection, if you knew more, etc.), to give to a Strong AI.
Your dislike for applying the label “preference” to this concept, and the ambiguity that might introduce, needs to be separated from consideration of the concept itself.
I specifically dispute the usefulness of your definition. It may be a useful definition in the context of FAI theory. We aren’t discussing FAI theory.
And, to be fair, you were originally the one disputing definitions. In my post I used the standard definition of ‘preference’, which you decided was ‘wrong’, saying
This is misuse of the term “preference”
rather than accepting the implied (normal!) definition I had obviously used.
Regardless, it seems unlikely we’ll be making any progress on the on-topic discussion even if we resolve this quibble.
I specifically dispute the usefulness of your definition. It may be a useful definition in the context of FAI theory. We aren’t discussing FAI theory.
But we do. Whether a particular action is going to end well for humanity is a core consideration in Friendliness. When you say
The route of WBE simply takes the guesswork out: actually make people smarter, and then see what the drifted values are.
if it’s read as implying that this road is OK, it is a factual claim about how preferable (in my sense) the outcome is going to be. The concept of preference (in my sense) is central to evaluating the correctness of your factual claim.
The concept of preference (in my sense) is central to evaluating the correctness of your factual claim.
Your concept of preference is one way of evaluating the correctness of my claim, I agree. If you can resolve the complex web of human preferences (in my sense) into a clean, non-contradictory, static preference system (your sense) then you can use that system to judge the value of the hypothetical future in which WBE research outpaced FAI research.
It’s not clear to me that this is the only way to evaluate my claim, or that it is even a reasonable way. My understanding of FAI is that arriving at such a resolution of human preferences is a central ingredient to building an FAI, hence using your method to evaluate my claim would require more progress on FAI. But the entire point of this discussion is to decide if we should be pushing harder for progress on FAI or WBE. I’ll grant that this is a point in favor of FAI—that it allows for a clearer evaluation of the very problem we’re discussing—but, beyond that, I think we must rely on the actual preferences we have access to now (in my sense: the messy, human ones) to further our evaluations of FAI and WBE.
It’s not clear to me that this is the only way to evaluate my claim, or that it is even a reasonable way. My understanding of FAI is that arriving at such a resolution of human preferences is a central ingredient to building an FAI, hence using your method to evaluate my claim would require more progress on FAI.
If your statement (“The route of WBE simply takes the guesswork out”) were a comparison between two routes similar in approach, e.g. WBE and neuroenhancement, then you could argue that a better formal understanding of preference would be required before we could use the idea of “precise preference” to argue for one approach or the other.
Since we are comparing one option which does not try to capture preference precisely with an option that does, it does not matter what exactly precise preference says about the second option: Whatever statement our precise preferences make, the second option tries to capture it, whereas the first option makes no such attempt.
The first option tries to capture our best current guess as to our fundamental preference. It then updates the agent (us) based on that guess. Afterwards the next guess as to our fundamental preference is likely different, so the process iterates. The iteration is trying to evolve towards what the agent thinks is its exact preference. The iteration is simply doing so to some sort of “first order” approximation.
For the first option, I think self-modification under the direction of current, apparent preferences should be done with extreme caution, so as to get a better ‘approximation’ at each step. For the second option though, it’s hard for me to imagine ever choosing to self-modify into an agent with exact, unchanging preferences.
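One way to picture this iteration is as a toy model. Everything in the sketch below is invented purely for illustration (the one-dimensional “preference axis”, the noise levels, the update rule): the agent keeps moving toward its current guess at its true preference, while each modification also nudges the true preference itself, which is exactly the worry raised next.

    import random

    def iterate(steps: int = 50, estimate_noise: float = 0.2, perturb: float = 0.02) -> float:
        """Return how far the agent's true preference drifts from its initial value."""
        true_pref = 0.0   # the agent's actual (but not directly accessible) preference
        agent = 1.0       # where the agent's current behaviour points
        for _ in range(steps):
            estimate = true_pref + random.gauss(0.0, estimate_noise)  # imperfect introspection
            agent += 0.5 * (estimate - agent)                          # update toward the guess
            true_pref += random.gauss(0.0, perturb)                    # side effect of modifying the agent
        return abs(true_pref)

    print("typical drift after 50 rounds:", round(sum(iterate() for _ in range(500)) / 500, 3))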
The first option tries to capture our best current guess as to our fundamental preference. It then updates the agent (us) based on that guess.
This guess may be awful. The process of emulation and attempts to increase the intelligence of the emulations may introduce subtle psychological changes that could affect the preferences of the persons involved.
For subsequent changes based on “trying to evolve towards what the agent thinks is its exact preference” I see two options: Either they are like the first change, open to the possibility of being arbitrarily awful due to the fact that we do not have much introspective insight into the nature of our preferences, and step by step we lose part of what we value — or subsequent changes consist of the formalization and precise capture of the object preference, in which case the situation must be judged depending on how much value was lost in the first step vs how much value was gained by having emulations work on the project of formalization.
For the second option though, it’s hard for me to imagine ever choosing to self-modify into an agent with exact, unchanging preferences.
This is not the proposal under discussion. The proposal is to build a tool that ensures that things develop according to our wishes. If it turns out that our preferred (in the exact, static sense) route of development is through a number of systems that are not reflectively consistent themselves, then this route will be realized.
It may be horribly awful, yes. The question is “how likely is it to be awful?”
If FAI research can advance fast enough then we will have the luxury of implementing a coherent preference system that will guarantee the long-term stability of our exact preferences. In an ideal world that would be the path we took. In the real world there is a downside to the FAI path: it may take too long. The benefit of other paths is that, although they would have some potential to fail even if executed in time, they offer a potentially faster timetable.
I’ll reiterate: yes, of course FAI would be better than WBE, if both were available. No, WBE provides no guarantee and could lead to horrendous preference drift. The questions are: How likely is WBE to go wrong? How long is FAI likely to take? How long is WBE likely to take? And, ultimately, combining the answers to those questions: where should we be directing our research?
Your post points out very well that WBE might go wrong. It gives no clue to the likelihood though.
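For concreteness, here is a bare-bones sketch of how answers to those questions could be combined. Every number below is a placeholder invented for illustration, not an estimate anyone in this thread has offered.

    def p_good_outcome(p_done_in_time: float, p_preserves_preference: float) -> float:
        """Chance a research path finishes before it is too late and also goes well if it does."""
        return p_done_in_time * p_preserves_preference

    # Placeholder inputs: probability the path is completed in time, and probability
    # it preserves human preference given that it is completed.
    fai_path = p_good_outcome(p_done_in_time=0.3, p_preserves_preference=0.9)
    wbe_path = p_good_outcome(p_done_in_time=0.6, p_preserves_preference=0.5)
    print("FAI path:", round(fai_path, 3), "WBE path:", round(wbe_path, 3))

The shape of the calculation is the easy part; the dispute in this thread is over what the inputs should be and how they respond to added research effort.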
Good, this is progress. Your comment clarified your position greatly. However, I do not know what you mean by “how long is WBE likely to take?” — take until what happens?
The amount of time until we have high fidelity emulations of human brains. At that point we can start modifying/enhancing humans, seeking to create a superintelligence or at least sufficiently intelligent humans that can then create an FAI. The time from first emulation to superintelligence is nonzero, but is probably small compared to the time to first emulation. If we have reason to believe that the additional time is not small we should factor in our predictions for it as well.
My conclusion from this discussion is that our disagreement lies in the probability we assign that uploads can be applied safely to FAI as opposed to generating more existential risk. I do not see how to resolve this disagreement right now. I agree with your statement that we need to make sure that those involved in running uploads understand the problem of preserving human preference.
I’m not entirely sure how to resolve that either. However, it isn’t necessary for us to agree on that probability to agree on a course of action.
What probability would you assign to uploads being used safely? What do your probability distributions look like for the ETA of uploads, FAI and AGI?
We do understand something about exact preferences in general, without knowing which one of them is ours. In particular, we do know that drifting from whatever preference we have is not preferable.
I agree. If our complex preferences can be represented as exact preferences then any drift from those exact preferences would be necessarily bad. However, it’s not clear to me that we actually would be drifting from our exact preference were we to follow the path of WBE.
It’s clear that the preferences we currently express most likely aren’t our exact preferences. The path of WBE could potentially lead to humans with fundamentally different exact preferences (bad), or it could simply lead to humans with the same exact preferences but with a different, closer expression of them in the surface preferences they actually present and are consciously aware of (good). Or the path could lead to someplace in between, obviously. Any drift is bad, I agree, but small enough drift could be acceptable if the trade-off is good enough (such as preventing a negative singularity).
By the way, I move to label your definition “exact preference” and mine “complex preference”. Unless the context is clear, in which case we can just write “preference”. Thoughts?
And, to be fair, you were originally the one disputing definitions. In my post I used the standard definition of ‘preference’, which you decided was ‘wrong’, [...] rather than accepting the implied (normal!) definition I had obviously used.
You are right, I was wrong to claim authority over the meaning of the term as you used it. The actual problem was your misinterpretation of its use in andreas’s comment, where it was used in my sense:
We do want our current preferences to be made reality (because that’s what the term preference describes)
There is going to be value drift even if we get an FAI. Isn’t that inherent in extrapolated volition?
No. Progress and development may be part of human preference, but it is entirely OK for a fixed preference to specify progress happening in a particular way, as opposed to other possible ways. Furthermore, preference can be fixed and still not knowable in advance (so that there are no spoilers, and moral progress happens through your effort and is not dictated “from above”).
It’s not possible to efficiently find out some properties of a program, even if you have its whole source code; this source code doesn’t change, but the program runs—develops—in novel and unexpected ways. Of course, the unexpected needs to be knowably good, not just “unexpected” (see for example Expected Creative Surprises).
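A deliberately mundane toy example of that point (mine, not part of the original argument): the program below is short, deterministic and never changes, yet the only general way known for learning how long it runs on a given input is to run it and watch.

    def steps_to_one(n: int) -> int:
        """Iterate the Collatz map (n -> n/2 if even, else 3n+1) and count steps until n reaches 1."""
        count = 0
        while n != 1:
            n = 3 * n + 1 if n % 2 else n // 2
            count += 1
        return count

    # Fixed source, surprising behaviour: 26 takes 10 steps, while 27 takes 111.
    print(steps_to_one(26), steps_to_one(27))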
I agree that such a fixed preference system is possible. But I don’t think that it needs to be implemented in order for “moral progress” to be indefinitely sustainable in a positive fashion. I think humans are capable of guiding their own moral progress without their hands being held. Will the result be provably friendly? No, of course not. The question is how likely is the result to be friendly, and is this likelihood great enough that it offsets the negatives associated with FAI research (namely the potentially very long timescales needed).
I think humans are capable of guiding their own moral progress without their hands being held. Will the result be provably friendly? No, of course not. The question is how likely is the result to be friendly
The strawman of “provable friendliness” again. It’s not about holding ourselves to an inadequately high standard, it’s about figuring out what’s going on, in any detail. (See this comment.)
If we accept that preference is complex (holds a lot of data), and that detail in preference matters (losing a relatively small portion of this data is highly undesirable), then any value drift is bad, and as long as value drift is not rigorously controlled, its random walk is going to lead further and further away from the initial preference. As a result, from the point of view of the initial preference, the far future is pretty much lost, even if each individual step of the way doesn’t look threatening. The future agency won’t care about the past preference, and won’t revert to it, because as a result of value drift it already has a different preference, and for it returning to the past is no longer preferable. This system isn’t stable: deviations in preference don’t correct themselves if the deviated-preference agency has control.
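To make the “random walk” image concrete, here is a minimal simulation. The single preference axis and the step size are invented for illustration; the point is only that when shifts are independent and nothing pushes back toward the start, the expected distance from the initial preference keeps growing, roughly with the square root of the number of shifts.

    import random

    def mean_distance(steps: int, step_size: float = 0.01, trials: int = 500) -> float:
        """Average distance from the starting point after `steps` uncorrected random shifts."""
        total = 0.0
        for _ in range(trials):
            position = 0.0
            for _ in range(steps):
                position += random.uniform(-step_size, step_size)
            total += abs(position)
        return total / trials

    for n in (10, 100, 1000, 10000):
        print(n, "shifts:", round(mean_distance(n), 4))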
I fail to see how my post was a straw man. I was pointing out a deficiency in what I am supporting, not what you are supporting.
This system isn’t stable: deviations in preference don’t correct themselves if the deviated-preference agency has control.
I disagree that we know this. Certainly the system hasn’t stabilized yet, but how can you make such a broad statement about the future evolution of human preference? And, in any case, even if there were no ultimate attractor in the system, so what? Human preferences have changed over the centuries. My own preferences have changed over the years. I don’t think anyone is arguing this is a bad thing. Certainly, we may be able to build a system that replaces our “sloppy” method of advancement with a deterministic system built around an immutable set of preferences. I disagree this is necessarily superior to letting preferences evolve in the same way they have been, free of an overseer. But that disagreement of ours is still off topic.
The topic is whether FAI or WBE research is better for existential risk reduction. The pertinent question is what are the likelihoods of each leading to what we would consider a positive singularity, and, more importantly, how do those likelihoods change as a function of our directed effort?
I fail to see how my post was a straw man. I was pointing out a deficiency in what I am supporting, not what you are supporting.
It shouldn’t matter who supports what. If you suddenly agree with me on some topic, you still have to convince me that you did so for the right reasons, and didn’t accept a mistaken argument or mistaken understanding of an argument (see also “belief bias”). If such is to be discovered, you’d have to make a step back, and we both should agree that it’s the right thing to do.
The “strawman” (probably a wrong term in this context) is in making a distinction between “friendliness” and “provable friendliness”. If you accept that the distinction is illusory, the weakness of non-FAI “friendliness” suddenly becomes “provably fatal”.
This system isn’t stable: deviations in preference don’t correct themselves if the deviated-preference agency has control.
I disagree that we know this. Certainly the system hasn’t stabilized yet, but how can you make such a broad statement about the future evolution of human preference?
Stability is a local property of a specific point: it states that sufficiently small deviations from that point will be followed by corrections back to it, so that the system will indefinitely remain in close proximity to that point, provided it’s not disturbed too much.
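In symbols, for a discrete-time system $x_{t+1} = f(x_t)$ with a candidate rest point $x^\ast$, this is just the standard textbook notion of (asymptotic) stability, stated here only to pin the term down:
\[
\forall \varepsilon > 0 \;\exists \delta > 0 :\; \lVert x_0 - x^\ast \rVert < \delta \;\Rightarrow\; \lVert x_t - x^\ast \rVert < \varepsilon \ \text{ for all } t \ge 0,
\qquad
\exists \delta_0 > 0 :\; \lVert x_0 - x^\ast \rVert < \delta_0 \;\Rightarrow\; x_t \to x^\ast .
\]
The first clause says small deviations stay small; the second says they are eventually corrected back to $x^\ast$. The claim below is that a society whose preference has already shifted satisfies neither clause with respect to the original preference.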
Where we replace ourselves with agency of slightly different preference, this new agency has no reason to correct backwards to our preference. If it is not itself stable (that is, it hasn’t built its own FAI), then the next preference shift it experiences (in effectively replacing itself with yet another different-preference agency) isn’t going to be related to the first shift, and isn’t going to correct it. As a result, value is slowly but inevitably lost. This loss of value only stops when reflective consistency is finally achieved, but it won’t be achieved by an agency that exactly shares your preference. Thus, even once you’ve lost the fight for your preference specifically, the only hope is for the similar-preference drifted agency to stop as soon as possible (as close to your preference as possible) to develop its FAI. (See also: Friendly AI: a vector for human preference.)
My own preferences have changed over the years. I don’t think anyone is arguing this is a bad thing.
The past-you is going to prefer that your preference not change, even though current-you would prefer your preference to be as it now is. Note that preference has little to do with likes or wants, so you might be talking about surface reactions to environment and knowledge, not the elusive concept of what you’d prefer in the limit of reflection. (See also: “Why Socialists don’t Believe in Fun”, Eutopia is Scary.)
The topic is whether FAI or WBE research is better for existential risk reduction. The pertinent question is what are the likelihoods of each leading to what we would consider a positive singularity, and, more importantly, how do those likelihoods change as a function of our directed effort?
And to decide this question, we need a solid understanding of what counts as a success or failure. The concept of preference is an essential tool in gaining this understanding.