FAI doesn’t do what you do; it optimizes its strategy according to preference. It’s better able than a human to form strategies according to a given preference, and even failing that, it still has to be able to avoid value drift (as a minimum requirement).
Preference is never seen completely; there is always a great deal of logical uncertainty about it. The point of creating a FAI is in fixing the preference so that it stops drifting, so that the problem being solved is held fixed even though solving it will take the rest of eternity; and in creating a competitive preference-optimizing agent that ensures the preference fares OK against possible threats, including different-preference agents or a value-drifted humanity.
Preference isn’t defined by an agent’s strategy, so copying a human without some kind of self-reflection that I don’t understand is pretty pointless. Since I never described a way of extracting preference from a human (and hence defining it for a FAI), I’m not sure where you see the regress in the process of defining preference.
A FAI is not built without an exact and complete definition of preference. The uncertainty about preference can only be logical, in what it means/implies. (At least when we are talking about syntactic preference, where the rest of the world is necessarily screened off.)
Since I never described a way of extracting preference from a human (and hence defining it for a FAI), I’m not sure where you see the regress in the process of defining preference.
Reading your previous post in this thread, I felt like I was missing something and I could have asked the question Wei Dai asked (“Once we implement this kind of FAI, how will we be better off than we are today?”). You did not explicitly describe a way of extracting preference from a human, but phrases like “if you manage to represent your preference in terms of your I/O” made it seem like capturing strategy was what you had in mind.
I now understand you as talking only about what kind of object preference is (an I/O map) and about how this kind of object can contain certain preferences that we worry might be lost (like considerations of faulty hardware). You have not said anything about what kind of static analysis would take you from an agent’s strategy to an agent’s preference.
After reading Nesov’s latest posts on the subject, I think I better understand what he is talking about now. But I still don’t get why Nesov seems confident that this is the right approach, as opposed to a possible one that is worth looking into.
You [Nesov] have not said anything about what kind of static analysis would take you from an agent’s program to an agent’s [syntactic] preference.
Do we have at least an outline of how such an analysis would work? If not, why do we think that working out such an analysis would be any easier than, say, trying to state ourselves what our “semantic” preferences are?
But I still don’t get why Nesov seems confident that this is the right approach, as opposed to a possible one that is worth looking into.
What other approaches do you refer to? This is just the direction my own research has taken. I’m not confident it will lead anywhere, but it’s the best road I know about.
Do we have at least an outline of how such an analysis would work? If not, why do we think that working out such an analysis would be any easier than, say, trying to state ourselves what our “semantic” preferences are?
I have some ideas, though too vague to usefully share (I wrote about a related idea on the SIAI decision theory list, replying to Drescher’s bounded Newcomb variant, where a dependence on strategy is restored from a constant syntactic expression in terms of source code). For “semantic preference”, we have the ontology problem, which is a complete show-stopper. (Though as I wrote before, interpretations of syntactic preference in terms of formal “possible worlds”—now having nothing to do with the “real world”—are a useful tool, and it’s the topic of the next blog post.)
At this point, syntactic preference (1) solves the ontology problem, (2) focuses the investigation of what kind of mathematical structure could represent preference (a strategy is a well-understood mathematical structure, and syntactic preference is something that allows computing a strategy, with better strategies resulting from more computation), and (3) gives a more technical formulation of the preference extraction problem, so that we can think about it more clearly. I don’t know of another effort towards clarifying/developing preference theory (that reaches even this meager level of clarity).
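To make point (2) a bit more concrete, here is a toy sketch (my own illustration only, not any worked-out formalism): a strategy is taken to be a finite I/O map, the “preference” is a made-up scoring of strategies that never mentions an external world, and a bounded search returns better strategies as it is given more computation. Every name and the scoring function below are invented for the example.

```python
# Toy sketch only: "syntactic preference" stands in for a scoring of
# strategies (finite I/O maps), with no reference to an external ontology;
# a bounded search computes better strategies given more computation.
from itertools import product

INPUTS = ["i0", "i1"]    # hypothetical finite input alphabet
OUTPUTS = ["o0", "o1"]   # hypothetical finite output alphabet

def all_strategies():
    """Enumerate every I/O map: one output chosen for each input."""
    for outs in product(OUTPUTS, repeat=len(INPUTS)):
        yield dict(zip(INPUTS, outs))

def preference_score(strategy):
    """Made-up stand-in for a preference: scores a strategy directly."""
    return sum(1 for i, o in strategy.items() if (i, o) == ("i1", "o1"))

def best_strategy(budget):
    """Anytime search: a larger computation budget can only improve
    (never worsen) the score of the strategy returned."""
    best, best_score = None, float("-inf")
    for n, strategy in enumerate(all_strategies()):
        if n >= budget:
            break
        score = preference_score(strategy)
        if score > best_score:
            best, best_score = strategy, score
    return best

print(best_strategy(budget=1))  # little computation: possibly a poor strategy
print(best_strategy(budget=4))  # enough to see every strategy: the best one
```

The only point of the sketch is the shape of the setup: the preference is a syntactic object over strategies, and more computation buys a better strategy, not a different notion of “better”.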
If not, why do we think that working out such an analysis would be any easier than, say, trying to state ourselves what our “semantic” preferences are?
Returning to this point, there are two show-stopping problems: first, as I pointed out above, there is the ontology problem: even if humans were able to write out their preference, the ontology problem makes the product of such an effort rather useless; second, we do know that we can’t write out our preference manually. Figuring out an algorithmic trick for extracting it from human minds automatically is not out of the question, hence worth pursuing.
P.S. These are important questions, and I welcome this kind of discussion about general sanity of what I’m doing or claiming; I only saw this comment because I’m subscribed to your LW comments.
Why do you consider the ontology problem to be a complete show-stopper? It seems to me there are at least two other approaches to it that we can take:
We human beings seem to manage to translate our preferences from one ontology to another when necessary, so try to figure out how we do that, and program it into the FAI.
Work out what the true, correct ontology is, then translate our preferences into that ontology. It seems that we already have a good candidate for this in the form of “all mathematical structures”. Formalizing that notion seems really hard, but why should it be impossible?
You claim that syntactic preference solves the ontology problem, but I have even fewer ideas about how to extract the syntactic preference of arbitrary programs. You mention that you do have some vague ideas, so I guess I’ll just have to be patient and let you work them out.
second, we do know that we can’t write out our preference manually.
How do we know that? It’s not clear to me that there is any more evidence for “we can’t write out our preferences manually”, than for “we can’t build an artificial general intelligence manually”.
I only saw this comment because I’m subscribed to your LW comments.
I had a hunch that might be the case. :)
Why do you consider the ontology problem to be a complete show-stopper? It seems to me there are at least two other approaches to it that we can take:
By “show-stopper” I simply mean that we absolutely have to solve it in some way. Syntactic preference is one way; what you suggest could conceivably be another.
You claim that syntactic preference solves the ontology problem, but I have even fewer ideas about how to extract the syntactic preference of arbitrary programs.
An advantage I see with syntactic preference is that it’s at least more or less clear what we are working with: formal programs and strategies. This opens up a whole palette of possible approaches to try on the remaining problems. With the “all mathematical structures” thing, we still don’t know what we are supposed to talk about; as of now there is no way forward even at that step. Syntactic preference at least allows us to take one step further, onto firmer ground, even though admittedly it’s unclear what to do next.
second, we do know that we can’t write out our preference manually.
How do we know that? It’s not clear to me that there is any more evidence for “we can’t write out our preferences manually”, than for “we can’t build an artificial general intelligence manually”.
I mean the “complexity of value”/“value is fragile” thesis. It seems to me quite convincing, and from the opposite direction I have the “preference is detailed” conjecture, resulting from the nature of preference in general. For “is it possible to build AI”, we don’t have similarly convincing arguments (and really, it’s an unrelated claim that only contributes a connotation of error in judgment, without offering an analogy in the method of arriving at that judgment).
I mean the “complexity of value”/“value is fragile” thesis.
I agree with “complexity of value” in the sense that human preference, as a mathematical object, has high information content. But I don’t see a convincing argument from this premise to the conclusion that the best course of action for us to take, in the sense of maximizing our values under the constraints that we’re likely to face, involves automated extraction of preferences, instead of writing them down manually.
Consider the counter-example of someone who has the full complexity of human values, but would be willing to give up all of their other goals to fill the universe with orgasmium, if that choice were available. Such an agent could “win” by building a superintelligence with just that one value. How do we know, at this point, that our values are not like that?
Whatever the case is with how acceptable the simplified values would be, automated extraction of preference seems to be the only way to actually knowably win, rather than to strike a compromise, which is what a simplified preference is suggested to be. We must decide from the information we have; how would you come to know that a particular simplified preference definition is any good? I don’t see a way forward without first having a more precise moral machine than a human (but then, we won’t need to consider simplified preference).
I now understand you as talking only about what kind of object preference is (an I/O map) and about how this kind of object can contain certain preferences that we worry might be lost (like considerations of faulty hardware).
Correct. Note that “strategy” is a pretty standard term, while “I/O map” sounds ambiguous, though it emphasizes that everything except the behavior at I/O is disregarded.
You have not said anything about what kind of static analysis would take you from an agent’s strategy to an agent’s preference.
An agent is more than its strategy: a strategy is only external behavior, the normal form of the algorithm implemented in the agent. The same strategy can be implemented by many different programs. I strongly suspect that it takes more than a strategy to define preference, that introspective properties are important (how the behavior is computed, as opposed to just what the resulting behavior is). It is sufficient for preference, once it is defined, to talk about strategies and disregard how they could be computed; but to define (extract) a preference, a single strategy may be insufficient, and it may be necessary to look at how the reference agent (e.g. a human) works on the inside. Besides, the agent is never given as its strategy; it is given as source code that normalizes to that strategy, and computing the strategy may be tough (and pointless).
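To illustrate the strategy/program distinction in the simplest terms, here is a toy example (my own, not part of any proposed formalism): two programs with different internals that normalize to the same strategy, i.e. the same I/O map, so nothing that looks only at the strategy can tell them apart. The agent names and observations below are invented.

```python
# Toy illustration: two different programs (different internals) that
# implement the same strategy (the same I/O map). An analysis that sees
# only the strategy cannot distinguish them; telling them apart requires
# looking at the source code. Names and observations are invented.

def agent_a(observation):
    # Computes its output by table lookup.
    table = {"ask": "answer", "offer": "accept"}
    return table.get(observation, "ignore")

def agent_b(observation):
    # Computes the same outputs by a different procedure.
    if observation == "ask":
        return "answer"
    if observation == "offer":
        return "accept"
    return "ignore"

observations = ["ask", "offer", "greet"]
strategy_a = {obs: agent_a(obs) for obs in observations}
strategy_b = {obs: agent_b(obs) for obs in observations}
assert strategy_a == strategy_b  # same strategy, different source code
```

If introspective properties matter for defining preference, it is exactly the difference between the two agents’ internals, invisible in the shared strategy, that an extraction procedure would have to read.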