a test for those who propose to “extract” or “extrapolate” our preferences into a well-defined and rational form
If we are going to have a serious discussion about these matters, at some point we must face the fact that the physical description of the world contains no such thing as a preference or a want—or a utility function. So the difficulty of such extractions or extrapolations is twofold. Not only is the act of extraction or extrapolation itself conditional upon a value system (i.e. normative metamorality is just as “relative” as is basic morality), but there is nothing in the physical description to tell us what the existing preferences of an agent are. Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.
It’s easy to miss this in a decision-theoretic discussion, because decision theory already assumes some concept like “goal” or “utility”, always. Decision theory is the rigorous theory of decision-making, but it does not tell you what a decision is. It may even be possible to create a rigorous “reflective decision theory” which tells you how a decision architecture should choose among possible alterations to itself, or a rigorous theory of normative metamorality, the general theory of what preferences agents should have towards decision-architecture-modifying changes in other agents. But meta-decision theory will not bring you any closer to finding “decisions” in an ontology that doesn’t already have them.
I agree this is part of the problem, but like others here I think you might be making it out to be harder than it is. We know, in principle, how to translate a utility function into a physical description of an object: by coding it as an AI and then specifying the AI along with its substrate down to the quantum level. So, again in principle, we can go backwards: take a physical description of an object, consider all possible implementations of all possible utility functions, and see if any of them matches the object.
We know, in principle, how to translate a utility function into a physical description of an object: by coding it as an AI and then specifying the AI along with its substrate down to the quantum level. So, again in principle, we can go backwards: take a physical description of an object, consider all possible implementations of all possible utility functions, and see if any of them matches the object.
I think it’s enough to consider computer programs and dispense with the details of physics—everything else can be discovered by the program. You are assuming a “bottom” level of physics, the “quantum level”, but there is no bottom, not really; there is only the beginning, where our own minds are implemented, and the process of discovery that defines the way we see the rest of the world.
If you start with an AI design parameterized by preference, you are not going to enumerate all programs, only a small fraction of programs that have the specific form of your AI with some preference, and so for a given arbitrary program there will be no match. Furthermore, you are not interested in finding a match: if a human was equal to the AI, you are already done! It’s necessary to explicitly go the other way, starting from arbitrary programs and understanding what a program is, deeply enough to see preference in it. This understanding may give an idea of a mapping for translating a crazy ape into an efficient FAI.
If you start with an AI design parameterized by preference, you are not going to enumerate all programs, only a small fraction of programs that have the specific form of your AI with some preference, and so for a given arbitrary program there will be no match.
When I said “all possible implementations of all possible utility functions”, I meant to include flawed implementations. But then two different utility functions might map onto the same physical object, so we’d also need a theory of implementation flaws that tells us, given two implementations of a utility function, which is more flawed.
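As a purely illustrative sketch of this brute-force “go backwards” idea (a made-up three-state world, a made-up family of candidate utility functions, and simple mismatch-counting standing in for the missing theory of implementation flaws), one could imagine something like:

```python
# Toy sketch only: enumerate candidate utility functions, derive the policy a
# flawless implementation of each would follow, and score the observed system
# by how often its actual behavior deviates. The deviation count is a crude
# stand-in for the "which implementation is more flawed" metric discussed above.

STATES = ["cold", "ok", "hot"]
ACTIONS = ["heat", "idle", "cool"]

# The "physical object", described only by its input-output behavior.
observed_policy = {"cold": "heat", "ok": "idle", "hot": "idle"}

# Made-up world dynamics: the state that results from taking an action in a state.
TRANSITION = {
    "heat": {"cold": "ok", "ok": "hot", "hot": "hot"},
    "idle": {"cold": "cold", "ok": "ok", "hot": "hot"},
    "cool": {"cold": "cold", "ok": "cold", "hot": "ok"},
}

def wants(target):
    """Candidate utility function: 1 if the action lands the world in `target`."""
    return lambda state, action: 1.0 if TRANSITION[action][state] == target else 0.0

def ideal_policy(utility):
    return {s: max(ACTIONS, key=lambda a: utility(s, a)) for s in STATES}

def flaw_count(utility):
    ideal = ideal_policy(utility)
    return sum(observed_policy[s] != ideal[s] for s in STATES)

candidates = {f"wants '{t}'": wants(t) for t in STATES}
scores = {name: flaw_count(u) for name, u in candidates.items()}
print(scores, "-> best match:", min(scores, key=scores.get))
```

The enumeration is hopeless at any realistic scale, of course; the sketch only shows where a “flaw metric” would have to slot in.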
When I said “all possible implementations of all possible utility functions”, I meant to include flawed implementations. But then two different utility functions might map onto the same physical object, so we’d also need a theory of implementation flaws that tells us, given two implementations of a utility function, which is more flawed.
This is WAY too hand-wavy an explanation for “in principle, we can go backwards” (from a system to its preference). I believe that in principle, we can, but not via injecting fuzziness of “implementation flaws”.
Here’s another statement of the problem: One agent’s bias is another agent’s heuristic. And the “two agents” might be physically the same, but just interpreted differently.
Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.
There are clear-cut cases, like a thermostat, where the physics of the system is well-approximated by a function that computes the degree of difference between the actual measured state of the world and a “desired state”. In these clear-cut cases, it isn’t a matter of opinion or interpretation. Basically, echoing Nesov.
Thus, the criterion for ascribing preferences to a physical system is that the actual physics has to be well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”.
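For concreteness, a minimal sketch of the thermostat case (all numbers invented, and the error-correcting gain assumed known so that only the setpoint has to be inferred):

```python
# Invented-numbers sketch: a system whose physics is "nudge the measured state
# toward a setpoint" admits a fairly non-arbitrary ascription of a desired state,
# because one setpoint explains the observed dynamics much better than the others.

import numpy as np

GAIN = 0.3  # assumed known here, to keep the fit one-dimensional

def simulate(setpoint, t0=5.0, steps=40, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    temps = [t0]
    for _ in range(steps):
        t = temps[-1]
        temps.append(t + GAIN * (setpoint - t) + noise * rng.standard_normal())
    return np.array(temps)

observed = simulate(setpoint=21.0)  # stands in for the measured physics of the device

def fit_error(candidate_setpoint):
    actual_updates = observed[1:] - observed[:-1]
    predicted_updates = GAIN * (candidate_setpoint - observed[:-1])
    return float(np.mean((actual_updates - predicted_updates) ** 2))

candidates = np.linspace(0.0, 40.0, 401)
best = candidates[int(np.argmin([fit_error(c) for c in candidates]))]
print(f"inferred 'desired state' ≈ {best:.1f}")  # recovers ~21.0 for this toy system
```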
Thus, the criterion for ascribing preferences to a physical system is that the actual physics has to be well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”.
I don’t think this simple characterisation resembles the truth: the whole point of this enterprise is to make sure things go differently, in a way they just couldn’t proceed by themselves. Thus, observing existing “tendencies” doesn’t quite capture the idea of preference.
make sure things go differently, in a way they just couldn’t proceed by themselves. Thus, observing existing “tendencies” doesn’t quite capture the idea of preference.
I should have been clearer: you have to draw a boundary around the “optimizing agent”, and look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer. If the difference is well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”, then you have an optimizer.
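A toy rendering of that criterion (a leaky room, a boxed-off controller, all parameters invented): compare the environment’s tendency with and without the agent inside the boundary, and fit the “preferred state” that best accounts for the difference.

```python
# Toy sketch: the environment alone drifts toward the outside temperature; with
# the agent inside the boundary it settles somewhere else. If the agent's added
# contribution is well-approximated by "push toward some target", ascribe that
# target as its preferred state. All dynamics below are made up.

OUTSIDE, LEAK, AGENT_GAIN = 5.0, 0.1, 0.4

def step_without_agent(t):
    return t + LEAK * (OUTSIDE - t)

def step_with_agent(t):
    return step_without_agent(t) + AGENT_GAIN * (22.0 - t)  # hidden 'true' target: 22

def run(step, t0=15.0, n=60):
    xs = [t0]
    for _ in range(n):
        xs.append(step(xs[-1]))
    return xs

free, steered = run(step_without_agent), run(step_with_agent)

def push_error(target):
    # How well "push toward `target`" explains the agent's added contribution.
    diffs = [(step_with_agent(x) - step_without_agent(x)) - AGENT_GAIN * (target - x)
             for x in steered]
    return sum(d * d for d in diffs) / len(diffs)

candidates = [c / 2 for c in range(0, 81)]  # 0.0, 0.5, ..., 40.0
best = min(candidates, key=push_error)
print(f"tendency alone -> {free[-1]:.1f}, with agent -> {steered[-1]:.1f}, "
      f"inferred preferred state -> {best}")
```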
I don’t hear differently… I even suspect that preference is introspective, that is, it depends on the way the system works “internally”, not just on how it interacts with the environment. That is, two agents with different preferences may do exactly the same thing in all contexts. Even if not, it’s a long way between how the agent (in its craziness and stupidity) actually changes the environment, and how it would prefer (on reflection, if it was smarter and saner) the environment to change.
Even if not, it’s a long way between how the agent (in its craziness and stupidity) actually changes the environment, and how it would prefer (on reflection, if it was smarter and saner) the environment to change
That is true. If the agent has a well-defined “predictive module” which has a “map” (probability distribution over the environment given an interaction history), and some “other stuff”, then you can clamp the predictive module down to the truth, and then perform what I said before:
look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer. If the difference is well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”, then you have an optimizer.
And you probably also want to somehow formalize the idea that there is a difference between what an agent will try to achieve if it has only limited means—e.g. a lone human in a forest with no tools, clothes or other humans—and what the agent will try to achieve with more powerful means—e.g. with machinery and tools, or in the limit, with a whole technological infrastructure and unlimited computing power at its disposal.
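One way to picture the “clamp the predictive module to the truth” move, under the strong assumption that the agent cleanly factors into a belief distribution plus a choice rule (all worlds, payoffs, and probabilities invented):

```python
# Toy sketch: factor the agent into a map (belief distribution) and "other stuff"
# (here, a payoff table plus expected-value maximization), replace the map with
# the true distribution, and compare what the agent steers toward before and after.

TRUE_DIST = {"rain": 0.7, "sun": 0.3}        # the actual environment
AGENT_MAP = {"rain": 0.1, "sun": 0.9}        # the agent's mistaken predictive module

PAYOFF = {                                    # the agent's "other stuff"
    "take umbrella": {"rain": 1.0, "sun": 0.4},
    "go without":    {"rain": -1.0, "sun": 1.0},
}

def chosen_action(beliefs):
    def expected_value(action):
        return sum(p * PAYOFF[action][world] for world, p in beliefs.items())
    return max(PAYOFF, key=expected_value)

print("behavior as-is:           ", chosen_action(AGENT_MAP))   # 'go without'
print("predictive module clamped:", chosen_action(TRUE_DIST))   # 'take umbrella'
```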
I want to point out that in the interpretation of prior as weights on possible universes, specifically as how much one cares about different universes, we can’t just replace “incorrect” beliefs with “the truth”. In this interpretation, there can still be errors in one’s beliefs caused by things like past computational mistakes, and I think fixing those errors would constitute helping, but the prior perhaps needs to be preserved as part of preference.
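A small invented example of why this matters: if the prior is read as caring weights rather than as a belief to be corrected, then overwriting it changes which action counts as helping.

```python
# Invented numbers, just to make the distinction concrete: under the "prior as
# caring weights" reading, the weights are part of preference, so 'correcting'
# them is not a neutral belief-fix; it changes what helping means.

CARING_WEIGHTS = {"universe_A": 0.8, "universe_B": 0.2}   # part of the agent's preference
OVERWRITTEN    = {"universe_A": 0.5, "universe_B": 0.5}   # what a belief-correction step might impose

UTILITY = {
    "act for A": {"universe_A": 1.0, "universe_B": 0.0},
    "act for B": {"universe_A": 0.0, "universe_B": 1.2},
}

def weighted_best_action(weights):
    return max(UTILITY, key=lambda a: sum(w * UTILITY[a][u] for u, w in weights.items()))

print("with the agent's own weights:", weighted_best_action(CARING_WEIGHTS))  # 'act for A'
print("with overwritten weights:    ", weighted_best_action(OVERWRITTEN))     # 'act for B'
```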
I agree that in the interpretation of prior as weights on possible universes, specifically as how much one cares about different universes, things get more complicated.

Actually, we had a discussion about my discomfort with your interpretation, and it seems that in order for me to see why you endorse this interpretation, I’d have to read up on various paradoxes, e.g. Sleeping Beauty.

Yeah, maybe. But it doesn’t.

Yeah, I mean this discussion is—rather amusingly—rather reminiscent of my first encounter with the CEV problem 2.5 years ago.
If the agent has a well-defined “predictive module” which has a “map” (probability distribution over the environment given an interaction history), and some “other stuff”, then you can clamp the predictive module down to the truth, and then perform what I said before:
Comment by Ricky Loynd, Jun 23, 2007, 7:39 am:
Here’s my attempt to summarize a common point that Roko and I are trying to make. The underlying motivation for extrapolating volition sounds reasonable, but it depends critically on the AI’s ability to distinguish between goals and beliefs, between preferences and expectations, so that it can model human goals and preferences while substituting its own correct beliefs and expectations. But when you start dissecting most human goals and preferences, you find they contain deeper layers of belief and expectation. If you keep stripping those away, you eventually reach raw biological drives, which are not a human belief or expectation. (Though even they are beliefs and expectations of evolution; let’s ignore that for the moment.)
Once you strip away human beliefs and expectations, nothing remains but biological drives, which even the animals have. Yes, an animal, by virtue of its biological drives and ability to act, is more than a predicting rock, but that doesn’t address the issue at hand.
Why is it a tragedy when a loved one dies? Is it because the world no longer contains their particular genetic weighting of biological drives? Of course not. After all, they may have left an identical twin to carry forward the very same genetic combination. But it’s not the biology that matters to us. We grieve because what really made that person a person is now gone, and that’s all in the brain: the shared experiences, their beliefs whether correct or mistaken or indeterminate, their hopes and dreams, all those things that separate humans from animals, and indeed, that separate one human from most other humans. All that the brain absorbs and becomes throughout the course of a life, we call the soul, and we see it as our very humanity, that big, messy probability distribution describing our accumulated beliefs and expectations about ourselves, the universe, and our place in it.
So if the AI models a human while substituting its own beliefs and anticipations of future experiences, then the AI has discarded all that we value in each other. UNLESS you draw a line somewhere, and crisply define which human beliefs get replaced and which ones don’t. Constructing toy examples where such a line is possible to imagine does not mean that the distinction can be made in any general way, but CEV absolutely requires that there be a concrete distinction.
Constructing toy examples where such a line is possible to imagine does not mean that the distinction can be made in any general way, but CEV absolutely requires that there be a concrete distinction.
Basically, CEV works to the extent that there exists a belief/desire separation in a given person. In the thread on the SIAI blog, I posted certain cases where human goals are founded on false beliefs or logically inconsistent thinking, sometimes in complex ways. What is left of the time cube guy once you subtract off his false beliefs and delusions? Not much, probably. The guy is effectively not salvageable, because his identity and values are probably so badly tangled up with the false beliefs that there is no principled way to untangle them, no unique way of extrapolating him that should be considered “correct”.
What is left of the time cube guy once you subtract off his false beliefs and delusions? Not much, probably.
Beware: you are making a common-sense-based prediction about what would be the output of a process that you don’t even have the right concepts for specifying! (See my reply to your other comment.)

It is true that I should sprinkle copious amounts of uncertainty on this prediction.
Wow. Too bad I missed this when it was first posted. It’s what I wish I’d said when justifying my reply to Wei_Dai’s attempted belief/values dichotomy here and here.
I don’t fully agree with Ricky here, but I think he makes a half-good point.
The ungood part of his comment—and mine—is that you can only do your best. If certain people’s minds are too messed up to actually extract values from, then they are just not salvageable. My mind definitely has values that are belief-independent, though perhaps not all of what I think of as “my values” have this nice property, so ultimately they might be garbage.
Indeed. Most of the FAI’s job could consist of saying, “Okay, there’s soooooo much I have to disentangle and correct before I can even begin to propose solutions. Sit down and let’s talk.”
Furthermore, from the CEV thread on the SIAI blog:

Comment by Eliezer Yudkowsky, Jun 18, 2007, 12:52 pm: I furthermore agree that it is not the most elegant idea I have ever had, but then it is trying to solve what appears to be an inherently inelegant problem.
I strongly agree with this: the problem that CEV is the solution to is urgent but it isn’t elegant. Absolutes like “There isn’t a beliefs/desires separation” are unhelpful when solving such inelegant but important problems. There is, in any given person, some kind of separation, and in some people that separation is sufficiently strong that there is a fairly clear and unique way to help them.
I strongly agree with this: the problem that CEV is the solution to is urgent but it isn’t elegant. Absolutes like “There isn’t a beliefs/desires separation” are unhelpful when solving such inelegant but important problems.
One lesson of reductionism, and of the success of simple-laws-based science and technology, is that for real-world systems there might be no simple way of describing them, but there could be a simple way of manipulating their data-rich descriptions. (What’s the yield strength of a car? Wrong question!) Given a gigabyte’s worth of problem statement and the right simple formula, you could get an answer to your query. There is a weak analogy with a misapplication of Occam’s razor, where one tries to reduce the amount of stuff rather than the amount of detail in the ways of thinking about this stuff.
In the case of beliefs/desires separation, you are looking for a simple problem statement, for a separation in the data describing the person itself. But what you should be looking for is a simple way of implementing the make-smarter-and-better extrapolation on a given pile of data. The beliefs/desires separation, if it’s ever going to be made precise, is going to reside in the structure of this simple transformation, not in the people themselves.
This is a good point.

Of course, it would be nice if we could find a general “make-smarter-and-better extrapolation on a given pile of data” algorithm.
But on the other hand, a set of special cases to deal with merely human minds might be the way forward. Even medieval monks had a collection of empirically validated medical practices that worked to an extent, e.g. herbal medicine, but they had no unified theory. Really there is no “unified theory” for healing someone’s body: there are lots of ideas and techniques, from surgery to biochemistry to germ theory. I think that this CEV problem may well turn out to be rather like medicine. Of course, it could look more like wing design, where there is really just one fundamental set of laws, and all else is approximation.
[Y]ou have to draw a boundary around the “optimizing agent”, and look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer.
And there’s your “opinion or interpretation”—not just in how you draw the boundary (which didn’t exist in the original ontology), but in your choice of the theory that you use to evaluate your counterfactuals.
Of course, such theories can be better or worse, but only with respect to some prior system of evaluation.

Still, probably a question of Aristotelian vs. Newtonian mechanics, i.e. not hard to see who wins.

Agreed, but not responsive to Mitchell Porter’s original point. (ETA: … unless I’m missing your point.)
Given the physical ontology we have, the ascription of preferences to a physical system is always a matter of interpretation or imputation, just as is the ascription of semantic or representational content to its states.
But to what extent does the result depend on the initial “seed” of interpretation? Maybe very little. For example, prediction of the behavior of a given physical system strictly speaking rests on the problem of induction, but that doesn’t exactly say that anything goes, or that what will actually happen is to any reasonable extent ambiguous.

I’d upvote this comment twice if I could.

p(wedrifid would upvote a comment twice | he upvoted it once) > 0.95

Would other people have a different approach?
I’d use some loose scale where the quality of the comment correlated with the number of upvotes it got. Assuming that a user could give up to two upvotes per comment, a funny one-liner or a moderately interesting comment would get one vote, and truly insightful ones two.
p(Kaj would upvote a comment twice | he upvoted it once) would probably be somewhere around [.3, .6]

That’s the scale I use. Unfortunately, my ability to (directly) influence how many upvotes it gets is limited to a plus or minus one shift.