Thus, the criterion for ascribing preferences to a physical system is that the actual physics has to be well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”.
I don’t think this simple characterisation resembles the truth: the whole point of this enterprise is to make sure things go differently, in a way they just couldn’t proceed by themselves. Thus, observing existing “tendencies” doesn’t quite capture the idea of preference.
make sure things go differently, in a way they just couldn’t proceed by themselves. Thus, observing existing “tendencies” doesn’t quite capture the idea of preference.
I should have been clearer: you have to draw a boundary around the “optimizing agent”, and look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer. If the difference is well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”, then you have an optimizer.
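To make this criterion concrete, here is a rough Python sketch of the with-versus-without comparison described above. The environment, the agent policy, and the candidate utility functions are invented stand-ins, chosen only to illustrate the shape of the test; this is not a proposed definition.

```python
# Toy sketch of the criterion above: draw a boundary around the candidate
# "optimizing agent", compare how the environment evolves with and without it,
# and ask whether the difference looks like optimization of some candidate
# utility function.  The environment, policy and utilities are invented stand-ins.

import random

def step_environment(state, intervention=0.0):
    """One noisy step of a 1-D toy environment; `intervention` is the agent's push."""
    drift = -0.1 * state                     # the environment's own "tendency"
    noise = random.gauss(0.0, 0.05)
    return state + drift + noise + intervention

def agent_policy(state, target=1.0, gain=0.3):
    """A toy agent that nudges the state toward `target`."""
    return gain * (target - state)

def rollout(initial_state, steps, with_agent):
    state = initial_state
    for _ in range(steps):
        push = agent_policy(state) if with_agent else 0.0
        state = step_environment(state, push)
    return state

def preference_score(candidate_utility, trials=200, steps=50):
    """How much better do with-agent end states score than without-agent ones?"""
    gain = 0.0
    for _ in range(trials):
        s0 = random.uniform(-2.0, 2.0)
        gain += (candidate_utility(rollout(s0, steps, with_agent=True))
                 - candidate_utility(rollout(s0, steps, with_agent=False)))
    return gain / trials

if __name__ == "__main__":
    # A utility peaked at state 1.0 fits this agent; one peaked at -1.0 does not.
    print(preference_score(lambda s: -(s - 1.0) ** 2))   # clearly positive
    print(preference_score(lambda s: -(s + 1.0) ** 2))   # clearly negative
```

The test is deliberately crude: it only asks whether the agent’s presence systematically moves the environment up a candidate utility’s ranking, which is the weakest reading of “well-approximated by a function that optimizes for a preferred state”.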
I don’t hear differently… I even suspect that preference is introspective, that is, it depends on the way the system works “internally”, not just on how it interacts with the environment. That is, two agents with different preferences may do exactly the same thing in all contexts. Even if not, it’s a long way between how the agent (in its craziness and stupidity) actually changes the environment, and how it would prefer (on reflection, if it was smarter and saner) the environment to change.
Even if not, it’s a long way between how the agent (in its craziness and stupidity) actually changes the environment, and how it would prefer (on reflection, if it was smarter and saner) the environment to change.
That is true. If the agent has a well-defined “predictive module” which has a “map” (probability distribution over the environment given an interaction history), and some “other stuff”, then you can clamp the predictive module down to the truth, and then perform what I said before:
look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer. If the difference is well-approximated by a function that optimizes for a preferred state, for some value of “preferred state”, then you have an optimizer.
And you probably also want to somehow formalize the idea that there is a difference between what an agent will try to achieve if it has only limited means—e.g. a lone human in a forest with no tools, clothes or other humans—and what the agent will try to achieve with more powerful means—e.g. with machinery and tools, or, in the limit, with a whole technological infrastructure and unlimited computing power at its disposal.
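A minimal sketch of these two refinements, under similarly toy assumptions: the agent is split into a possibly mistaken predictive “map” and an evaluation of outcomes (the “other stuff”); extrapolation clamps the map to the true dynamics and then enlarges the action set to see what the agent steers the world toward with more powerful means. Nothing here is meant as a formalization; it only shows where the clamping and the capability scaling would plug in.

```python
# Toy sketch of the two refinements above: (1) clamp the agent's possibly-wrong
# "map" of the environment to the true dynamics while keeping how it scores
# outcomes, and (2) vary its "means" by enlarging the action set, then see what
# it steers the world toward.  All dynamics, numbers and names are illustrative.

def true_dynamics(state, action):
    return 0.9 * state + action           # how the world actually responds

def agents_map(state, action):
    return 0.5 * state + 2.0 * action     # the agent's mistaken predictive module

def agents_evaluation(state):
    return -(state - 3.0) ** 2            # the "other stuff": how it scores outcomes

def decide(state, actions, model):
    """The agent's decision rule: pick the action whose predicted outcome it likes best."""
    return max(actions, key=lambda a: agents_evaluation(model(state, a)))

def steered_state(actions, model, start=0.0, steps=40):
    """Run the agent forward under the true dynamics and return where it ends up."""
    state = start
    for _ in range(steps):
        state = true_dynamics(state, decide(state, actions, model))
    return state

if __name__ == "__main__":
    weak_means = [-0.1, 0.0, 0.1]                       # "a lone human in a forest"
    strong_means = [x / 10.0 for x in range(-30, 31)]   # "a whole technological infrastructure"

    print(steered_state(strong_means, agents_map))      # mistaken map: ends up near 4.3, not 3
    print(steered_state(weak_means, true_dynamics))     # true map, weak means: stalls near 1
    print(steered_state(strong_means, true_dynamics))   # true map, strong means: reaches 3
```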
I want to point out that in the interpretation of the prior as weights on possible universes, specifically as how much one cares about different universes, we can’t just replace “incorrect” beliefs with “the truth”. In this interpretation, there can still be errors in one’s beliefs caused by things like past computational mistakes, and I think fixing those errors would constitute helping, but the prior perhaps needs to be preserved as part of preference.
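A toy sketch of the distinction being drawn here, assuming the caring-weights reading: correcting a computational slip means recomputing the agent’s posterior from its own preserved prior and its own evidence, whereas substituting the helper’s prior can change which action the agent ends up preferring. The worlds, numbers and utilities are made up for illustration.

```python
# Toy sketch of the caring-weights reading: a helper can recompute the agent's
# botched Bayesian update from the agent's own prior and evidence, but swapping
# in the helper's prior changes the agent's values rather than fixing a mistake.
# Worlds, numbers and utilities below are all made up.

def posterior(prior, likelihoods):
    """Correct Bayesian update: normalize prior[w] * likelihoods[w] over worlds w."""
    unnorm = {w: prior[w] * likelihoods[w] for w in prior}
    total = sum(unnorm.values())
    return {w: p / total for w, p in unnorm.items()}

def expected_utility(action, beliefs, utility):
    return sum(beliefs[w] * utility[(w, action)] for w in beliefs)

agents_prior = {"w1": 0.9, "w2": 0.1}           # preserved as part of the agent's preference
likelihoods = {"w1": 0.4, "w2": 0.6}            # how likely the agent's observations are in each world
agents_buggy_beliefs = {"w1": 0.3, "w2": 0.7}   # a past computational mistake in updating
helpers_prior = {"w1": 0.5, "w2": 0.5}          # the helper's own prior, not to be substituted

utility = {("w1", "a"): 1.0, ("w1", "b"): 0.0,
           ("w2", "a"): 0.0, ("w2", "b"): 1.0}

scenarios = {
    "buggy beliefs": agents_buggy_beliefs,
    "corrected from the agent's own prior": posterior(agents_prior, likelihoods),
    "helper's prior substituted": posterior(helpers_prior, likelihoods),
}
for label, beliefs in scenarios.items():
    best = max("ab", key=lambda a: expected_utility(a, beliefs, utility))
    print(label, "->", best)   # picks b, a, b: fixing the slip helps; swapping priors changes the answer
```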
I agree that in the interpretation of the prior as weights on possible universes, specifically as how much one cares about different universes, things get more complicated.
Actually, we had a discussion about my discomfort with your interpretation, and it seems that in order for me to see why you endorse this interpretation, I’d have to read up on various paradoxes, e.g. the Sleeping Beauty problem.
Yeah, maybe. But it doesn’t.
Yeah, I mean this discussion is, rather amusingly, reminiscent of my first encounter with the CEV problem 2.5 years ago.
If the agent has a well-defined “predictive module” which has a “map” (probability distribution over the environment given an interaction history), and some “other stuff”, then you can clamp the predictive module down to the truth.
Comment by Ricky Loynd Jun 23, 2007 7:39 am:
Here’s my attempt to summarize a common point that Roko and I are trying to make. The underlying motivation for extrapolating volition sounds reasonable, but it depends critically on the AI’s ability to distinguish between goals and beliefs, between preferences and expectations, so that it can model human goals and preferences while substituting its own correct beliefs and expectations. But when you start dissecting most human goals and preferences, you find they contain deeper layers of belief and expectation. If you keep stripping those away, you eventually reach raw biological drives, which are not themselves human beliefs or expectations. (Though even they are beliefs and expectations of evolution; let’s ignore that for the moment.)
Once you strip away human beliefs and expectations, nothing remains but biological drives, which even the animals have. Yes, an animal, by virtue of its biological drives and ability to act, is more than a predicting rock, but that doesn’t address the issue at hand.
Why is it a tragedy when a loved one dies? Is it because the world no longer contains their particular genetic weighting of biological drives? Of course not. After all, they may have left an identical twin to carry forward the very same genetic combination. But it’s not the biology that matters to us. We grieve because what really made that person a person is now gone, and that’s all in the brain: the shared experiences, their beliefs, whether correct or mistaken or indeterminate, their hopes and dreams, all those things that separate humans from animals, and indeed, that separate one human from most other humans. All that the brain absorbs and becomes throughout the course of a life, we call the soul, and we see it as our very humanity, that big, messy probability distribution describing our accumulated beliefs and expectations about ourselves, the universe, and our place in it.
So if the AI models a human while substituting its own beliefs and anticipations of future experiences, then the AI has discarded all that we value in each other. UNLESS you draw a line somewhere, and crisply define which human beliefs get replaced and which ones don’t. Constructing toy examples where such a line is possible to imagine does not mean that the distinction can be made in any general way, but CEV absolutely requires that there be a concrete distinction.
Constructing toy examples where such a line is possible to imagine does not mean that the distinction can be made in any general way, but CEV absolutely requires that there be a concrete distinction.
Basically, CEV works to the extent that there exists a belief/desire separation in a given person. In the thread on the SIAI blog, I posted certain cases where human goals are founded on false beliefs or logically inconsistent thinking, sometimes in complex ways. What is left of the time cube guy once you subtract off his false beliefs and delusions? Not much, probably. The guy is effectively not salvageable, because his identity and values are probably so badly tangled up with the false beliefs that there is no principled way to untangle them, no unique way of extrapolating him that should be considered “correct”.
What is left of the time cube guy once you subtract off his false beliefs and delusions? Not much, probably.
Beware: you are making a common-sense-based prediction about what would be the output of a process that you don’t even have the right concepts for specifying! (See my reply to your other comment.)
It is true that I should sprinkle copious amounts of uncertainty on this prediction.
Wow. Too bad I missed this when it was first posted. It’s what I wish I’d said when justifying my reply to Wei_Dai’s attempted belief/values dichotomy here and here.
I don’t fully agree with Ricky here, but I think he makes a half-good point.
The ungood part of his comment—and mine—is that you can only do your best. If certain people’s minds are too messed up to actually extract values from, then they are just not salvageable. My mind definitely has values that are belief-independent, though perhaps not all of what I think of as “my values” have this nice property, so ultimately they might be garbage.
Indeed. Most of the FAI’s job could consist of saying, “Okay, there’s soooooo much I have to disentangle and correct before I can even begin to propose solutions. Sit down and let’s talk.”
Furthermore, from the CEV thread on the SIAI blog:
Comment by Eliezer Yudkowsky Jun 18, 2007 12:52 pm: I furthermore agree that it is not the most elegant idea I have ever had, but then it is trying to solve what appears to be an inherently inelegant problem.
I strongly agree with this: the problem that CEV is the solution to is urgent but it isn’t elegant. Absolutes like “There isn’t a beliefs/desires separation” are unhelpful when solving such inelegant but important problems. There is, in any given person, some kind of separation, and in some people that separation is sufficiently strong that there is a fairly clear and unique way to help them.
I strongly agree with this: the problem that CEV is the solution to is urgent but it isn’t elegant. Absolutes like “There isn’t a beliefs/desires separation” are unhelpful when solving such inelegant but important problems.
One lesson of reductionism, and of the success of simple-laws-based science and technology, is that for real-world systems there might be no simple way of describing them, but there could be a simple way of manipulating their data-rich descriptions. (What’s the yield strength of a car? -- Wrong question!) Given a gigabyte’s worth of problem statement and the right simple formula, you could get an answer to your query. There is a weak analogy with a misapplication of Occam’s razor, where one tries to reduce the amount of stuff rather than the amount of detail in one’s ways of thinking about that stuff.
In the case of beliefs/desires separation, you are looking for a simple problem statement, for a separation in the data describing the person itself. But what you should be looking for is a simple way of implementing the make-smarter-and-better extrapolation on a given pile of data. The beliefs/desires separation, if it’s ever going to be made precise, is going to reside in the structure of this simple transformation, not in the people themselves.
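The yield-strength aside can be made concrete with a toy example: no single number summarizes the car, yet short generic procedures over the detailed part list answer specific queries directly. By analogy, the beliefs/desires separation would live in the simple extrapolation procedure applied to the rich description of a person, not in the description itself. The part list and figures below are invented.

```python
# Toy illustration: there is no single "yield strength of the car", but simple,
# generic procedures over a data-rich description answer concrete queries.
# The part list and figures are invented for illustration.

car_parts = [
    {"name": "door panel",  "material": "steel",    "yield_mpa": 250, "mass_kg": 12.0},
    {"name": "bumper beam", "material": "aluminum", "yield_mpa": 270, "mass_kg":  4.5},
    {"name": "windshield",  "material": "glass",    "yield_mpa":  30, "mass_kg": 11.0},
    {"name": "seat frame",  "material": "steel",    "yield_mpa": 350, "mass_kg":  9.0},
]

def total_mass(parts):
    """A simple formula over the detailed data."""
    return sum(p["mass_kg"] for p in parts)

def parts_that_yield_under(parts, stress_mpa):
    """Another simple query: which parts would yield if this stress reached them?"""
    return [p["name"] for p in parts if p["yield_mpa"] < stress_mpa]

print(total_mass(car_parts))                      # 36.5
print(parts_that_yield_under(car_parts, 260.0))   # ['door panel', 'windshield']
```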
This is a good point. Of course, it would be nice if we could find a general “make-smarter-and-better extrapolation on a given pile of data” algorithm.
But on the other hand, a set of special cases to deal with merely human minds might be the way forward. Even medieval monks had a collection of empirically validated medical practices that worked to an extent, e.g. herbal medicine, but they had no unified theory. Really there is no “unified theory” for healing someone’s body: there are lots of ideas and techniques, from surgery to biochemistry to germ theory. I think that this CEV problem may well turn out to be rather like medicine. Of course, it could look more like wing design, where there is really just one fundamental set of laws, and all else is approximation.
[Y]ou have to draw a boundary around the “optimizing agent”, and look at the difference between the tendencies of the environment without the optimizer, and the tendencies of the environment with the optimizer.
And there’s your “opinion or interpretation”—not just in how you draw the boundary (which didn’t exist in the original ontology), but in your choice of the theory that you use to evaluate your counterfactuals.
Of course, such theories can be better or worse, but only with respect to some prior system of evaluation.
Still, probably a question of Aristotelian vs. Newtonian mechanics, i.e. not hard to see who wins.
Agreed, but not responsive to Mitchell Porter’s original point. (ETA: . . . unless I’m missing your point.)