This comment seems wrong to me in ways that make me think I’m missing your point.
Some examples and what seems wrong about them, with the understanding that I’m probably misunderstanding what you’re trying to point to:
we’re non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn’t make sense
I have no idea why this would be tied to non-Cartesian-ness.
But in the real world, humans don’t have a unique set of True Values or even a unique model of the world
There are certainly ways in which humans diverge from Bayesian utility maximization, but I don’t see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that’s quite different from having multiple distinct world-models.
Thus in the real world we cannot require that the AI has to maximize humans’ True Values, we can only ask that it models humans [...] and satisfy the modeled values.
How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say “just pick one set of values/one world model and satisfy that”, which seems wrong.
One way to interpret all this is that you’re pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don’t see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
If you don’t agree with me on this, why didn’t you reply when I spent about six months just writing posts that were all variations of this idea? Here’s Scott Alexander making the basic point.
It’s like… is there a True rational approximation of pi? Well, 22⁄7 is pretty good, but 355⁄113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there’s the arbitrarily large “approximation” that is 3.141592… Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff that you can make it’s approximately one bajillion.
we’re non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn’t make sense
I have no idea why this would be tied to non-Cartesian-ness.
If a Cartesian agent was talking about their values, they could just be like “you know, those things that are specified as my values in the logic-stuff my mind is made out of.” (Though this assumes some level of introspective access / genre savviness that needn’t be assumed, so if you don’t want to assume this then we can just say I was mistaken.). When a human talks about their values they can’t take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we’re breaking down the world into categories like “human behavior.”
Thus in the real world we cannot require that the AI has to maximize humans’ True Values, we can only ask that it models humans [...] and satisfy the modeled values.
How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say “just pick one set of values/one world model and satisfy that”, which seems wrong.
Well, if there were unique values, we could say “maximize the unique values.” Since there aren’t, we can’t. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we’re going to have to do with that wrong-seeming.
Well, if there were unique values, we could say “maximize the unique values.” Since there aren’t, we can’t. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we’re going to have to do with that wrong-seeming.
Before I get into the meat of the response… I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human’s world-model, which still gives rise to all the same problems as a total order in the human’s world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)
Anyway, I don’t think that’s the main disconnect here...
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
Ok, I think I see what you’re saying now. I am of course on board with the notion that e.g. human values do not make sense when we’re modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between “the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction”, and “there is not a unique, well-defined referent of ‘human values’”. The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human’s model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its “goals” do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different “goals” to the robot—e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. “The utility function I hard-coded into the robot” is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that “the utility function I hard-coded into the robot” would correspond to some unambiguous thing in the real world—although specifying exactly what that thing is, is an instance of the pointers problem.
Does that make sense? Am I still missing something here?
This comment seems wrong to me in ways that make me think I’m missing your point.
Some examples and what seems wrong about them, with the understanding that I’m probably misunderstanding what you’re trying to point to:
I have no idea why this would be tied to non-Cartesian-ness.
There are certainly ways in which humans diverge from Bayesian utility maximization, but I don’t see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that’s quite different from having multiple distinct world-models.
How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say “just pick one set of values/one world model and satisfy that”, which seems wrong.
One way to interpret all this is that you’re pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don’t see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
If you don’t agree with me on this, why didn’t you reply when I spent about six months just writing posts that were all variations of this idea? Here’s Scott Alexander making the basic point.
It’s like… is there a True rational approximation of pi? Well, 22⁄7 is pretty good, but 355⁄113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there’s the arbitrarily large “approximation” that is 3.141592… Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff that you can make it’s approximately one bajillion.
If a Cartesian agent was talking about their values, they could just be like “you know, those things that are specified as my values in the logic-stuff my mind is made out of.” (Though this assumes some level of introspective access / genre savviness that needn’t be assumed, so if you don’t want to assume this then we can just say I was mistaken.). When a human talks about their values they can’t take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we’re breaking down the world into categories like “human behavior.”
Well, if there were unique values, we could say “maximize the unique values.” Since there aren’t, we can’t. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we’re going to have to do with that wrong-seeming.
Before I get into the meat of the response… I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human’s world-model, which still gives rise to all the same problems as a total order in the human’s world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)
Anyway, I don’t think that’s the main disconnect here...
Ok, I think I see what you’re saying now. I am of course on board with the notion that e.g. human values do not make sense when we’re modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between “the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction”, and “there is not a unique, well-defined referent of ‘human values’”. The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human’s model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its “goals” do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different “goals” to the robot—e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. “The utility function I hard-coded into the robot” is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that “the utility function I hard-coded into the robot” would correspond to some unambiguous thing in the real world—although specifying exactly what that thing is, is an instance of the pointers problem.
Does that make sense? Am I still missing something here?