I think that one of the problems in this post is actually easier in the real world than in the toy model.
In the toy model the AI has to succeed by maximizing the agent’s True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent’s model of reality to be wrong in places.
But in the real world, humans don’t have a unique set of True Values or even a unique model of the world—we’re non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn’t make sense.
Thus in the real world we cannot require that the AI has to maximize humans’ True Values, we can only ask that it models humans (and we might have desiderata about how it does that modeling and what the end results should contain), and satisfy the modeled values. And in some ways this is actually a bit reassuring, because I’m pretty sure that it’s possible to get better final results on this problem than on learning the toy model agent’s True Values—maybe not in the simplest case, but as you add things like lack of introspection, distributional shift, meta-preferences like identifying some behavior as “bias,” etc.
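(To gesture at what I mean in toy form: here’s a minimal sketch of “model the human, then satisfy the modeled values” rather than “maximize the True Values directly.” The preference data, the linear model class, and every name here are invented purely for illustration.)

```python
import numpy as np

# Toy sketch only: fit *some* model of the human from observed choices,
# then optimize what the model says, rather than assuming access to True Values.
rng = np.random.default_rng(0)

# Pretend-observations: feature vectors of options the human took vs. passed over.
chosen = rng.normal(loc=+1.0, size=(50, 3))
passed = rng.normal(loc=-1.0, size=(50, 3))

def fit_modeled_values(chosen, passed):
    """Crude preference model: the direction separating chosen from passed options."""
    return chosen.mean(axis=0) - passed.mean(axis=0)

def satisfy_modeled_values(candidates, w):
    """Pick the candidate scoring highest under the *modeled* values."""
    return candidates[np.argmax(candidates @ w)]

w = fit_modeled_values(chosen, passed)
candidates = rng.normal(size=(10, 3))
print(satisfy_modeled_values(candidates, w))
```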
EDIT FROM THE FUTURE: Stumbled across this comment 10 months later, and felt like my writing style was awkward and hard to understand—followed by a “click” where suddenly it became obvious and natural to me. Now I worry everyone else gets the awkward and hard to understand version all the time.
This comment seems wrong to me in ways that make me think I’m missing your point.
Some examples and what seems wrong about them, with the understanding that I’m probably misunderstanding what you’re trying to point to:
we’re non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn’t make sense
I have no idea why this would be tied to non-Cartesian-ness.
But in the real world, humans don’t have a unique set of True Values or even a unique model of the world
There are certainly ways in which humans diverge from Bayesian utility maximization, but I don’t see why we would think that values or models are non-unique. Certainly we use multiple levels of abstraction, or multiple sub-models, but that’s quite different from having multiple distinct world-models.
Thus in the real world we cannot require that the AI has to maximize humans’ True Values, we can only ask that it models humans [...] and satisfy the modeled values.
How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say “just pick one set of values/one world model and satisfy that”, which seems wrong.
One way to interpret all this is that you’re pointing to things like submodels, subagents, multiple abstraction levels, etc. But then I don’t see why the problem would be any easier in the real world than in the model, since all of those things can be expressed in the model (or a straightforward extension of the model, in the case of subagents).
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
If you don’t agree with me on this, why didn’t you reply when I spent about six months just writing posts that were all variations of this idea? Here’s Scott Alexander making the basic point.
It’s like… is there a True rational approximation of pi? Well, 22⁄7 is pretty good, but 355⁄113 is more precise, if harder to remember. And just 3 is really easy to remember, but not as precise. And of course there’s the arbitrarily large “approximation” that is 3.141592… Depending on what you need to use it for, you might have different preferences about the tradeoff between simplicity and precision. There is no True rational approximation of pi. True Human Values are similar, except instead of one tradeoff that you can make it’s approximately one bajillion.
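(Putting rough numbers on the pi example, just to make the tradeoff visible; using denominator size as a stand-in for “hard to remember” is my own arbitrary choice.)

```python
from math import pi
from fractions import Fraction

# Each approximation buys precision at the cost of being harder to carry around.
candidates = {
    "3": Fraction(3),
    "22/7": Fraction(22, 7),
    "355/113": Fraction(355, 113),
    "3.141592": Fraction("3.141592"),
}

for name, frac in candidates.items():
    error = abs(float(frac) - pi)
    print(f"{name:>9}  denominator={frac.denominator:<7}  error={error:.2e}")
```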
we’re non-Cartesian, which means that when we talk about our values, we are assuming a specific sort of way of talking about the world, and there are other ways of talking about the world in which talk about our values doesn’t make sense
I have no idea why this would be tied to non-Cartesian-ness.
If a Cartesian agent were talking about their values, they could just be like “you know, those things that are specified as my values in the logic-stuff my mind is made out of.” (Though this assumes some level of introspective access / genre savviness that needn’t be assumed, so if you don’t want to assume this then we can just say I was mistaken.) When a human talks about their values they can’t take that shortcut, and instead have to specify values as a function of how they affect their behavior. This introduces the dependency on how we’re breaking down the world into categories like “human behavior.”
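(A cartoon of the contrast; the agent class, the toy record of goings-on, and both ways of carving it into “behavior” are all invented purely for illustration.)

```python
from collections import Counter

class CartesianAgent:
    """Toy agent whose values are an explicit, inspectable slot in its own machinery."""
    def __init__(self, utility):
        self.utility = utility  # the "logic-stuff" its mind is made out of

    def report_values(self):
        # It can just point at the slot directly; no carving-up of the world needed.
        return self.utility

def infer_values_from_behavior(raw_record, carve_into_actions):
    """Non-Cartesian case: first carve the raw physical record into "actions"
    under some chosen abstraction; the inferred values depend on that choice.
    (Stand-in inference rule: whatever the agent "did" most often.)"""
    actions = carve_into_actions(raw_record)
    return Counter(actions).most_common(1)[0][0]

raw = ["muscle twitch", "muscle twitch", "hand closes on apple", "chewing", "chewing"]

def coarse(record):  # one way of carving the record into "behavior"
    return ["eats apple"] * len(record)

def fine(record):    # another way
    return list(record)

print(CartesianAgent("maximize paperclips").report_values())  # direct readout
print(infer_values_from_behavior(raw, coarse))                # "eats apple"
print(infer_values_from_behavior(raw, fine))                  # "muscle twitch"
```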
Thus in the real world we cannot require that the AI has to maximize humans’ True Values, we can only ask that it models humans [...] and satisfy the modeled values.
How does this follow from non-uniqueness of values/world models? If humans have more than one set of values, or more than one world model, then this seems to say “just pick one set of values/one world model and satisfy that”, which seems wrong.
Well, if there were unique values, we could say “maximize the unique values.” Since there aren’t, we can’t. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we’re going to have to do with that wrong-seeming.
Well, if there were unique values, we could say “maximize the unique values.” Since there aren’t, we can’t. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we’re going to have to do with that wrong-seeming.
Before I get into the meat of the response… I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human’s world-model, which still gives rise to all the same problems as a total order in the human’s world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)
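(A concrete toy version of that reduction, with made-up outcomes and subagent utilities: the partial order is just the unanimous agreement of the total orders.)

```python
# Represent the human as a committee of utility-maximizing subagents,
# and say A > B only when they all agree.
subagent_utilities = [
    {"rest": 3, "work": 1, "doomscroll": 0},  # subagent that values recovery
    {"rest": 1, "work": 3, "doomscroll": 0},  # subagent that values output
]

def committee_prefers(a, b):
    """Partial order induced by unanimous agreement of the total orders."""
    return all(u[a] > u[b] for u in subagent_utilities)

print(committee_prefers("rest", "doomscroll"))  # True  -- unanimous
print(committee_prefers("rest", "work"))        # False -- subagents disagree...
print(committee_prefers("work", "rest"))        # False -- ...so the pair is incomparable
# Each subagent is still an ordinary utility maximizer over its own (latent,
# model-relative) variables, so the problems in the OP still apply to each.
```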
Anyway, I don’t think that’s the main disconnect here...
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
Ok, I think I see what you’re saying now. I am of course on board with the notion that e.g. human values do not make sense when we’re modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between “the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction”, and “there is not a unique, well-defined referent of ‘human values’”. The former does not imply the latter. Indeed, the difference is essentially the same issue as in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human’s model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its “goals” do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different “goals” to the robot—e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. “The utility function I hard-coded into the robot” is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that “the utility function I hard-coded into the robot” would correspond to some unambiguous thing in the real world—although specifying exactly what that thing is, is an instance of the pointers problem.
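(A toy version of the robot analogy; the utility function, the policy, and the little wheel controller are all invented for illustration.)

```python
# "The utility function I hard-coded into the robot" -- within the designer's
# world-model this picks out one specific object, with no ambiguity:
def hardcoded_utility(state):
    return -abs(state["distance_to_charger"])

def robot_policy(state):
    """Top-level control: act to maximize the hard-coded utility."""
    moves = {"forward": -1.0, "backward": +1.0, "stay": 0.0}
    return max(moves, key=lambda a: hardcoded_utility(
        {"distance_to_charger": state["distance_to_charger"] + moves[a]}))

def wheel_speed_controller(target_rpm, measured_rpm):
    """A subsystem with its own little control loop. Modeled at this level,
    an observer might ascribe the robot the 'goal' of holding wheel RPM --
    a different goal-ascription, at a different abstraction."""
    return 0.5 * (target_rpm - measured_rpm)

# Both descriptions are about the same pile of atoms, but "the utility function
# I hard-coded" unambiguously refers to `hardcoded_utility` within my model.
print(robot_policy({"distance_to_charger": 3.0}))  # "forward"
```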
Does that make sense? Am I still missing something here?