johnswentworth comments on The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworth 19 Nov 2020 0:18 UTC
LW: 9 AF: 3
AF
Well, if there were unique values, we could say “maximize the unique values.” Since there aren’t, we can’t. We can still do some similar things, and I agree, those do seem wrong. See this post for basically my argument for what we’re going to have to do with that wrong-seeming.
Before I get into the meat of the response… I certainly agree that values are probably a partial order, not a total order. However, that still leaves basically all the problems in the OP: that partial order is still a function of latent variables in the human’s world-model, which still gives rise to all the same problems as a total order in the human’s world-model. (Intuitive way to conceptualize this: we can represent the partial order as a set of total orders, i.e. represent the human as a set of utility-maximizing subagents. Each of those subagents is still a normal Bayesian utility maximizer, and still suffers from the problems in the OP.)
Anyway, I don’t think that’s the main disconnect here...
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
Ok, I think I see what you’re saying now. I am of course on board with the notion that e.g. human values do not make sense when we’re modelling the human at the level of atoms. I also agree that the physical system which comprises a human can be modeled as wanting different things at different levels of abstraction.
However, there is a difference between “the physical system which comprises a human can be interpreted as wanting different things at different levels of abstraction”, and “there is not a unique, well-defined referent of ‘human values’”. The former does not imply the latter. Indeed, the difference is essentially the same issue in the OP: one of these statements has a type-signature which lives in the physical world, while the other has a type-signature which lives in a human’s model.
An analogy: consider a robot into which I hard-code a utility function and world model. This is a physical robot; on the level of atoms, its “goals” do not exist in any more real a sense than human values do. As with humans, we can model the robot at multiple levels of abstraction, and these different models may ascribe different “goals” to the robot—e.g. modelling it at the level of an electronic circuit or at the level of assembly code may ascribe different goals to the system, there may be subsystems with their own little control loops, etc.
And yet, when I talk about the utility function I hard-coded into the robot, there is no ambiguity about which thing I am talking about. “The utility function I hard-coded into the robot” is a concept within my own world-model. That world-model specifies the relevant level of abstraction at which the concept lives. And it seems pretty clear that “the utility function I hard-coded into the robot” would correspond to some unambiguous thing in the real world—although specifying exactly what that thing is, is an instance of the pointers problem.
Does that make sense? Am I still missing something here?
What links here?
- Noosphere89's comment on A shot at the diamond-alignment problem by TurnTrout (26 Dec 2024 16:11 UTC; 8 points)