I had been weakly leaning towards the idea that a solution to the pointers problem should be a solution to deferral—i.e. it tells us when the agent defers to the AI’s world model, and what mapping it uses to translate AI-variables to agent-variables. This makes me lean more in that direction.
What I’d like to add to this post is that we shouldn’t be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don’t think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don’t need to.)
I see a couple different claims mixed together here:
1. Solving the metaphilosophical problem of how we “should” handle this is necessary and/or sufficient in its own right.
2. There probably isn’t a general way to find correspondences between models, so we need to operate at the meta-level.
The main thing I disagree with is the idea that there probably isn’t a general way to find correspondences between models. There are clearly cases where correspondence fails outright (like the ghosts example), but I think the problem is probably solvable if we allow for error-cases (by which I mean cases where the correspondence throws an error, not cases where it returns an incorrect result). Furthermore, assuming that natural abstractions work the way I think they do, I think the problem is solvable in practice with relatively few error-cases, potentially even using “prosaic” AI world-models. It’s the sort of thing which would dramatically improve the success chances of alignment by default.
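To gesture at what I mean by error-cases: a toy correspondence that either translates an agent-variable into the AI’s model or throws, rather than ever returning a confident-but-wrong answer. Everything here is made up for illustration; the variable names and the hand-written lookup table are just stand-ins for whatever a learned natural-abstraction correspondence would produce.

```python
# Toy correspondence between an AI world-model and an agent world-model.
# The contract is the point: translate, or throw -- never silently guess.

AIState = dict  # the AI's latent variables, e.g. {"cat_mat_contact_predicate": True}


class NoCorrespondenceError(Exception):
    """Raised when an agent-variable has no counterpart in the AI's model
    (the 'ghosts' case), instead of returning an incorrect translation."""


def translate(ai_state: AIState, agent_variable: str):
    """Map one agent-variable onto the AI's world model, or fail loudly."""
    # Stand-in for a learned correspondence; a real version would be
    # inferred from the two models, not hand-written.
    correspondence = {
        "temperature_of_the_room": "mean_molecular_kinetic_energy",
        "the_cat_is_on_the_mat": "cat_mat_contact_predicate",
    }
    if agent_variable not in correspondence:
        raise NoCorrespondenceError(agent_variable)
    return ai_state[correspondence[agent_variable]]
```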
I absolutely do agree that we still need the metaphilosophical stuff for a first-best solution. In particular, there is no obviously-correct way to handle the correspondence error-cases, and of course anything else in the whole setup can also be close-but-not-exactly-right. I do think that combining a solution to the pointers problem with something like the communication prior strategy, plus some obvious tweaks like partially-ordered preferences and some model of logical uncertainty, would probably be enough to land us in the basin of convergence (assuming the starting model was decent), but even then I’d want metaphilosophical tools to be confident that something like that would work.
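And by “partially-ordered preferences” I just mean something like the following toy comparison, which is allowed to say “incomparable” instead of forcing every pair of outcomes into a total order. The criteria and scores are purely illustrative.

```python
from typing import Optional


def prefers(outcome_a: dict, outcome_b: dict) -> Optional[bool]:
    """Partial preference comparison over toy outcomes scored on several criteria.

    Returns True if outcome_a dominates, False if outcome_b dominates, and
    None when the outcomes are incomparable (or tied) -- no total order is forced.
    """
    criteria = outcome_a.keys() & outcome_b.keys()
    a_better_somewhere = any(outcome_a[c] > outcome_b[c] for c in criteria)
    b_better_somewhere = any(outcome_b[c] > outcome_a[c] for c in criteria)
    if a_better_somewhere and not b_better_somewhere:
        return True
    if b_better_somewhere and not a_better_somewhere:
        return False
    return None  # incomparable or tied: the agent declines to rank them


# Example: better on one criterion, worse on another -> incomparable.
assert prefers({"health": 3, "wealth": 1}, {"health": 1, "wealth": 3}) is None
```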
This makes a lot of sense.