Charlie Steiner comments on An observation about Hubinger et al.’s framework for learned optimization

Charlie Steiner 14 May 2022 4:17 UTC
Suppose I have two robots, one of which wants to turn the world into paperclips and the other of which wants to turn the world into staples.

Your argument here seems to extend to saying that we can’t call these robots or their preferences “misaligned with each other,” because robot A’s preferences are used to search over robot A’s actions, and robot B’s preferences over robot B’s.

I don’t think that argument makes sense. The action spaces are different, but both robots are still trying to affect the same world and steer it in different directions. We could formalize this by defining for each robot a utility function over states of the world.
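As a rough illustration of that formalization, here is a minimal sketch. Everything in it is a toy assumption rather than anything from the original post: the `World` type, the `METAL` budget, and the two utility functions are hypothetical names chosen for the example. The point it tries to show is that both utilities take the same type of argument (a world state), so they can be compared directly even though each robot would search over its own actions.

```python
from typing import Callable, List, NamedTuple


class World(NamedTuple):
    """A toy world state shared by both robots (hypothetical model)."""
    paperclips: int
    staples: int


# Assumption: a fixed budget of metal, each unit becomes one clip or one staple.
METAL = 3


def reachable_worlds(metal: int = METAL) -> List[World]:
    """Enumerate every allocation of metal into paperclips and staples."""
    return [
        World(p, s)
        for p in range(metal + 1)
        for s in range(metal + 1 - p)
    ]


def utility_a(w: World) -> int:
    """Robot A (paperclipper): cares only about paperclips."""
    return w.paperclips


def utility_b(w: World) -> int:
    """Robot B (stapler): cares only about staples."""
    return w.staples


def best_worlds(utility: Callable[[World], int], worlds: List[World]) -> List[World]:
    """Return the world states that maximize the given utility function."""
    top = max(utility(w) for w in worlds)
    return [w for w in worlds if utility(w) == top]


worlds = reachable_worlds()
best_a = best_worlds(utility_a, worlds)
best_b = best_worlds(utility_b, worlds)

# Both utilities are defined over the same World type, so their optima
# can be compared; here the two sets of preferred worlds do not overlap.
print(best_a)
print(best_b)
print(set(best_a).isdisjoint(best_b))
```

In this toy model the two robots' preferred world states are disjoint, which is one simple sense in which they could be called misaligned with each other. Note that nothing in the comparison refers to either robot's action space.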

There is one important type-signature point here: robots are made of atoms and utility functions are not. The robots’ utility functions don’t live in the physical robots (they’re not the right type of stuff); they live in our abstract model of the robots. This doesn’t mean it’s futile to compare them, though: it’s fine to use abstractions within their domains of validity.