I’d like to propose the idea of aligning AI by reverse-engineering its world model and using this to specify its behavior or utility function. I haven’t seen this discussed before, but I would greatly appreciate feedback or links to any past work on this.
For example, suppose a smart AI models humans, and suppose that model explicitly represents the humans’ preferences. Then people who reverse-engineered the model could use those preferences as the AI’s own preferences. Even if the AI lacks a model with explicit preferences, I think it would still contain an accurate model of human behavior. People who reverse-engineer the AI’s world model could then use this as a model of human behavior, which could be used to implement iterated amplification with HCH, or just plain imitation.
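To make the idea a little more concrete, here is a minimal toy sketch in PyTorch of what “use the reverse-engineered preference model as the AI’s preferences” might look like once interpretability work has located the relevant submodel. The module names (e.g. `human_preference_head`) are entirely hypothetical placeholders; actually identifying such a component inside a real model is the hard part that this sketch assumes away.

```python
import torch
import torch.nn as nn

# Toy stand-in for a learned world model. The submodule names here
# (e.g. `human_preference_head`) are hypothetical: in practice the
# relevant component would have to be located by interpretability
# work, not by having a convenient name.
class ToyWorldModel(nn.Module):
    def __init__(self, state_dim=16):
        super().__init__()
        self.encoder = nn.Linear(state_dim, 32)        # encodes world states into a latent
        self.dynamics = nn.Linear(32, 32)              # predicts the next latent state
        self.human_preference_head = nn.Linear(32, 1)  # scores states by modeled human preference

    def forward(self, state):
        z = torch.relu(self.encoder(state))
        return self.dynamics(z)

def extract_preference_scorer(world_model):
    """Wrap the (hypothetically located) preference submodel as a reward function."""
    def reward(state):
        with torch.no_grad():
            z = torch.relu(world_model.encoder(state))
            return world_model.human_preference_head(z).squeeze(-1)
    return reward

# Usage: score candidate future states with the extracted preference model.
world_model = ToyWorldModel()
reward_fn = extract_preference_scorer(world_model)
candidate_states = torch.randn(4, 16)
print(reward_fn(candidate_states))  # higher = more preferred under the extracted model
```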
One big potential advantage of alignment via reverse-engineering is that the training data for it would be very easy to get: just let the AI look at the world.
The other big potential advantage is that it avoids the need to precisely define a way of learning our values. It doesn’t require a general method of picking out us or our values from world states, for example via inverse reinforcement learning. Instead, we would just need to be able to pick out the models of humans or their preferences within a single model. This sounds potentially much easier than providing a general method of doing so. As with many things, “you know it when you see it.” With sufficiently good interpretability, perhaps the same is true of human models and preferences.