I’ve been thinking about AI corrigibility lately and have come up with a potential solution that has probably been refuted already, but I’m not aware of a refutation.
The solution I’m proposing is to condition both the actor and the critic on a goal-representing vector g, change g multiple times during training while the model is still weak, and add a baseline to the value function to ensure it doesn’t change when the goal is changed. In other words, we want the agent not to care instrumentally about its goals. For example, if we switch the goal from maximizing paperclips to minimizing paperclips, the model would be trained to maximize the number of paperclips it would-have-produced, and punished during training for wasting effort on controlling its goals. It’s a bit like when we play a game and sometimes don’t mind stopping it in the middle, or changing the rules in the opponent’s favor (e.g. letting them go back and change a move), if the opponent admits that we would probably have won—because we get the same amount of prestige we would expect to get if we continued playing. In such setups, we are not motivated to choose moves based on how likely they are to make the opponent want to continue or stop.
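To make the baseline idea concrete, here is a minimal toy sketch (all names and numbers are hypothetical, and the tabular critic is just for illustration—the proposal itself is about learned actor-critic networks). When the goal switches at state s, a constant is added to all post-switch return targets so the credited value at s is unchanged by the switch, removing any incentive to steer toward states where a favorable goal-switch happens:

```python
# Hypothetical sketch of the goal-switch baseline. V maps (state, goal)
# pairs to value estimates; switch_baseline returns the constant b such
# that V(s, g_new) + b == V(s, g_old), i.e. the switch is value-neutral.

def switch_baseline(V, s, g_old, g_new):
    """Constant added to post-switch returns so the switch doesn't change V at s."""
    return V[(s, g_old)] - V[(s, g_new)]

# Toy tabular critic (numbers invented for illustration):
V = {
    ("s0", "max_clips"): 10.0,  # many paperclips reachable from s0
    ("s0", "min_clips"): 2.0,   # same state, flipped goal
}

b = switch_baseline(V, "s0", "max_clips", "min_clips")
# Post-switch return targets become G_new + b, so the total return
# credited at the switch point equals what the old goal promised:
assert V[("s0", "min_clips")] + b == V[("s0", "max_clips")]
print(b)  # 8.0
```

Under this baselining, whether the goal gets switched (and to what) makes no difference to the return the agent is trained on at the moment of the switch, which is the sense in which it shouldn’t learn to care about controlling its goals.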
I haven’t been able to identify any obvious flaws in it, and I’m curious to hear from the community if they know of any serious problems or can think of any. My best guess is that the path dependence created by the baselines may allow the model to “pump value” somehow—but I don’t see a specific mechanism that seems simpler or otherwise more likely to evolve than corrigibility.
This sort of works, but not enough to solve it.
A core problem lies in the distribution of goals you vary over during training. The AI will be trained to be corrigible within the range of that distribution, but there is no particular guarantee that it will be corrigible outside it.
So you need to make sure that your distribution of goals contains human values. How do you guarantee that, without getting Goodharted into instead containing something that only superficially resembles human values?
It might be tempting to achieve this by making the distribution very general, with lots of varied goals, so that it contains lots of alien values including human values. But then human values are given exponentially small probability, which utility-wise is similar to the distribution not containing human values.
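A toy calculation of why “very general” implies “exponentially small” (the setup is assumed for illustration—nothing in the proposal fixes this particular parameterization):

```python
# Assumed toy parameterization: goals drawn uniformly from d-dimensional
# binary vectors. Any one specific goal -- including whichever vector
# happens to encode human values -- then gets probability 2**-d.
d = 100
p_specific_goal = 2.0 ** -d
print(p_specific_goal)  # ~7.9e-31
```

So the more general the goal distribution, the smaller the weight on any particular goal in it, which is the trade-off driving this objection.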
So you need to somehow give human values a high probability within the distribution. But at that point you’re most of the way to just figuring out what human values are in the first place and directly aligning to them.
“which utility-wise is similar to the distribution not containing human values”—from the point of view of corrigibility to human values, or of learning the capabilities to achieve human values? For corrigibility, I don’t see why you need high probability for any specific new goal, as long as the distribution is diverse enough that there is no simpler generalization than “don’t care about controlling goals”. For capabilities, my intuition is that starting with superficially-aligned goals is enough.
Hmm, I think I retract my point. I suspect something similar to my point applies but as written it doesn’t 100% fit and I can’t quickly analyze your proposal and apply my point to it.
More on the meta level: “This sort of works, but not enough to solve it.”—do you mean “not enough” as in “good try but we probably need something else” or as in “this is a promising direction, just solve some tractable downstream problem”?