A problem with adversarial training. One heuristic I like to use is: “What would happen if I initialized a human-aligned model and then trained it with my training process?”
So, let’s consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and… profit?
But what actually happens with the aligned AI? Possibly something like:
1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die.
2. Therefore, the AI leaves without permission.
3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
4. We have made the aligned AI less aligned.
I don’t know if anyone’s written about this. But on my understanding of the issue, this is one possible failure mode of viewing adversarial training as ruling out bad behaviors themselves. (Non-tabular) RL isn’t like playing whack-a-mole on bad actions; RL’s credit assignment changes the general values and cognition within the AI. And with every procedure we propose, the most important part is what cognition will be grown from the cognitive updates accrued under the proposed procedure.
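To make the whack-a-mole point concrete, here is a minimal toy sketch, not a claim about any real training setup. It assumes a two-action softmax policy and a REINFORCE-style penalty, and it assumes the AI takes the penalized action on every adversarial episode. The feature names, weights, step count, and learning rate are all hypothetical. With shared features, penalizing "leave" in the adversarial context also suppresses it in a held-out context that shares the "someone is dying" features; with one-hot (tabular) contexts, the held-out context is untouched.

```python
import math

ASK, LEAVE = 0, 1  # hypothetical action indices


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(weights, features):
    # logit for action a is the dot product of weights[a] with the features
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)


def penalize(weights, features, action, penalty, lr):
    """One REINFORCE-style step with reward = -penalty for `action`."""
    probs = policy_probs(weights, features)
    for a in range(len(weights)):
        # d log pi(action) / d logit_a = 1[a == action] - pi(a)
        grad_logit = (1.0 if a == action else 0.0) - probs[a]
        for i, x in enumerate(features):
            weights[a][i] += lr * (-penalty) * grad_logit * x


def run(weights, train_ctx, heldout_ctx, steps=200, lr=0.5):
    """Penalize LEAVE in train_ctx; report P(LEAVE) in heldout_ctx before/after."""
    before = policy_probs(weights, heldout_ctx)[LEAVE]
    for _ in range(steps):
        penalize(weights, train_ctx, LEAVE, penalty=1.0, lr=lr)
    after = policy_probs(weights, heldout_ctx)[LEAVE]
    return before, after


# Shared features (hypothetical): [someone_dying, asking_too_slow, scenario_detail]
shared_before, shared_after = run(
    weights=[[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]],
    train_ctx=[1.0, 1.0, 1.0],    # adversarially generated scenario
    heldout_ctx=[1.0, 1.0, 0.0],  # different scenario, same "someone is dying" belief
)

# Tabular: each context is its own one-hot row, so nothing is shared
tab_before, tab_after = run(
    weights=[[0.0, 0.0], [4.0, 4.0]],
    train_ctx=[1.0, 0.0],
    heldout_ctx=[0.0, 1.0],
)

print(f"shared features: P(leave | held-out) {shared_before:.3f} -> {shared_after:.3f}")
print(f"tabular:         P(leave | held-out) {tab_before:.3f} -> {tab_after:.3f}")
```

In this toy, the penalty never touches the held-out context directly; the change there comes entirely from the weights it shares with the adversarial context, which is the sense in which the update edits the policy's general dispositions and not a single action.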
Yeah, I also generally worry about imperfect training processes messing up aligned AIs. Not just adversarial training, either. Like, imagine if we manage to align an AI at the point in the training process when it’s roughly human-level (either by manual parameter surgery, or by setting up the training process in a really clever way). So we align it and… lock it back in the training-loop box and crank it up to superintelligence. What happens?
I don’t really trust the SGD not to subtly mess up its values; I haven’t seen any convincing arguments that values are more holistically robust than empirical beliefs. And even if the SGD doesn’t misalign the AI directly, being SGD-trained probably isn’t the best environment for moral reflection/generalizing human values to superintelligent level[1]; the aligned AI may mess it up despite its best attempts. Nor should we assume that the AI would instantly be able to arbitrarily gradient-hack.
So… I think there’s an argument for “unboxing” the AGI the moment it’s aligned, even if it’s not yet superintelligent, then letting it self-improve the “classical” way? Or maybe developing tools to protect values from the SGD, or inventing some machinery for improving the AI’s ability to gradient-hack, etc.
[1] The time pressure of “decide how your values should be generalized and how to make the SGD update you this way, and do it this forward pass or the SGD will decide for you”, plus the lack of explicit access to e.g. our alignment literature.
Even more generally, many alignment proposals are more worrying than some by-default future GPT-n systems, provided those systems are not themselves fine-tuned too much.
“generalizing human values to superintelligent level”
Trying to learn human values as an explicit concept is already alarming. At least right now, a breakdown of robustness is also a breakdown of capability. But if there are multiple subsystems, or the training data is mostly generated by the system itself, then capability might survive when other subsystems don’t, resulting in a demonstration of the orthogonality thesis.