I’m concerned about Goodhart’s law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify it by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.
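To make the setup concrete, here is a minimal sketch of what “including the acceptability predicate in the loss function” might look like, assuming a PyTorch-style training loop; the model, the task, the weight `lam`, and the `acceptability_penalty` surrogate are all hypothetical stand-ins for illustration, not anything specified in this discussion:

```python
import torch
import torch.nn as nn

# Hypothetical toy model and task loss, purely for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
task_loss_fn = nn.MSELoss()

def acceptability_penalty(outputs: torch.Tensor) -> torch.Tensor:
    # Placeholder for a differentiable surrogate of the acceptability
    # predicate; here it just penalizes large outputs as a toy example.
    return outputs.pow(2).mean()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 0.1  # assumed weight on the acceptability term

for step in range(100):
    x = torch.randn(64, 16)   # stand-in training batch
    y = torch.randn(64, 1)
    outputs = model(x)
    # Combined objective: ordinary task loss plus the acceptability term,
    # so gradient descent is optimizing against the acceptability signal too.
    loss = task_loss_fn(outputs, y) + lam * acceptability_penalty(outputs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The worry above is exactly that optimizing against such a term only pushes the model to score well on the surrogate, which may or may not coincide with satisfying the intended predicate.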
Yep—that’s one of the main concerns. The idea, though, is that all you should have to deal with is a standard overfitting problem, since you don’t need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you—which I think is a solvable overfitting problem. Currently, my hope is that you can do that by using the acceptability signal to enforce an easy-to-verify condition that rules out deception, such as myopia.