evhub comments on Relaxed adversarial training for inner alignment

evhub 9 Jun 2020 22:13 UTC
LW: 2 AF: 1
AF
Yep, that’s one of the main concerns. The idea, though, is that all you have to deal with should be a standard overfitting problem, since you don’t need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you, and I think that is a solvable overfitting problem. Currently, my hope is that you can solve it by using the acceptability signal to enforce an easy-to-verify condition that rules out deception, such as myopia.