A common justification I hear for adversarial training is robustness to underspecification/slight misspecification. For example, we might not know what parameters the real world has, and our best guesses are probably wrong. We want our AI to perform acceptably regardless of what parameters the world has, so we perform adversarial domain randomization. This falls under 1. as written, insofar as we can say that the AI learnt a stupid heuristic (e.g. a policy that depends on the specific parameters of the environment) and we want it to learn something more clever (infer the parameters of the world, then do the optimal thing under those parameters). We want a model whose performance is invariant to (or at least robust to) the underspecified parameters, but by default our models don't know which parameters are underspecified. Adversarial training helps insofar as it teaches our models which things they should not pay attention to.
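For concreteness, here is a toy sketch of the adversarial domain randomization loop I have in mind. None of this is from a real codebase: the 1-D system `x' = a*x + u` with unknown gain `a`, the candidate parameter grid, and the finite-difference learner are all stand-ins for illustration. The structure is what matters: an inner adversary picks the environment parameter the current policy handles worst, and the outer loop trains against that choice.

```python
import numpy as np

# Our guesses for the unknown environment gain `a`; the true value is underspecified.
CANDIDATE_PARAMS = np.linspace(0.5, 2.0, 8)

def rollout_cost(k, a, horizon=50, dt=0.1, x0=1.0):
    """Quadratic cost of running the policy u = -k*x in the world x' = a*x + u."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost += dt * (x * x + u * u)
        x += dt * (a * x + u)
    return cost

def adversarial_domain_randomization(k=0.0, lr=0.02, rounds=300, fd_eps=1e-3):
    """Train the policy gain k against whichever candidate parameter it currently handles worst."""
    for _ in range(rounds):
        # Adversary: pick the environment parameter the current policy performs worst on.
        worst_a = max(CANDIDATE_PARAMS, key=lambda a: rollout_cost(k, a))
        # Learner: finite-difference gradient on that worst-case world,
        # taking a sign (normalized) step to keep the toy numerically stable.
        grad = (rollout_cost(k + fd_eps, worst_a) - rollout_cost(k - fd_eps, worst_a)) / (2 * fd_eps)
        k -= lr * np.sign(grad)
    return k

if __name__ == "__main__":
    k = adversarial_domain_randomization()
    worst = max(rollout_cost(k, a) for a in CANDIDATE_PARAMS)
    print(f"learned gain k = {k:.2f}, worst-case cost = {worst:.2f}")
```

The load-bearing part is the inner `max`: it's the adversary, not us, that figures out which setting of the underspecified parameter the current policy is secretly depending on.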
In the image classification example you gave for 1., I’d frame it as:
* The training data doesn’t disambiguate between “the intended classification boundary is whether or not the image contains a dog” and “the intended classification boundary is whether or not the image contains a high frequency artifact that results from how the researchers preprocessed their dog images”, as the two features are extremely correlated on the training data. The training data has fixed a parameter that we wanted to leave unspecified (which image preprocessing we did, or equivalently, which high frequency artifacts we left in the data).
* Over the course of normal (non-adversarial) training, we learn a model that depends on the specific settings of the parameters that we wanted to leave unspecified (that is, the specific artifacts).
* We can grant an adversary the ability to generate images that break the high frequency artifacts while preserving the ground truth of whether the image contains a dog. For example, the Lp norm-ball attacks often studied in the adversarial image classification literature.
* By training the model on data generated by said adversary, the model can learn that the specific pattern of high frequency artifacts doesn't matter, and can spend more of its capacity paying attention to whether or not the image contains dog features (see the sketch after this list).
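To make the last two bullets concrete, here is a minimal sketch of that kind of adversarial training loop, assuming a PyTorch image classifier and an L∞ norm-ball adversary implemented with projected gradient descent (PGD). The model, the attack budget `eps`, and the step sizes are illustrative stand-ins rather than anyone's actual setup:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8/255, step=2/255, n_steps=10):
    """Search the L-infinity ball of radius eps for a perturbation that maximizes
    the classification loss. By construction the perturbation is small enough to
    leave the 'does this image contain a dog' ground truth unchanged, but it is
    free to destroy any brittle high frequency artifacts the model relies on."""
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(images + delta), labels)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)             # stay inside the norm ball
            delta.grad.zero_()
    return (images + delta).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, images, labels):
    """One training step on adversarially perturbed images: any feature the
    adversary can break stops being useful, so capacity shifts toward features
    it cannot break (i.e. actual dog-ness)."""
    adv_images = pgd_attack(model, images, labels)
    optimizer.zero_grad()  # discard gradients accumulated while crafting the attack
    loss = F.cross_entropy(model(adv_images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The division of labour mirrors the bullets: the adversary varies exactly the thing we wanted to leave unspecified (the high frequency details) while holding the dog/no-dog label fixed, and training on its outputs makes relying on that thing unprofitable.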