Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
Again, this seems less true the better your adversarial training setup is at identifying the test cases on which you're likely to be misaligned?
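To make the disagreement concrete, here is a minimal toy sketch (not anyone's actual training setup; every name and number below is illustrative) of the loop both lines above are gesturing at: a model that fits the loss well on a narrow range of cases, plus an adversarial-search step that tries to surface cases from a wider range where it is still misaligned.

```python
import random

def intended_behaviour(x):
    # Stand-in for the behaviour we actually want (hypothetical, for the toy):
    # linear on small inputs, quadratic on larger ones.
    return 2 * x if x < 1 else x ** 2

def model(params, x):
    # Toy "model": a single scalar weight applied linearly.
    return params * x

def squared_error(params, x):
    return (model(params, x) - intended_behaviour(x)) ** 2

def train(params, cases, lr=1e-3, steps=500):
    # Plain gradient descent on mean squared error over the current case set.
    for _ in range(steps):
        grad = sum(2 * (model(params, x) - intended_behaviour(x)) * x
                   for x in cases) / len(cases)
        params -= lr * grad
    return params

def adversarial_search(params, candidate_pool, k=5):
    # The crux of the argument lives here: how good is this step at surfacing
    # inputs on which the model is misaligned? This toy just scans a fixed
    # candidate pool; a weak search only ever covers a narrow range of cases.
    return sorted(candidate_pool, key=lambda x: squared_error(params, x),
                  reverse=True)[:k]

random.seed(0)
params = random.uniform(-1.0, 1.0)
train_cases = [random.uniform(0.0, 1.0) for _ in range(20)]    # narrow range
wider_pool = [random.uniform(0.0, 10.0) for _ in range(1000)]  # wider range

for _ in range(10):
    params = train(params, train_cases)
    # Fold the worst discovered failures back into the training set.
    train_cases += adversarial_search(params, wider_pool)

narrow_loss = sum(squared_error(params, x) for x in train_cases[:20]) / 20
wide_loss = sum(squared_error(params, x) for x in wider_pool) / len(wider_pool)
print(f"loss on narrow range: {narrow_loss:.3f}, on wider range: {wide_loss:.3f}")
```

The point of the toy is that everything turns on `adversarial_search`: if its candidate pool (or search method) never reaches the wider range, the model keeps looking locally aligned while staying globally off, which is the situation the first line describes; if the search does reliably surface the wider-range failures, the reply above carries more weight.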