Since my model is more accurate, ~10 times out of 11 the input will correspond to an “adversarial” attack on your model.
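To spell out where the ~10/11 figure could come from (a sketch, assuming the two models' errors are independent and that your model's error rate is about ten times mine; neither ratio is stated explicitly above):

$$
P(\text{your model errs} \mid \text{exactly one of our models errs}) \;=\; \frac{10p\,(1-p)}{10p\,(1-p) + p\,(1-10p)} \;\approx\; \frac{10}{11} \quad \text{for small } p,
$$

where $p$ is my model's error rate and $10p$ is yours.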
This argument (or the uncorrelation assumption) proves too much. A perfect cat detector performs better than one that also classifies close-ups of the sun as cats. Yet close-ups of the sun do not qualify as adversarial examples, as they are far from any likely starting image.
I’m not sure what you’re trying to say. I’m using a broad definition of “adversarial example”: if an attacker can deliberately make a classifier misclassify an input, that input is an adversarial example. So if an adversary can find a close-up of the sun that your cat detector calls a cat, that’s an adversarial example by my definition. I think this is similar to the definition Ian Goodfellow uses.
Edit:
An adversarial example is an input to a machine learning model that is intentionally designed by an attacker to fool the model into producing an incorrect output.
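For concreteness, here is a minimal sketch (not from the original discussion) of what “deliberately making a classifier misclassify” can look like in practice, using the fast gradient sign method from Goodfellow et al.; the toy two-class “cat detector”, the random input, and the epsilon value are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon=0.03):
    """Perturb x by one signed-gradient step so the model is pushed away
    from `label` (the fast gradient sign method)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Move each pixel by +/- epsilon in whichever direction increases the loss.
    return (x + epsilon * x.grad.sign()).detach()

# Illustrative toy "cat detector": class 0 = cat, class 1 = not-cat.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
x = torch.rand(1, 3, 32, 32)   # a stand-in for an input image
label = torch.tensor([0])      # treat "cat" as the label the attacker wants to break
x_adv = fgsm_attack(model, x, label)
print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # predictions may now differ
```

The perturbation is bounded by epsilon, so the adversarial input stays close to the starting image even though the model’s output can flip.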