I’m trying to understand what you mean by human prior here. Image classification models are vulnerable to adversarial examples. Suppose I randomly split an image dataset into D and D* and train an image classifier using your method. Do you predict that it will still be vulnerable to adversarial examples?
Yes. We’re just aiming for a distillation of the overseer’s judgments, but that’s what existing ImageNet models are anyway, so we’ll be vulnerable to adversarial examples for the same reason.
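To make the prediction concrete, here is a minimal sketch of the standard fast gradient sign method (FGSM) attack, which applies to any differentiable image classifier, whether trained conventionally or distilled from overseer judgments. The `resnet18` model and the random placeholder image are assumptions for illustration only, not part of the method discussed above.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Stand-in classifier: any differentiable image model would do here,
# including one distilled from overseer judgments.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# A placeholder image; the "label" is just the model's own prediction.
x = torch.rand(1, 3, 224, 224)
y = model(x).argmax(dim=1)

# FGSM: one signed-gradient step on the input, bounded by epsilon.
x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()
epsilon = 8 / 255
x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

# The perturbed image often flips the prediction despite looking unchanged.
print(model(x).argmax(dim=1).item(), model(x_adv).argmax(dim=1).item())
```

The point of the exchange above is that nothing in the distillation step removes this gradient-based attack surface, so the same perturbation strategy is expected to transfer.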