Why do you think adversarial examples seem to behave this way? The pig equation seems equally compatible with recognizing fur or no fur, wings or no wings. Indeed, it plausibly classifies the pig as an airliner because it sees wings and no fur.
Then it has a wrong view of wings and fur (as well as a wrong view of pigs). The more features it has to get right, the harder the adversarial example is to construct; it's not just moving linearly in a single direction.
Surely, the adversary convinces it this is an airliner by convincing it that it has wings and no fur? I don't have experience with how it works on the inside, but if the adversary can magically intervene on each neuron, changing its output by d at a cost of d² effort, then the proper strategy is to intervene on many features a little. If there are many layers, the penultimate layer, which contains high-level concepts such as fur or wings, would then be almost as fooled as the output layer, and indeed I would expect the adversary to have more trouble fooling it on low-level features such as edges and dots.
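For what it's worth, the "many small nudges" conclusion follows directly from the stipulated cost model. A minimal sketch, assuming the comment's d² effort rule and that per-neuron shifts add linearly (both are assumptions of the thought experiment, not facts about real networks):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Assumed cost model from the comment: shifting neuron $i$'s output by $d_i$
% costs $d_i^2$, and the adversary needs the shifts to sum to a total of $\Delta$.
\[
  \min_{d_1,\dots,d_N} \; \sum_{i=1}^{N} d_i^2
  \quad\text{subject to}\quad \sum_{i=1}^{N} d_i = \Delta .
\]
% Lagrange multipliers (or Cauchy--Schwarz) give the symmetric optimum:
\[
  d_i = \frac{\Delta}{N},
  \qquad
  \text{total cost} \;=\; N\left(\frac{\Delta}{N}\right)^{2} \;=\; \frac{\Delta^2}{N}.
\]
\end{document}
```

The cost of a fixed total shift falls as 1/N, so under this model the adversary always prefers spreading its effort across as many neurons as possible, which is the comment's point about intervening on many features a little.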