Gurkenglas comments on Models, myths, dreams, and Cheshire cat grins

Gurkenglas 25 Jun 2020 21:30 UTC
2 points
Surely the adversary convinces it that this is a pig by convincing it that the image has fur and no wings? I don’t have experience with how it works on the inside, but if the adversary can magically intervene on each neuron, changing its output by d at a cost of d² effort, then the proper strategy is to intervene a little on many features. Then, if there are many layers, the penultimate layer, which contains such high-level concepts as fur or wings, would be almost as fooled as the output layer, and indeed I would expect the adversary to have more trouble fooling it on such low-level features as edges and dots.
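
To make the "many features a little" point concrete, here is a minimal sketch of the toy cost model above. It assumes, purely for illustration, that the adversary has a fixed effort budget, splits it equally across n neurons, and that the per-neuron changes simply add up downstream. The function names `per_neuron_change` and `total_shift` are hypothetical, and real networks are of course not this linear.

```python
import math


def per_neuron_change(budget: float, n: int) -> float:
    """Equal change d on each of n neurons such that n * d**2 == budget."""
    return math.sqrt(budget / n)


def total_shift(budget: float, n: int) -> float:
    """Summed change n * d, assuming (hypothetically) the changes add linearly."""
    return n * per_neuron_change(budget, n)


if __name__ == "__main__":
    # Same effort budget, spread over more and more neurons.
    for n in (1, 4, 100):
        d = per_neuron_change(1.0, n)
        print(f"n={n:>3}  change per neuron={d:.3f}  total shift={total_shift(1.0, n):.3f}")
```

Under these assumptions the total shift is sqrt(n · budget), so the same effort spread over 100 neurons moves the summed signal about ten times as far as spending it all on one neuron, while each individual neuron is only nudged by a tenth as much.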