Relatedly, you bring up adversarial examples in a way that suggests you think of them as defects of a primitive optimization paradigm. But it turns out that adversarial examples often correspond to predictively useful features that the network is actively using for classification, despite those features not being robust to pixel-level perturbations that humans don’t notice. I guess you could characterize those features as “weird squiggles” from our perspective, but the etiology of the squiggles presents a much more optimistic story about fixing the problem with adversarial training than if you thought squiggles were an inevitable consequence of using conventional ML techniques.
Train two distinct classifier neural nets on an image dataset. Set aside one as the “reference net”; the other will be the “target net”. Now perturb the images so that they look the same to humans and also get classified the same by the reference net. So presumably both the features humans use to classify and the squiggly features that neural nets use should be mostly unchanged. Under these constraints on the perturbation, I bet that it will still be possible to perturb images to produce adversarial examples for the target net.
Literally. I will bet money that I can still produce adversarial examples under such constraints if anyone wants to take me up on it.
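Concretely, here’s a minimal sketch of the kind of attack I have in mind, assuming PyTorch. Everything named here is a placeholder rather than a fixed protocol: `target_net`, `reference_net`, the L-infinity budget `eps`, and the penalty weight `lam` are all assumptions, and the “reference net classifies it the same” constraint is enforced only softly via a penalty term instead of as a hard filter.

```python
# Hypothetical sketch of the proposed bet (PyTorch). `target_net`,
# `reference_net`, `eps`, `alpha`, `steps`, and `lam` are assumed
# placeholders, not a fixed experimental protocol.
import torch
import torch.nn.functional as F

def constrained_adversarial_example(x, y, target_net, reference_net,
                                    eps=8/255, alpha=2/255, steps=40, lam=10.0):
    """PGD-style attack on target_net that (a) stays inside an L-infinity
    ball of radius eps (a crude proxy for 'looks the same to humans') and
    (b) is penalized for changing reference_net's original prediction."""
    target_net.eval()
    reference_net.eval()
    with torch.no_grad():
        ref_label = reference_net(x).argmax(dim=1)  # label we want to preserve

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        # Push the target net's prediction away from the true label...
        attack_loss = F.cross_entropy(target_net(x_adv), y)
        # ...while keeping the reference net's original prediction intact.
        consistency_loss = F.cross_entropy(reference_net(x_adv), ref_label)
        loss = attack_loss - lam * consistency_loss
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient ascent step
            delta.clamp_(-eps, eps)             # imperceptibility constraint
            delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)
```

A stricter version of the bet would simply discard any perturbation that flips the reference net’s label rather than penalizing it; the soft penalty is just easier to optimize with gradient steps, and either version captures the constraint I’m describing.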