[Link] Adversarially trained neural representations may already be as robust as corresponding biological neural representations
Abstract:
Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking the neural representations that underlie those systems will yield artificial visual systems that are adversarially robust. In this work, we develop a method for performing adversarial visual attacks directly on primate brain activity. We then leverage this method to demonstrate that the above-mentioned belief might not be well founded. Specifically, we report that the biological neurons that make up visual systems of primates exhibit susceptibility to adversarial perturbations that is comparable in magnitude to existing (robustly trained) artificial neural networks.
Good find. This is a pretty involved experiment. Also, am I interpreting this right: are there some monkey neurons that are particularly strongly stimulated by images of pressure gauges? I wonder if their equivalent in us guided the design of the gauges.
Yeah, I wondered for a while whether they worked with real neurons or only with models of neurons or something, and read on until they explained it in the methods section.
I guess the monkeys don’t see in it what we see in it. Maybe it looks like a strange banana to them. Neurons don’t represent words, just concept spaces.
There’s an important caveat here:
I’d be willing to bet that if you give the macaque more than 100 ms, they’ll get it right; that’s at least how it is for humans!
(Not trying to shift the goalpost, it’s a cool result! Just pointing at the next step.)
Somewhat of a corollary: if there are adversarial pictures for current DNNs, there will also be (somewhat different) adversarial pictures for humans.
I don’t think the examples shown are good ones, because they are for monkeys, but I guess we will soon see ones for us. If we are lucky, they are like the dress. If we are unlucky, they look like something beautiful but act like a scissor statement.
I think they won’t be like the dress, but more like an image that looks like a different one if you only see it for a fraction of a second. I think neural nets are closer to a human’s split-second judgement than to their considered judgement.
Wait, doesn’t this imply that value alignment is actually kinda easy?
Please say more. You may be onto something.
The relevant finding from the paper: neurons from a monkey’s inferior temporal cortex are about as vulnerable to being misled by worst-case, adversarially-selected input as are artificial neurons from a ResNet that’s been trained on adversarial examples. In fact, according to Figure 1 from the paper, the artificial neurons were actually more resistant to adversarial perturbations than the monkey’s neurons.
My cerebral cortex is more or less architecturally uniform; the cortical areas that deal with language have the same structure as the areas that deal with movement, with vision, and with abstract planning. Therefore, I estimate that my abstract concepts of “good”, “ethical”, and “utility-maximizing” are about as robust to adversarial perturbations as are my visual concepts of “tree”, “dog”, and “snake”.
Since monkey visual concepts of “dog” and “snake” are about as robust as those of adversarially-trained neural networks, and since I’m just a big hairless monkey, I bet a machine could develop concepts of “good”, “ethical”, and “utility-maximizing” that are just as robust as mine, if not more so.
In Figure A, the researchers adversarially attack monkey neurons. They start with a picture of a dog and perturb it as much as they can (within a specified L2 bound) in order to make individual monkey neurons fire as if that picture of a dog were actually a picture of a pressure gauge. The bottom-most pair of dog pictures are superstimuli: they make those neurons respond as if the image were more pressure-gauge-like than any actual picture of a pressure gauge.
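For concreteness, here is a minimal sketch of what such an L2-bounded attack looks like in code, assuming you have a differentiable stand-in for the recorded neurons (you can’t backpropagate through a monkey, so the real experiment has to go through a model of the neural responses). The surrogate network, the target unit index, and every hyperparameter below are hypothetical placeholders, not the paper’s actual setup:

```python
# Hedged sketch: projected gradient ascent under an L2 budget, pushing one
# "target unit" of a surrogate network to fire as if it saw a pressure gauge.
import torch
import torchvision.models as models

surrogate = models.resnet50(weights="IMAGENET1K_V1").eval()
for p in surrogate.parameters():
    p.requires_grad_(False)

target_unit = 700   # hypothetical unit standing in for the "pressure gauge" neurons
epsilon = 10.0      # L2 perturbation budget in pixel space (made-up scale)
step_size = 0.5
num_steps = 100

dog = torch.rand(1, 3, 224, 224)                   # placeholder for the starting dog image
delta = torch.zeros_like(dog, requires_grad=True)  # the perturbation we optimize

for _ in range(num_steps):
    activation = surrogate(dog + delta)[0, target_unit]
    activation.backward()
    with torch.no_grad():
        # Normalized gradient ascent step: push the target unit's activation up.
        delta += step_size * delta.grad / (delta.grad.norm() + 1e-12)
        # Project back onto the L2 ball of radius epsilon around the original image.
        norm = delta.norm()
        if norm > epsilon:
            delta *= epsilon / norm
        # Keep the perturbed image inside the valid pixel range.
        delta.copy_(torch.clamp(dog + delta, 0.0, 1.0) - dog)
    delta.grad.zero_()

adversarial_dog = (dog + delta).detach()  # a dog the target unit reads as gauge-like
```

Roughly, the difference between the bounded adversarial images and the superstimuli is how large you let epsilon get: drop the projection step (or make the budget huge) and the optimizer is free to produce an image that drives the unit harder than any real pressure gauge does.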
However, that adversarial attack (presumably) only works on that specific monkey. To this hairless monkey, those quirky dog pictures look nothing like pressure gauges. To analogize back to morality: if a superintelligence optimized my sensory environment in order to maximize the firing rate of my brain’s “virtue detector” areas, the result would not look particularly virtuous to you, nor to a neural network that was trained (adversarially) to detect “virtue”. To me, though, my surroundings would feel extremely virtuous.
Conclusion: robust value alignment is so hard, it doesn’t even work on humans.
(What about groups of humans? What if an optimizer optimized with respect to the median firing rate of the “virtue neurons” of a group of 100 people? Would people’s idiosyncratic exploits average out? What if the optimization target was instead the median of an ensemble of adversarially-trained artificial neural nets trained to detect virtue?)
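The artificial-ensemble version of that question is at least easy to sketch: take gradient steps on the median score of several independently trained detectors, so an exploit has to fool at least half of them at once rather than one idiosyncratic model. Everything below, from the toy detector architecture to the “virtue score” output and the hyperparameters, is a made-up illustration, not anything from the paper:

```python
# Hedged sketch: optimize an input against the median output of an ensemble
# of stand-in "virtue detectors" instead of a single model.
import torch
import torch.nn as nn

def make_detector():
    # Toy stand-in for one (hypothetically adversarially trained) detector.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128),
                         nn.ReLU(), nn.Linear(128, 1))

ensemble = [make_detector().eval() for _ in range(5)]

x = torch.rand(1, 3, 64, 64, requires_grad=True)  # candidate "sensory environment"
optimizer = torch.optim.Adam([x], lr=0.01)

for _ in range(200):
    scores = torch.stack([d(x).squeeze() for d in ensemble])
    # Maximize the median score: an exploit now has to raise at least half of
    # the detectors' outputs at once, not just one model's quirky blind spot.
    loss = -scores.median()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0.0, 1.0)  # stay in the valid pixel range
```

Whether the median actually averages out idiosyncratic exploits, or the optimizer just finds a perturbation that transfers across the whole ensemble, is exactly the open question.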
I agree with your reasoning.
A corollary would be that figuring out human values is not enough to make a system safe. At least not if you look at the NN stack alone.