The relevant finding from the paper: neurons from a monkey’s inferior temporal cortex are about as vulnerable to being misled by worst-case, adversarially-selected input as are artificial neurons from a ResNet that’s been trained on adversarial examples. In fact, according to Figure 1 from the paper, the artificial neurons were actually more resistant to adversarial perturbations than the monkey’s neurons.
My cerebral cortex is more or less architecturally uniform; the cortical areas that deal with language have the same structure as the areas that deal with movement, with vision, and with abstract planning. Therefore, I estimate that my abstract concepts of “good”, “ethical”, and “utility-maximizing” are about as robust to adversarial perturbations as are my visual concepts of “tree”, “dog”, and “snake”.
Since monkey visual concepts of “dog” and “snake” are about as robust as those of adversarially-trained neural networks, and since I’m just a big hairless monkey, I bet a machine could develop concepts of “good”, “ethical”, and “utility-maximizing” that are just as robust as mine, if not more so.
In Figure A, researchers adversarially attack monkey neurons. They start with a picture of a dog and perturb it as much as they can (within a specified L2 bound) in order to make those individual monkey neurons fire as if the picture of a dog were actually a picture of a pressure gauge. The bottom-most pair of dog pictures are superstimuli: they drive those neurons' “pressure gauge” response harder than any actual picture of a pressure gauge does.
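For concreteness, here is a minimal sketch of that kind of attack as you would run it on an artificial network: projected gradient ascent on a single unit’s activation, constrained to an L2 ball around the original dog picture. This is not the paper’s exact procedure; `model`, `unit_index`, and the [0, 1] pixel range are placeholder assumptions.

```python
# A minimal sketch, assuming a PyTorch model whose output vector stands in for the
# recorded units; `model`, `unit_index`, and the pixel range are placeholders,
# not details taken from the paper.
import torch

def maximize_unit(model, image, unit_index, eps=10.0, steps=100, step_size=0.5):
    """L2-bounded projected gradient ascent on a single unit's activation.

    `image` is a [1, C, H, W] tensor; `eps` is the L2 radius of allowed perturbations.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Response of the target unit (the "pressure gauge" unit in the analogy).
        activation = model(image + delta)[0, unit_index]
        activation.backward()
        with torch.no_grad():
            g = delta.grad
            delta += step_size * g / (g.norm() + 1e-12)       # normalized ascent step
            if delta.norm() > eps:                            # project back into the L2 ball
                delta *= eps / delta.norm()
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (image + delta).detach()
```

The superstimulus claim is then just the observation that the perturbed dog image makes the target unit fire above its response to every real pressure-gauge image.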
However, that adversarial attack (presumably) only works on that specific monkey. To this hairless monkey, those quirky dog pictures look nothing like pressure gauges. To analogize back to morality: if a superintelligence optimized my sensory environment in order to maximize the firing rate of my brain’s “virtue detector” areas, the result would not look particularly virtuous to you, nor to a neural network that had been (adversarially) trained to detect “virtue”. To me, though, my surroundings would feel extremely virtuous.
Conclusion: robust value alignment is so hard, it doesn’t even work on humans.
(What about groups of humans? What if an optimizer optimized with respect to the median firing rate of the “virtue neurons” of a group of 100 people? Would people’s idiosyncratic exploits average out? What if the optimization target was instead the median of an ensemble of adversarially-trained artificial neural nets trained to detect virtue?)
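As a sketch of what the ensemble-median version could look like (all names hypothetical, and assuming each detector outputs a scalar score): wrap the detectors so that the optimizer only gets credit for the middle score, i.e. it has to fool at least half of them at once.

```python
# A speculative sketch of the parenthetical ensemble idea; "virtue detector" nets
# and their scalar outputs are assumptions, not anything from the paper.
import torch
import torch.nn as nn

class MedianEnsemble(nn.Module):
    """Returns the median of the member networks' scalar scores for an input."""
    def __init__(self, members):
        super().__init__()
        self.members = nn.ModuleList(members)

    def forward(self, x):
        scores = torch.stack([m(x) for m in self.members], dim=0)  # [N, batch, 1]
        return scores.median(dim=0).values  # gradient flows only to the median member
```

The same L2-bounded ascent loop from the earlier sketch could then be pointed at `MedianEnsemble([...])` instead of a single detector, which is one way to test whether idiosyncratic exploits average out.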
Wait, doesn’t this imply that value alignment is actually kinda easy?
Please say more. You may be onto something.
I agree with your reasoning.
A corollary would be that figuring out human values is not enough to make an AI safe. At least not if you look at the NN stack alone.