Firstly, thank you for your comment. I’m always glad to have good faith engagement on this topic.
However, I think you’re assuming the worst-case scenario with regard to interpretability.
biological brains need to be “robust” not only to variations in the input, but also in a literal sense, to forceful impact
Artificial networks are often trained with randomness applied to their internal states (dropout, gradient noise, etc.). These seem like they’d cause more internal disruption than occasional impacts, and they’re FAR more common.
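For concreteness, here’s a minimal sketch of the kind of training-time internal randomness I mean: a toy network (the architecture, data, and noise scale are all made up for illustration) trained with dropout on its hidden activations plus Gaussian noise added to every gradient before the update.

```python
# Toy sketch: dropout + gradient noise, two common sources of internal randomness.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # randomly zeroes 10% of internal activations each step
    nn.Linear(64, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(128, 32), torch.randn(128, 1)
for step in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Gradient noise: perturb every gradient before the optimizer step.
    for p in model.parameters():
        p.grad += 0.01 * torch.randn_like(p.grad)
    opt.step()
```

Every single update is perturbed this way, which is a far more frequent source of internal disruption than the occasional bump a biological brain has to shrug off.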
or to parasites trying to control it
Evolved resistance to parasite control seems like it should decrease interpretability, if anything. E.g., having multiple, easily activated reward centers is a terrible idea for resisting parasite control. And yet, the brain does it anyway.
It intentionally has very bad suppressability
No it doesn’t. One of my own examples of brain interpretability was:
A recent paper was able to reversibly disable conditioned fear responses from mice.
Which is a type of suppressability. Various forms of amnesia, aphasia, and other perceptual blocks are further examples.
it generally prioritizes safety over performance
So more reliable ML systems will be more interpretable? Seems like a reason for optimism.
Imagine a programmer would write the shortest program that does a given task. That would be terrible. It would be impossible to change anything without essentially redesigning everything
Yes, it would be terrible. I think you’ve identified a major reason why larger neural nets learn more quickly than smaller nets. The entire point of training a neural net is that you constantly change each part of the net to better process the data. The internal representations of different circuits have to be flexible, robust, and legible to other circuits; otherwise, the circuits won’t be able to co-develop quickly.
One part of Knowledge Neurons in Pretrained Transformers I didn’t include (but probably should have) is the fact that transformers re-use their input embeddings in their internal knowledge representations. I.e., the feed-forward layers that push the network to output a target token represent that token using its input embedding. This would be very surprising if you thought the network was just trying to represent its internal states using the shortest possible encoding. However, it’s very much what you’d expect if you thought the internal representations needed to be easily legible across many different parts of the network.
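To make that concrete, here’s a rough sketch of how one might probe this sort of embedding re-use. It uses GPT-2 (which ties its input and output embeddings) rather than the BERT setup from the paper, and the layer index and number of units shown are arbitrary choices of mine: take the vectors a feed-forward layer writes into the residual stream and look up which input-embedding tokens they’re most similar to.

```python
# Rough sketch (not the paper's exact method): compare the vectors one
# feed-forward layer writes into the residual stream against GPT-2's input
# token embeddings, to see which tokens each hidden unit "points at".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

emb = model.transformer.wte.weight                   # (vocab_size, d_model) input embeddings
ff_out = model.transformer.h[10].mlp.c_proj.weight   # (d_ff, d_model): one row per hidden unit

# Cosine similarity between each hidden unit's output vector and every token embedding.
sims = torch.nn.functional.normalize(ff_out, dim=-1) @ \
       torch.nn.functional.normalize(emb, dim=-1).T  # (d_ff, vocab_size)

for unit in range(3):                                # arbitrary handful of units
    top = sims[unit].topk(5).indices.tolist()
    print(unit, tok.convert_ids_to_tokens(top))
```

If the network were using some maximally compressed private code, there’d be no reason for these feed-forward output vectors to line up with the input embedding space at all.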
I’ll present a more extensive exploration of the “inner interpretability problem” in a future post, arguing that there are deep theoretical reasons why easy-to-optimize systems also tend to be interpretable, and that highly optimized neural nets will learn to make themselves more interpretable.