I don’t think the analogy to biological brains is quite as strong as you suggest. For example, biological brains need to be “robust” not only to variations in the input, but also in a literal sense, to forceful impact or to parasites trying to control it. It intentionally has very bad suppressibility, and this means there needs to be a lot of redundancy, which is what makes “just stick an electrode in that area” work. More generally, the brain is under many constraints that an ML system isn’t, probably too many for us to think of, and it generally prioritizes safety over performance. Both lead away from the sort of maximally efficient compression that makes ML systems hard to interpret.
Analogously: Imagine a programmer wrote the shortest possible program that does a given task. That would be terrible. It would be impossible to change anything without essentially redesigning everything; trying to understand what it does just from reading the code would be very hard, and giving a compressed explanation of how it does it would be impossible. In practice, we don’t write code like that, because we face constraints like those mentioned above, but it’s very easy to imagine that some optimization-based “automatic coder” would program like that. Indeed, on the occasions when we really need to optimize runtimes, we move in that direction ourselves.
So I don’t think brains tell us much about the interpretability of standard, highly optimized neural nets.
Firstly, thank you for your comment. I’m always glad to have good-faith engagement on this topic.
However, I think you’re assuming the worst-case scenario with regard to interpretability.
biological brains need to be “robust” not only to variations in the input, but also in a literal sense, to forceful impact
Artificial networks are often trained with randomness applied to their internal states (dropout, gradient noise, etc.). These seem like they’d cause more internal disruption (and are FAR more common) than occasional impacts.
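To make that concrete, here’s a minimal PyTorch sketch (my own illustration, not from any particular codebase) of the two kinds of internal randomness mentioned above: dropout zeroes a random half of the hidden units on every forward pass, and gradient noise perturbs every parameter update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # knocks out a random half of the hidden units each step
    nn.Linear(64, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(16, 32), torch.randn(16, 1)
for step in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Gradient noise: add a small Gaussian perturbation to every gradient.
    for p in model.parameters():
        p.grad += 0.01 * torch.randn_like(p.grad)
    opt.step()
```

Whatever representations the network learns have to keep working while random pieces of it go missing on every single training step.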
or to parasites trying to control it
Evolved resistance to parasite control seems like it should decrease interpretability, if anything. E.g., having multiple reward centers that are easily activated is a terrible idea for resisting parasite control. And yet, the brain does it anyways.
It intentionally has very bad suppressibility
No it doesn’t. One of my own examples of brain interpretability was:
A recent paper was able to reversibly disable conditioned fear responses in mice.
Which is a type of suppressibility. Various forms of amnesia, aphasia, and other perceptual blocks are further examples of suppressibility in one form or another.
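And the artificial analogue of that kind of reversible suppression is routine. Here’s a minimal PyTorch sketch (my own illustration, not the method from the mouse paper): a forward hook zeroes out one layer’s contribution, and removing the hook restores the original behavior exactly.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(3, 8)
baseline = model(x)

# Reversibly "disable" the hidden layer by replacing its output with zeros.
handle = model[1].register_forward_hook(lambda mod, inp, out: torch.zeros_like(out))
suppressed = model(x)

# Remove the hook and the original behavior comes back unchanged.
handle.remove()
assert torch.allclose(model(x), baseline)
```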
it generally prioritizes safety over performance
So more reliable ML systems will be more interpretable? Seems like a reason for optimism.
Imagine a programmer wrote the shortest possible program that does a given task. That would be terrible. It would be impossible to change anything without essentially redesigning everything
Yes, it would be terrible. I think you’ve identified a major reason why larger neural nets learn more quickly than smaller ones. The entire point of training a neural net is that you constantly change each part of the net to better process the data. The internal representations of different circuits have to be flexible, robust, and interpretable to the other circuits; otherwise, the circuits won’t be able to co-develop quickly.
One part of Knowledge Neurons in Pretrained Transformers I didn’t include (but probably should have) is the fact that transformers re-use their input embeddings in their internal knowledge representations. I.e., the feed-forward layers that push the network to output a given target token represent that token using its input embedding. This would be very surprising if you thought the network was just trying to represent its internal states using the shortest possible encoding. However, it’s very much what you’d expect if the internal representations needed to be easily legible across many different parts of the network.
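For the curious, here’s roughly what that looks like in code (a sketch using GPT-2’s tied embeddings; the layer and row index are arbitrary, and this is not the paper’s exact procedure): take one “value” vector from a feed-forward output projection and score it against every input embedding. If the layer really does re-use the embedding space, the top-scoring tokens should be readable.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

with torch.no_grad():
    E = model.transformer.wte.weight                   # (vocab, d_model) input embeddings
    values = model.transformer.h[8].mlp.c_proj.weight  # (d_ff, d_model) FFN "value" vectors
    value_vec = values[42]                             # one value vector (row chosen arbitrarily)

    scores = value_vec @ E.T                           # compare against every input embedding
    top = torch.topk(scores, 5).indices
    print([tok.decode([int(i)]) for i in top])
```

Whatever you make of any single projection like this, the comparison is being done in the embedding basis, which is exactly the kind of shared, legible representation I’m describing.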
I’ll present a more extensive exploration of the “inner interpretability problem” in a future post, arguing that there are deep theoretical reasons why easy-to-optimize systems also tend to be interpretable, and that highly optimized neural nets will learn to make themselves more interpretable.