If you can split a network into two halves whose activations share little mutual information, you can understand one half before having understood the other. I suspect interpreting a model should be decomposed by hierarchically clustering neurons using such a measurement. And since the measurement is differentiable, you can also train a network for modularity to make this work better; a rough sketch is below.
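Here's a toy sketch of the kind of differentiable measurement I have in mind (my own illustration, nothing canonical): a Gaussian approximation of the mutual information between two groups of activations, which you could add as a training penalty to push the halves toward independence. The particular split and the weight `lam` are placeholders.

```python
import torch

def gaussian_mutual_information(acts_a, acts_b, eps=1e-4):
    """Differentiable MI estimate between two groups of activations,
    assuming they are jointly Gaussian:
    I(A;B) = 0.5 * (logdet(Cov_A) + logdet(Cov_B) - logdet(Cov_AB))."""
    x = torch.cat([acts_a, acts_b], dim=1)            # (batch, d_a + d_b)
    x = x - x.mean(dim=0, keepdim=True)               # center per-neuron
    cov = x.T @ x / (x.shape[0] - 1)                  # joint covariance
    cov = cov + eps * torch.eye(cov.shape[0])         # regularize for stability
    d_a = acts_a.shape[1]
    logdet_a = torch.logdet(cov[:d_a, :d_a])
    logdet_b = torch.logdet(cov[d_a:, d_a:])
    logdet_ab = torch.logdet(cov)
    return 0.5 * (logdet_a + logdet_b - logdet_ab)

# Usage (hypothetical): split a hidden layer at index k and penalize MI,
# so training encourages the two halves to become modular.
# loss = task_loss + lam * gaussian_mutual_information(hidden[:, :k], hidden[:, k:])
```

The same quantity, evaluated on pairs of candidate neuron groups, could serve as the distance for the hierarchical clustering.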
It sure is similar to feature visualization! I prefer it because it doesn’t go out of distribution and doesn’t feel like it implicitly assumes that the model implements a linear function.
I agree that interpretability is the purpose and the cure.