DanielFilan comments on Mechanistic Transparency for Machine Learning

DanielFilan 11 Jul 2018 19:36 UTC
LW: 5 AF: 3
Thoughts on challenge 2:
- ‘Smaller’ functions will probably be more human-interpretable, just because they do less, are easier to analyse, and have less weird stuff going on. I think this implies that as you ‘double-click’ on high-level primitives to expand them into their lower-level components, what you see gets more and more interpretable.
- It’s plausible to me that there’s some mathematical theory of how to get things that are human-interpretable enough for our purposes.
- It’s also plausible to me that by trying enough things, you could find a method that seems sort of human-interpretable, see what properties it actually has, and check whether you can use those.
- There might be synergies with interpretability techniques like neuron visualisation that give you a sense of the input-output behaviour without telling you much about the internal mechanisms.
- If a neural network is well-trained, it’s easier to visualise what each neuron does, because intuitively the neurons need to do sensible things for the outputs to be sensible. You could hope that a similar property holds for high-level primitives if those primitives are constructed sensibly out of neurons.