One possible answer: the data. We’ve implicitly assumed that we can apply arbitrary coordinate transformations to the data, but that doesn’t necessarily make sense. Something like a stream of text or an image does have a bunch of meaningful structure in it (e.g. the nearby-ness of two pixels in an image) which would be lost under arbitrary transformations. So one natural next step is to allow coordinate preference to be inherited from the data. On the other hand, we’d be importing our own knowledge of structure in the data; really, we’d prefer to only use the knowledge learned by the net.
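To gesture at why arbitrary coordinate transformations clash with structure in the data, here’s a minimal numpy sketch (toy sizes, not tied to any real dataset): a random orthogonal change of coordinates keeps all the information in an image while destroying pixel locality.

```python
import numpy as np

rng = np.random.default_rng(0)

img = rng.random((8, 8))        # toy 8x8 "image"
x = img.reshape(-1)             # flatten to a 64-dim vector

# A random orthogonal change of coordinates: invertible, so no information is lost...
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
y = Q @ x

# ...but each coordinate of y now mixes pixels from all over the image, so
# "adjacent coordinates" no longer means "adjacent pixels": the locality
# structure is gone even though the original data is fully recoverable.
print(np.allclose(Q.T @ y, x))  # True: Q is orthogonal, so the transform is reversible
```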
It’s worth remembering that we often import our own knowledge of the data when designing the nets too, e.g. convolutional layers for image processing respect exactly this kind of locality.
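As a tiny sanity check of that locality point (a made-up 1D example, not any particular architecture): a convolution’s output at one position only depends on nearby inputs, while a generic dense layer mixes everything.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)
kernel = rng.normal(size=3)
W_dense = rng.normal(size=(30, 32))

x_perturbed = x.copy()
x_perturbed[-1] += 10.0   # perturb one far-away input

conv = lambda v: np.convolve(v, kernel, mode="valid")

# Output position 0 of the convolution only sees inputs 0..2, so it ignores
# the perturbation at position 31; the dense layer's first output does not.
print(conv(x)[0] == conv(x_perturbed)[0])                 # True
print((W_dense @ x)[0] == (W_dense @ x_perturbed)[0])     # False
```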
Also, one piece of structure that will always be present in the data, regardless of whether it is images or text or whatever, is that it is separated into data points. So one could e.g. think about whether there’s a low-dimensional summary or tangent space for each data point that describes the network’s behavior on it in an interpretable way (though one difficulty with this is that networks are typically not robust, so even tiny changes to a data point could completely change the classification).
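One very rough sketch of what a per-data-point “tangent space” could look like (a toy tanh net with made-up sizes, purely to illustrate the shape of the idea, not a proposed method): take the Jacobian of the network’s output with respect to one input and look at its singular values; a sharp drop-off would suggest the network’s local behavior on that point is low-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 64)) / 8.0
W2 = rng.normal(size=(10, 32)) / 6.0

def net(x):
    return W2 @ np.tanh(W1 @ x)

def jacobian(x):
    # For net(x) = W2 tanh(W1 x): J = W2 diag(1 - tanh(W1 x)^2) W1, shape (10, 64).
    h = W1 @ x
    return W2 @ (np.diag(1.0 - np.tanh(h) ** 2) @ W1)

x0 = rng.normal(size=64)              # one "data point"
print(net(x0).shape)                  # (10,) -- the network's output on it
sv = np.linalg.svd(jacobian(x0), compute_uv=False)
print(np.round(sv, 3))                # singular values of the local linearization;
                                      # a sharp drop-off would mean a low-dim summary
```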
Yup, I’m ideally hoping for a framework which automatically rediscovers any architectural features like that. For instance, one reason I think the parameter-sensitivity thing is promising is that it can automatically highlight architectural sparsity patterns, e.g. the sort induced by convolutional layers.
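To illustrate one possible reading of that (my own toy framing of “parameter-sensitivity”, not necessarily the exact measure meant here): write a locally-connected, conv-style layer as a dense weight matrix times a fixed 0/1 mask. The gradient with respect to any masked-out weight is identically zero on every input, so per-parameter sensitivity statistics gathered over data reproduce the architectural sparsity pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, width = 12, 10, 3

# Fixed connectivity: each output unit only sees a local window of inputs
# (the kind of sparsity a convolutional / locally-connected layer hard-codes).
mask = np.zeros((n_out, n_in))
for i in range(n_out):
    mask[i, i:i + width] = 1.0

W = rng.normal(size=(n_out, n_in))

sens = np.zeros_like(W)
for _ in range(200):
    x = rng.normal(size=n_in)
    y = (mask * W) @ x
    sens += np.abs(np.outer(y, x) * mask)   # |d(0.5 ||y||^2) / dW| = |y x^T| * mask

# Nonzero sensitivities land exactly where the architecture allows a connection,
# so the sparsity pattern shows up in the sensitivities themselves.
print(np.array_equal(sens > 0, mask > 0))    # True (with probability ~1)
```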
I think one major challenge with convolutions is their translation symmetry. It’s not just an architectural sparsity pattern; the sparsity pattern is also tied together by a huge number of symmetries, since the same kernel weights are shared across positions. But automatically discovering those symmetries seems difficult in general.
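To make the symmetry point concrete (a toy 1D example): write a convolution as a dense matrix. The matrix is banded (the sparsity pattern), but on top of that every row is the same kernel shifted over by one position, and that weight tying is the extra structure that would also need to be discovered.

```python
import numpy as np

n, kernel = 8, np.array([1.0, -2.0, 1.0])
k = len(kernel)

# Dense-matrix form of a (valid) 1D convolution with that kernel.
W = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    W[i, i:i + k] = kernel

print(W)
# Beyond the banded sparsity pattern, every row is the previous row shifted
# one column to the right -- the translation symmetry / weight tying.
print(all(np.array_equal(W[i, 1:], W[i - 1, :-1]) for i in range(1, len(W))))  # True
```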
(And this gets even more difficult when the symmetries only make sense from a bigger-picture view, e.g. as I recall Chris Olah discovered 3D symmetries based on perspective, like a street going left vs. right, but they weren’t enforced architecturally.)