The notion of a preferred (linear) transformation for interpretability has been called a “privileged basis” in the mechanistic interpretability literature. See for example Softmax Linear Units, where the idea is discussed at length.
In practice, the typical reason to expect a privileged basis is in fact SGD – or more precisely, the choice of architecture. Specifically, activation functions such as ReLU often privilege the standard basis. I would not generally expect the data or the initialization to privilege any basis beyond the start of the network or the start of training. The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached. The initialization is usually Gaussian and hence isotropic anyway, but if it did have a privileged basis I would also expect this to be quickly lost without some other reason to hold onto it.
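To make the ReLU point concrete, here is a quick numpy sketch (my own toy example; the dimensions and the random orthogonal matrix Q are arbitrary and not from the original discussion): a purely linear layer can absorb any rotation of its activations into the next layer's weights, whereas ReLU cannot, which is what singles out the standard basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)

# Random rotation Q via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

relu = lambda z: np.maximum(z, 0.0)

# Linear "network": rotating the hidden activations and absorbing Q into the
# next layer gives exactly the same function, so no basis is privileged.
linear_orig = W2 @ (W1 @ x)
linear_rot  = (W2 @ Q.T) @ (Q @ W1 @ x)
print(np.allclose(linear_orig, linear_rot))   # True

# With ReLU in between, the same re-parameterization changes the function,
# so the standard basis of the hidden layer is privileged.
relu_orig = W2 @ relu(W1 @ x)
relu_rot  = (W2 @ Q.T) @ relu(Q @W1 @ x)
print(np.allclose(relu_orig, relu_rot))       # False (generically)
```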
Yeah, I’m familiar with privileged bases. Once we generalize to a whole privileged coordinate system, the ReLUs are no longer enough.
Isotropy of the initialization distribution still applies, but the key is that we only get to pick one rotation for the parameters, and that same rotation has to be used for all data points. That constraint is baked into the framing when thinking about privileged bases, but it has to be derived when thinking about privileged coordinate systems.
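As a sketch of what that constraint looks like (again my own toy example, not from the thread): the Gaussian init is isotropic under any single fixed rotation Q of the weights, and that one Q, absorbed once into the parameters, has to reproduce the original function on every data point simultaneously; there is no separate rotation per example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # one fixed rotation

# Isotropy: Q applied to Gaussian rows leaves the distribution unchanged,
# so the init itself singles out no basis. Check via empirical covariances.
samples = rng.normal(size=(100_000, d))          # rows drawn like rows of W
rotated = samples @ Q.T
print(np.abs(np.cov(samples.T) - np.cov(rotated.T)).max())  # small

# One global re-parameterization must work for the whole dataset at once:
# the same Q is absorbed into both layers, and the identity holds for every
# row of X simultaneously, because the parameters are shared across data.
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
orig = X @ W1.T @ W2.T                           # W2 @ W1 @ x for every row x
rot  = X @ (Q @ W1).T @ (W2 @ Q.T).T             # same function after the global rotation
print(np.allclose(orig, rot))                    # True for all data points at once
```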
“The data may itself have a privileged basis, but this should be lost as soon as the first linear layer is reached.”
Not totally lost if the first layer is, say, a convolutional layer: the pixels within the convolutional window can get arbitrarily scrambled, but a convolutional layer cannot scramble things across different windows in different parts of the picture.
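A rough numpy sketch of that constraint (my own construction, using periodic boundaries for simplicity): the kernel can mix everything inside its window arbitrarily, but because the same kernel is applied at every spatial position, the layer commutes with shifts of the image and so cannot relocate information across the picture the way a dense layer on the flattened image could.

```python
import numpy as np

def circular_conv2d(img, kernel):
    """Cross-correlation with periodic boundary; same kernel at every position."""
    kh, kw = kernel.shape
    out = np.zeros_like(img)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * np.roll(img, shift=(-i, -j), axis=(0, 1))
    return out

rng = np.random.default_rng(2)
img = rng.normal(size=(16, 16))
kernel = rng.normal(size=(3, 3))             # arbitrary within-window mixing

shift = (5, 3)
a = circular_conv2d(np.roll(img, shift, axis=(0, 1)), kernel)
b = np.roll(circular_conv2d(img, kernel), shift, axis=(0, 1))
print(np.allclose(a, b))                     # True: shifting commutes with the conv

# By contrast, a generic dense layer acting on the flattened image can send
# any pixel anywhere, so no spatial relationship with the input need survive.
```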
Agreed. Likewise, in a transformer, the token dimension should maintain some relationship with the input and output tokens. This is sometimes taken for granted, but it is a good example of the data preferring a coordinate system. My remark that you quoted only really applies to the channel dimension, across which layers typically scramble everything.
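To illustrate the token-dimension point, here is a small numpy self-attention sketch (single head, no positional encoding or masking; my own toy example, not anyone's actual architecture): the same per-token maps are applied at every position and tokens interact only through the attention weights, so permuting the input tokens simply permutes the outputs in the same way. Position i in the activations therefore keeps its correspondence with input token i, while the channel axis gets freely scrambled by Wq/Wk/Wv.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over rows of X (tokens); no positions, no mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)        # softmax over the token axis
    return A @ V

rng = np.random.default_rng(3)
n_tok, d = 6, 8
X = rng.normal(size=(n_tok, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n_tok)
out_then_perm = self_attention(X, Wq, Wk, Wv)[perm]
perm_then_out = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out_then_perm, perm_then_out))   # True: the token axis is preserved
```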