Excited to see this work come out!

One core confusion I have: Transformers apply a LayerNorm every time they read from the residual stream, which scales the vector to have unit norm (ish). If features are represented as directions, this is totally fine—it’s the same feature, just rescaled. But if they’re polytopes, and this scaling throws a vector into a different polytope, that’s totally broken. And, importantly, the scaling factor is a global property of all the features currently represented by the model, and so is likely pretty hard to control. Shouldn’t this create strong regularisation favouring meaningful directions over meaningful polytopes?
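A minimal toy sketch of the worry (my own, in PyTorch, with arbitrary made-up dimensions and weights, nothing from the post): reading off a direction only depends on the sign of a projection, which survives rescaling, whereas a ReLU activation pattern with biases (i.e. which polytope the vector sits in) can flip as the overall scale changes.

```python
import torch

torch.manual_seed(0)
d_model, d_mlp = 16, 64

x = torch.randn(d_model)            # a residual-stream vector
scales = [0.25, 1.0, 4.0]           # stand-ins for different LayerNorm scaling factors

# Reading a *direction*: the sign of the projection onto a (bias-free) probe
# direction is unchanged by rescaling the whole vector.
probe = torch.randn(d_model)
print([torch.sign(probe @ (s * x)).item() for s in scales])   # identical for every scale

# Reading a *polytope*: the ReLU activation pattern of an MLP with biases can
# change with the overall scale, i.e. the rescaled point lands in a different polytope.
W, b = torch.randn(d_mlp, d_model), torch.randn(d_mlp)
patterns = [(W @ (s * x) + b) > 0 for s in scales]
print([int((p != patterns[1]).sum()) for p in patterns])      # ReLUs flipped relative to scale 1.0
```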
I think at least some GPT2 models have a really high-magnitude direction in their residual stream that might be used to preserve some scale information after LayerNorm. [I think Adam Scherlis originally mentioned or showed the direction to me, but maybe someone else?]. It’s maybe akin to the water-droplet artifacts in StyleGAN touched on here: https://arxiv.org/pdf/1912.04958.pdf
We begin by observing that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. As shown in Figure 1, even when the droplet may not be obvious in the final image, it is present in the intermediate feature maps of the generator. The anomaly starts to appear around 64×64 resolution, is present in all feature maps, and becomes progressively stronger at higher resolutions. The existence of such a consistent artifact is puzzling, as the discriminator should be able to detect it. We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other. We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere. Our hypothesis is supported by the finding that when the normalization step is removed from the generator, as detailed below, the droplet artifacts disappear completely.
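A rough sketch (mine, assuming the TransformerLens library; the prompt and layer are arbitrary illustrative choices) of how one might look for such a high-magnitude residual-stream direction in GPT-2, by checking whether a few residual dimensions dominate the activation magnitudes:

```python
from transformer_lens import HookedTransformer  # assumed dependency, not mentioned in the thread

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
_, cache = model.run_with_cache(tokens)

resid = cache["resid_post", 6][0]       # [n_pos, d_model] residual stream after block 6 (arbitrary layer)
mean_abs = resid.abs().mean(dim=0)      # average magnitude of each residual dimension
top_vals, top_dims = mean_abs.topk(5)
print(top_dims, top_vals)               # do a handful of dimensions dominate?
print(mean_abs.median())                # compared with the typical dimension
```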
Interesting, thanks! Like, this lets the model somewhat localise the scaling effect, so there’s not a ton of interference? This seems maybe linked to the results on Emergent Features in the residual stream.
Thanks for your interest!

Shouldn’t this create strong regularisation favouring meaningful directions over meaningful polytopes?
Yes, that seems reasonable!
One thing we want to emphasize is that it’s perfectly possible to have both meaningful directions and meaningful polytopes. For instance, if all polytope boundaries intersect the origin, then all polytopes will be unbounded. In that case, polytopes will essentially be directions!
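A minimal sketch of the “boundaries through the origin” case (my own toy example in PyTorch, not from the post): with no biases, every ReLU boundary passes through the origin, so rescaling a vector never changes its activation pattern, and each polytope is just a cone of directions.

```python
import torch

torch.manual_seed(1)
d_in, d_hidden = 8, 32
W = torch.randn(d_hidden, d_in)   # no bias: every boundary {v : W_i . v = 0} passes through the origin

def polytope_id(v):
    """ReLU activation pattern, which labels the polytope v falls in."""
    return W @ v > 0

x = torch.randn(d_in)
# With origin-crossing boundaries, rescaling never changes the activation pattern:
# each polytope is a cone, i.e. effectively a set of directions.
print(all(torch.equal(polytope_id(s * x), polytope_id(x)) for s in [0.01, 1.0, 100.0]))   # True
```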
The polytope lens only becomes relevant when trying to explain what perfectly linear models can’t account for. Although LN might create a bias toward directions, each layer is still nonlinear; nonlinearities probably still need to be accounted for somewhere in our explanations.
All this said, we haven’t thought a lot about LN in this context. It’d be great to know if this regularisation is real and if it’s strong enough that we can reason about networks without thinking about polytopes.
Gotcha, thanks!

The polytope lens only becomes relevant when trying to explain what perfectly linear models can’t account for. Although LN might create a bias toward directions, each layer is still nonlinear; nonlinearities probably still need to be accounted for somewhere in our explanations.
Re this, this somewhat conflicts with my understanding of the direction lens. The point is not that things are perfectly linear. The point is that we can interpret directions after a non-linear activation function. The non-linearities are used between interpretable spaces to do some transformation mapping meaningful directions to new meaningful directions (and the exact details of how it does this are the circuits to interpret). See, e.g., my modular addition work for a very concrete example of this.
It’s mathematically true that any operation of a ReLU network will be manipulating polytopes (including a randomly initialised network!), and I understood the key claim of this post to be that the polytope lens more naturally maps onto interpreting the network and figuring out what’s going on.
A linear function can never do anything interesting to directions—it just transforms the available space, but cannot create new meaningful directions, just superpositions of the old ones.
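A concrete toy version of that last point (my own sketch, not from the thread): a single ReLU neuron turns two existing binary features into an AND feature, a new meaningful direction that no purely affine map of the inputs can compute exactly.

```python
import torch

# Two binary "input features" a and b.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
target = torch.tensor([0., 0., 0., 1.])                 # AND(a, b)

# One ReLU neuron, relu(a + b - 1), computes AND exactly: a genuinely new
# meaningful direction built out of the old ones.
w, b = torch.tensor([1., 1.]), -1.0
print(torch.relu(X @ w + b))                            # tensor([0., 0., 0., 1.])

# No affine map of a and b can do this: matching AND on the first three inputs
# forces every coefficient to zero. The least-squares affine fit misses on every input.
A = torch.cat([X, torch.ones(4, 1)], dim=1)             # columns: a, b, constant
coeffs = torch.linalg.pinv(A) @ target
print(A @ coeffs)                                       # ~[-0.25, 0.25, 0.25, 0.75], not AND
```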