I think at least some GPT-2 models have a really high-magnitude direction in their residual stream that might be used to preserve some scale information after LayerNorm. [I think Adam Scherlis originally mentioned or showed the direction to me, but maybe it was someone else?] It's maybe akin to the water-droplet artifacts in StyleGAN touched on here: https://arxiv.org/pdf/1912.04958.pdf
We begin by observing that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. As shown in Figure 1, even when the droplet may not be obvious in the final image, it is present in the intermediate feature maps of the generator. The anomaly starts to appear around 64×64 resolution, is present in all feature maps, and becomes progressively stronger at higher resolutions. The existence of such a consistent artifact is puzzling, as the discriminator should be able to detect it. We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other. We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere. Our hypothesis is supported by the finding that when the normalization step is removed from the generator, as detailed below, the droplet artifacts disappear completely.
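To make the analogy to LayerNorm concrete, here's a minimal numeric sketch (PyTorch; the width, spike position, and spike size are all illustrative, not claims about GPT-2's actual direction) of how one dominant coordinate can smuggle scale information past a normalization step:

```python
import torch

torch.manual_seed(0)
d = 768  # GPT-2 small's residual stream width, just for concreteness
signal = torch.randn(d)

def layernorm(x):
    # Plain LayerNorm, no learned gain/bias.
    return (x - x.mean()) / x.std()

# Without a spike, LayerNorm erases overall scale entirely:
# scaling the input by any positive constant gives the identical output.
print(torch.allclose(layernorm(3.0 * signal), layernorm(0.1 * signal)))  # True

# With one huge fixed coordinate dominating the mean/variance statistics,
# the normalization denominator is pinned by the spike, so the relative
# scale of everything else survives the normalization.
spike = torch.zeros(d)
spike[0] = 100.0
for s in (3.0, 0.1):
    out = layernorm(s * signal + spike)
    print(s, out[1:].std().item())  # clearly different for the two scales
```

The spike is compressive rather than lossless, but it's enough for downstream layers to read off how big the rest of the residual stream was before the norm, which is the same trick the StyleGAN authors hypothesize for the droplet.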
Interesting, thanks! Like, this lets the model somewhat localise the scaling effect, so there's not a ton of interference? This seems maybe linked to the results on Emergent Features in the residual stream.
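If anyone wants to check this empirically, here's a rough sketch (assuming the TransformerLens library; the layer index and prompt are arbitrary choices, and this only catches basis-aligned outlier dimensions rather than arbitrary directions):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
_, cache = model.run_with_cache(tokens)

# Residual stream after block 5, shape [batch, pos, d_model].
resid = cache["resid_post", 5][0]

# Mean absolute activation per residual-stream dimension; an outlier
# direction should show up as a dimension far above the median.
per_dim = resid.abs().mean(dim=0)
top_vals, top_dims = per_dim.topk(5)
print("largest-magnitude dims:", top_dims.tolist())
print("their mean |activation|:", [round(v, 1) for v in top_vals.tolist()])
print("median over all dims:", round(per_dim.median().item(), 2))
```

If the high-magnitude direction exists, it should appear as one or two dimensions sitting far above the median here.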