Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...
Not sure I understand the distinction, could you rephrase?
If by “last slot” you mean last layer (as opposed to earlier layers), that seems like the same thing as outputting the input offset by one.
If by “last slot” you mean the token N+1 given tokens (1, 2, … N), then no, that’s not how GPT works. If you put in tokens (1, 2, … N), you always get guesses for tokens (2, 3, …, N+1) in response. This is true even if all you care about is the guess for N+1.
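To make the "offset by one" picture concrete, here is a minimal sketch (the token IDs are made up for illustration): a causal LM's training target at each position is just the input sequence shifted left by one, so it emits a guess at every position, not only the last.

```python
# A causal LM maps N input tokens to N output distributions.
# The training target at position i is the input token at i+1,
# i.e. the whole input sequence shifted by one.
tokens = [5, 17, 3, 42, 8]   # hypothetical token IDs (1, 2, ... N)

inputs = tokens              # positions 1 ... N
targets = tokens[1:]         # positions 2 ... N (plus the unseen N+1)

# The model guesses token 2 given token 1, token 3 given tokens 1-2,
# and so on -- even if all we care about is the guess for N+1.
for i, target in enumerate(targets):
    context = inputs[: i + 1]
    print(f"context {context} -> should predict {target}")
```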
I meant your latter interpretation.
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?
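For intuition on the triangle-inequality point, a toy sketch (the three distributions are made up): a "middle" distribution can be KL-close to both endpoints even when the endpoints are KL-far from each other.

```python
from math import log

def kl(p, q):
    """KL divergence D(p || q) in nats for finite distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy stand-ins for the input distribution, a middle layer, and the output:
p = [0.50, 0.50]   # "input"
q = [0.90, 0.10]   # "middle layer"
r = [0.99, 0.01]   # "output"

# KL does not satisfy the triangle inequality:
print(kl(p, q))   # ~0.51
print(kl(q, r))   # ~0.14
print(kl(p, r))   # ~1.61, which exceeds kl(p, q) + kl(q, r)
```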
GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values. If we used an activation function that’s linear on small values, I would therefore expect more of the calculation to be visible.
One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

(This is similar to asking the question “do the activations assign high or low probability to the input token?” One can answer the same question by plotting logits or ranks with the input layer included.)
It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts/scales the activations to mean 0 and L2 norm 1 before passing them to the next layer.
Actually, gelu is differentiable at 0, so it is linear on close-to-zero values.
Ah, I think we miscommunicated.
I meant “gelu(x) achieves its maximum curvature somewhere near x=0.”
People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|. In this sense gelu is like relu.
It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.
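The distinction can be checked numerically with a small sketch: measure each activation's deviation from additivity, normalized by the input scale. For relu this deviation is scale-invariant, while for gelu (differentiable at 0) it vanishes as the scale shrinks.

```python
from math import erf, sqrt

def relu(x):
    return max(x, 0.0)

def gelu(x):
    # Exact gelu: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def nonlinearity(f, s):
    """Deviation from additivity at scale s, normalized by s.

    A linear f gives exactly 0; relu gives 1 at every scale
    (relu(s) + relu(-s) = s); gelu's deviation shrinks as s -> 0.
    """
    return abs(f(s) + f(-s)) / s

for s in (1.0, 0.1, 0.001):
    print(s, nonlinearity(relu, s), nonlinearity(gelu, s))
```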
This is indeed unique to relu—I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops. Obviously you can’t simulate a differentiable function with that trick.
(OpenAI?)
Oh that’s not good. Looks like we’d need a version of float that keeps track of an interval of possible floats (represented by the two floats at its endpoints). Then we could simulate the behavior of infinite-precision floats so long as the network keeps the bounds tight, and we could train the network to keep the simulation in working order. Then we could see whether, in a network thus linear at small numbers, every visibly large effect has a visibly large cause.
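A minimal sketch of that interval idea (the class and names are hypothetical, just to illustrate tracking a value by two bounding floats):

```python
class Interval:
    """Tracks lower and upper bounds on a real value using two floats.

    Arithmetic widens the bounds conservatively; the simulation stays
    'in working order' as long as lo and hi remain close together.
    """
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        corners = [a * b for a in (self.lo, self.hi)
                   for b in (other.lo, other.hi)]
        return Interval(min(corners), max(corners))

    def width(self):
        return self.hi - self.lo

x = Interval(0.999, 1.001)      # a value known only to ~3 decimals
y = x * x + Interval(2.0, 2.0)  # bounds propagate through the ops
print(y.lo, y.hi, y.width())
```

(True interval arithmetic would also round lo down and hi up at each step; this sketch ignores rounding direction.)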
By the way—have you seen what happens when you finetune GPT to reinforce this pattern that you’re observing, that every entry of the table, not just the top right one, predicts an input token?
Maybe edit the post so you include this? I know I was wondering about this too.
The post has now been updated with a longish addendum about this topic.
Good idea, I’ll do that.
I know I’d run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it’s smaller.
Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like “plasma” are “kept around” for later use.
Consider also trying the other direction—after all, KL is asymmetric.