Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...
Not sure I understand the distinction, could you rephrase?
If by “last slot” you mean last layer (as opposed to earlier layers), that seems like the same thing as outputting the input offset by one.
If by “last slot” you mean the token N+1 given tokens (1, 2, … N), then no, that’s not how GPT works. If you put in tokens (1, 2, … N), you always get guesses for tokens (2, 3, …, N+1) in response. This is true even if all you care about is the guess for N+1.
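To make the "offset by one" picture concrete, here is a minimal sketch (the token IDs are made up for illustration): a causal LM's training target at each position is just the input sequence shifted left by one, so it emits a guess at every position, not only the last.

```python
# A causal LM maps N input tokens to N output distributions.
# The training target at position i is the input token at i+1,
# i.e. the whole input sequence shifted by one.
tokens = [5, 17, 3, 42, 8]   # hypothetical token IDs (1, 2, ... N)

inputs = tokens              # positions 1 ... N
targets = tokens[1:]         # positions 2 ... N (plus the unseen N+1)

# The model guesses token 2 given token 1, token 3 given tokens 1-2,
# and so on -- even if all we care about is the guess for N+1.
for i, target in enumerate(targets):
    context = inputs[: i + 1]
    print(f"context {context} -> should predict {target}")
```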
I meant your latter interpretation.
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?
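For intuition on the triangle-inequality point, a toy sketch (the three distributions are made up): a "middle" distribution can be KL-close to both endpoints even when the endpoints are KL-far from each other.

```python
from math import log

def kl(p, q):
    """KL divergence D(p || q) in nats for finite distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy stand-ins for the input distribution, a middle layer, and the output:
p = [0.50, 0.50]   # "input"
q = [0.90, 0.10]   # "middle layer"
r = [0.99, 0.01]   # "output"

# KL does not satisfy the triangle inequality:
print(kl(p, q))   # ~0.51
print(kl(q, r))   # ~0.14
print(kl(p, r))   # ~1.61, which exceeds kl(p, q) + kl(q, r)
```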
GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values. If we used an activation function that’s linear on small values, I would therefore expect more of the calculation to be visible.
One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

(This is similar to asking the question “do the activations assign high or low probability to the input token?” One can answer the same question by plotting logits or ranks with the input layer included.)
It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts/scales the activations to mean 0 and L2 norm 1 before passing them to the next layer.
Actually, gelu is differentiable at 0, so it is linear on close-to-zero values.
Ah, I think we miscommunicated.
I meant “gelu(x) achieves its maximum curvature somewhere near x=0.”
People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|. In this sense gelu is like relu.
It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.
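The distinction can be checked numerically with a small sketch: measure each activation's deviation from additivity, normalized by the input scale. For relu this deviation is scale-invariant, while for gelu (differentiable at 0) it vanishes as the scale shrinks.

```python
from math import erf, sqrt

def relu(x):
    return max(x, 0.0)

def gelu(x):
    # Exact gelu: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def nonlinearity(f, s):
    """Deviation from additivity at scale s, normalized by s.

    A linear f gives exactly 0; relu gives 1 at every scale
    (relu(s) + relu(-s) = s); gelu's deviation shrinks as s -> 0.
    """
    return abs(f(s) + f(-s)) / s

for s in (1.0, 0.1, 0.001):
    print(s, nonlinearity(relu, s), nonlinearity(gelu, s))
```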
This is indeed unique to relu—I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops. Obviously you can’t simulate a differentiable function with that trick.
(OpenAI?)
Oh that’s not good. Looks like we’d need a version of float that keeps track of an interval of possible floats (represented by the two floats at its endpoints). Then we could simulate the behavior of infinite-precision floats so long as the network keeps the bounds tight, and we could train the network to keep the simulation in working order. Then we could see whether, in a network thus linear at small numbers, every visibly large effect has a visibly large cause.
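A minimal sketch of that interval idea (the class and names are hypothetical, just to illustrate tracking a value by two bounding floats):

```python
class Interval:
    """Tracks lower and upper bounds on a real value using two floats.

    Arithmetic widens the bounds conservatively; the simulation stays
    'in working order' as long as lo and hi remain close together.
    """
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        corners = [a * b for a in (self.lo, self.hi)
                   for b in (other.lo, other.hi)]
        return Interval(min(corners), max(corners))

    def width(self):
        return self.hi - self.lo

x = Interval(0.999, 1.001)      # a value known only to ~3 decimals
y = x * x + Interval(2.0, 2.0)  # bounds propagate through the ops
print(y.lo, y.hi, y.width())
```

(True interval arithmetic would also round lo down and hi up at each step; this sketch ignores rounding direction.)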
By the way—have you seen what happens when you finetune GPT to reinforce this pattern that you’re observing, that every entry of the table, not just the top right one, predicts an input token?
Maybe edit the post so you include this? I know I was wondering about this too.
The post has now been updated with a longish addendum about this topic.
Good idea, I’ll do that.
I know I’d run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it’s smaller.
Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like “plasma” are “kept around” for later use.
Consider also trying the other direction—after all, KL is asymmetric.