Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...
If each layer were trained to give its best guess at the next token, this myopia would prevent all sorts of hiding data for later. This would be a good experiment for your last story, yes? I expect this would perform very poorly, though if it doesn’t, hooray, for I really don’t expect that version to develop inner optimizers.
I think I understand your question and was also confused by this for a bit, so I wanted to add some points of clarification. First, I want to point out that I really couldn’t find a satisfactory explanation of this particular detail (at least not one that I could understand), so I pieced this together myself from looking at the huggingface code for GPT-2. I may get some details wrong.
During training, at each step GPT-2 takes in N tokens and outputs N tokens. But the i-th output is computed in such a way that it only relies on information from tokens 1, …, i, and it is meant to predict the (i+1)-th token from these. I think it’s best to think of each output as being computed independently of the others (though this isn’t strictly true, since the separate outputs are computed by shared matrices). So for each i, we train the network so that the i-th output produces the correct result given the _input_ tokens 1, …, i. There is a term in the loss function for each output token, and the total loss is the sum of the losses of all the output tokens. The outputs at other positions play no role in the i-th output; only the first 1, …, i input tokens do.
During inference, given an input of k tokens, we are only concerned with the k-th output token (which should predict the token following the first k). The model also produces predictions for the outputs before position k, but these are just ignored, since we already know what those values should be.
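For concreteness, here is a rough sketch of both points using the huggingface implementation (illustrative only; the prompt and variable names are made up, but the shift-by-one loss mirrors what GPT2LMHeadModel computes internally when you pass labels):

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

tokens = tokenizer("The quick brown fox jumps", return_tensors="pt").input_ids  # shape (1, N)
with torch.no_grad():
    logits = model(tokens).logits  # shape (1, N, vocab): output i is a guess for token i+1

# Training objective: output i is scored against input token i+1 (the last output
# has no target and is dropped). There is one loss term per position, and output i
# depends only on inputs 1..i because of the causal attention mask.
per_position_loss = F.cross_entropy(
    logits[0, :-1, :],   # predictions at positions 1..N-1
    tokens[0, 1:],       # targets: the input shifted left by one
    reduction="none",
)
# Averaging these terms gives the loss that model(tokens, labels=tokens).loss reports;
# summing instead of averaging only changes it by a constant factor.
total_loss = per_position_loss.mean()

# Inference: the model still produces a prediction at every position, but only the
# last one (the guess for token N+1) is used to continue the text.
next_token = logits[0, -1, :].argmax()
print(per_position_loss, total_loss, tokenizer.decode(next_token))
```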
Not sure I understand the distinction, could you rephrase?
If by “last slot” you mean last layer (as opposed to earlier layers), that seems like the same thing as outputting the input offset by one.
If by “last slot” you mean the token N+1 given tokens (1, 2, … N), then no, that’s not how GPT works. If you put in tokens (1, 2, … N), you always get guesses for tokens (2, 3, …, N+1) in response. This is true even if all you care about is the guess for N+1.
I meant your latter interpretation.
Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?
GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values. If we used an activation function that’s linear on small values, I would therefore expect more of the calculation to be visible.
One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.
(This is similar to asking the question “do the activations assign high or low probability to the input token?” One can answer the same question by plotting logits or ranks with the input layer included.)
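For anyone who doesn’t want to open the notebook, here is a rough sketch of the same kind of comparison (illustrative only, not the notebook’s code; lens_logprobs and kl are names made up here, and the choice of KL direction is arbitrary):

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

tokens = tokenizer("We can see the energy state of the plasma", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(tokens, output_hidden_states=True)  # hidden_states[0] = embeddings

# "Logit lens" decoding: push each layer's hidden state through the final layer norm
# and the unembedding matrix, then compare the resulting distributions.
def lens_logprobs(h):
    return F.log_softmax(model.lm_head(model.transformer.ln_f(h)), dim=-1)

def kl(p_logprobs, q_logprobs):
    # KL(P || Q); the direction matters, since KL is asymmetric.
    return (p_logprobs.exp() * (p_logprobs - q_logprobs)).sum()

layers = [lens_logprobs(h)[0, -1] for h in out.hidden_states]  # last position only
first, final = layers[0], layers[-1]
for i, lp in enumerate(layers):
    print(f"layer {i:2d}  KL(layer || input) = {kl(lp, first).item():8.3f}   "
          f"KL(layer || output) = {kl(lp, final).item():8.3f}")
```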
It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts/scales the activations to mean 0 and L2 norm 1 before passing them to the next layer.
Actually, gelu is differentiable at 0, so it is linear on close-to-zero values.
Ah, I think we miscommunicated.
I meant “gelu(x) achieves its maximum curvature somewhere near x=0.”
People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|. In this sense gelu is like relu.
It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.
This is indeed unique to relu—I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops. Obviously you can’t simulate a differentiable function with that trick.
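A quick numerical way to see the contrast (an illustrative sketch, not from the post): rescaling the input does nothing to relu’s shape, while gelu flattens into the linear map 0.5·x as the inputs shrink.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, -0.5, 0.5, 1.0])
for eps in [1.0, 1e-2, 1e-4]:
    relu_rescaled = F.relu(eps * x) / eps  # identical for every eps > 0, since relu(eps*x) = eps*relu(x)
    gelu_rescaled = F.gelu(eps * x) / eps  # approaches the linear map 0.5*x as eps -> 0
    print(eps, relu_rescaled.tolist(), gelu_rescaled.tolist())
```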
(OpenAI?)
Oh, that’s not good. Looks like we’d need a version of float that keeps track of an interval of possible floats (represented by the two floats at its endpoints). Then we could simulate the behavior of infinite-precision floats so long as the network keeps the bounds tight, and we could train the network to keep the simulation in working order. Then we could see whether, in a network thus linear at small numbers, every visibly large effect has a visibly large cause.
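A minimal sketch of what such an interval float might look like (purely illustrative; the class name, the outward-rounding choice, and the width penalty are all made up):

```python
import math
from dataclasses import dataclass

@dataclass
class IntervalFloat:
    lo: float
    hi: float

    def __add__(self, other):
        # Round the bounds outward so the true real-valued result stays inside them.
        return IntervalFloat(math.nextafter(self.lo + other.lo, -math.inf),
                             math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        # Taking the min/max over all endpoint products handles every sign combination.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return IntervalFloat(math.nextafter(min(products), -math.inf),
                             math.nextafter(max(products), math.inf))

    def width(self):
        # A candidate training penalty: keep this small so the simulation stays tight.
        return self.hi - self.lo

x = IntervalFloat(1.0, 1.0) * IntervalFloat(3.0, 3.0) + IntervalFloat(0.1, 0.1)
print(x, x.width())
```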
By the way—have you seen what happens when you finetune GPT to reinforce this pattern that you’re observing, that every entry of the table, not just the top right one, predicts an input token?
Maybe edit the post so you include this? I know I was wondering about this too.
The post has now been updated with a long-ish addendum about this topic.
Good idea, I’ll do that.
I know I’d run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it’s smaller.
Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like “plasma” are “kept around” for later use.
Consider also trying the other direction—after all, KL is asymmetric.
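A quick numerical reminder of the asymmetry (made-up distributions):

```python
import torch

p = torch.tensor([0.9, 0.05, 0.05])
q = torch.tensor([0.4, 0.3, 0.3])

def kl(a, b):
    # KL(a || b) = sum_x a(x) * log(a(x) / b(x))
    return (a * (a / b).log()).sum()

print(kl(p, q), kl(q, p))  # the two directions give different numbers
```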