I’m not sure I understand the problem correctly, so let me try to rephrase it. I believe the key question is whether or not ChatGPT has a global plan. Let’s say we run ChatGPT one token at a time, each time appending the output token to the current prompt and computing the next one. This ignores decoding tricks such as beam search that may be useful in practice, but I don’t think they are the core issue here.
There is no memory between computing the first and the second token. The first time, you give ChatGPT the sequence “Once upon a” and it predicts “time”; at that point you could shut down the machine. The next time, you give it “Once upon a time” and it predicts the next word. So in a very strict sense there isn’t any global plan.
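To make the "no memory between calls" point concrete, here is a minimal sketch of that loop. Note that `predict_next` and `CANNED_STORY` are made-up stand-ins for the model, not anything ChatGPT actually does; the only property that matters is that the prediction is a pure function of the text passed in.

```python
# Hypothetical stand-in for the model: a pure function of the text so far.
CANNED_STORY = ["Once", "upon", "a", "time", "there", "was"]


def predict_next(tokens):
    """Return the next token using only the tokens passed in (no hidden state)."""
    n = len(tokens)
    if n < len(CANNED_STORY) and tokens == CANNED_STORY[:n]:
        return CANNED_STORY[n]
    return "<end>"


def generate(prompt, n_new):
    """Append one predicted token at a time, feeding the longer text back in."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(predict_next(tokens))
    return tokens


result = generate(["Once", "upon", "a"], 3)
print(" ".join(result))  # Once upon a time there was

# "Shutting down the machine" between tokens changes nothing: a fresh call
# with the longer text gives the same continuation.
resumed = predict_next(["Once", "upon", "a", "time"])
print(resumed)  # there
```

Nothing is carried over from one call to the next except the text itself.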
However, when you put “Once upon a time” into a transformer, it will actually reproduce the exact values from the “Once upon a” run, in addition to a new set of values for the new token. The reason is that attention is causal: each position only looks at the positions before it, so appending a word cannot change anything computed for the earlier words. Internally, you have a column of residual stream for every word (with one row per layer, on the order of a hundred; GPT-3 has 96), and the first three columns are identical between the two runs. So you could say that ChatGPT reconstructs* a plan every time it’s asked to output a next token. It comes up with a plan every single time you call it. The first N columns of the plan are identical to the previous plan, and with every new word you add one more column. So in that sense there is a global plan to speak of, but this also fits within the framework of predicting the next token.
“Hey ChatGPT, predict the next word!” --> ChatGPT looks at the text, comes up with a plan, and predicts the next word accordingly. Then it forgets everything, but the next time you give it the same text + one more word, it comes up with the same plan + a little bit extra, and so on.
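The claim that the earlier columns come out identical can be checked on a toy model. The sketch below is an assumption-laden miniature, not ChatGPT's architecture: it has random weights, only causal attention plus a residual connection, and no MLP, layer norm, or positional encoding. The names `forward`, `LAYERS`, and `EMBED` are made up for illustration.

```python
import math
import random

D = 4         # width of each residual-stream vector (toy value)
N_LAYERS = 3  # number of layers, i.e. "rows" (toy value)

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


EMBED = {w: [rng.uniform(-1, 1) for _ in range(D)]
         for w in ["Once", "upon", "a", "time"]}
# One set of weights per layer, shared by every position ("column").
LAYERS = [{name: rand_matrix(D, D) for name in ("q", "k", "v")}
          for _ in range(N_LAYERS)]


def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def causal_attention(layer, xs):
    """Position i attends only to positions 0..i, then adds a residual."""
    qs = [matvec(layer["q"], x) for x in xs]
    ks = [matvec(layer["k"], x) for x in xs]
    vs = [matvec(layer["v"], x) for x in xs]
    out = []
    for i, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(D)
                  for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) / total
                 for d in range(D)]
        out.append([x + m for x, m in zip(xs[i], mixed)])
    return out


def forward(words):
    """Return the residual stream as stream[layer][position] -> vector."""
    xs = [EMBED[w] for w in words]
    stream = [xs]
    for layer in LAYERS:
        xs = causal_attention(layer, xs)
        stream.append(xs)
    return stream


short = forward(["Once", "upon", "a"])
longer = forward(["Once", "upon", "a", "time"])

# Every layer of the first three columns matches the earlier run.
prefix_matches = all(longer[l][p] == short[l][p]
                     for l in range(len(short)) for p in range(3))
print(prefix_matches)  # True
```

The second run contains everything from the first run plus one new column for “time”, which is the "same plan + a little bit extra" picture.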
Regarding ‘If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me.’ I am not sure what you mean by this. An important thing to keep in mind is that it uses the same parameters for every “column”, i.e. for every word. There is no such thing as ChatGPT not visiting every parameter.
And please correct me if I have misunderstood any of this!
*In practice, people cache those intermediate results in GPU memory so that they don’t have to recompute the internal values every time. But this is equivalent to recomputing them, and recomputation has fewer complications to reason about.
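As a rough illustration of why caching and recomputing are equivalent, here is a sketch of a single attention layer computed both ways. It is a toy under stated assumptions (random weights, one layer, no residual or MLP), and `CachedLayer` and `recompute_all` are hypothetical names, not a real library API.

```python
import math
import random

D = 4  # vector width (toy value)
rng = random.Random(0)


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]


W_Q, W_K, W_V = rand_matrix(), rand_matrix(), rand_matrix()


def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def attend(q, ks, vs):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in ks]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, vs)) / total
            for d in range(D)]


def recompute_all(xs):
    """No cache: rebuild keys and values for the whole text on every call."""
    ks = [matvec(W_K, x) for x in xs]
    vs = [matvec(W_V, x) for x in xs]
    return [attend(matvec(W_Q, x), ks[: i + 1], vs[: i + 1])
            for i, x in enumerate(xs)]


class CachedLayer:
    """Keeps keys and values of earlier tokens so only the new one is computed."""

    def __init__(self):
        self.ks, self.vs = [], []

    def step(self, x):
        self.ks.append(matvec(W_K, x))
        self.vs.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.ks, self.vs)


tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(4)]

layer = CachedLayer()
cached = [layer.step(x) for x in tokens]  # one new column per call
full = recompute_all(tokens)              # everything from scratch

print(cached == full)  # True
```

The cache only saves work; the values it stores are exactly the ones a fresh run would produce.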