Depends on what you mean by “sacrificing some loss on the current token if that made the following token easier to predict”.
The transformer architecture in particular is incentivized to do internal computations that help its future self predict future tokens, since those activations get looked up by attention at later positions; this sits alongside the myopic next-token objective rather than reducing to it. That can entail sacrificing some accuracy on the current token's prediction, since the computation isn't optimized purely for it. (This is why I said in footnote 26 that transformers aren't perfectly myopic in a sense.)
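To make that gradient path concrete, here's a minimal PyTorch sketch (an illustrative toy setup of mine, not anyone's actual training code): a loss computed only at a later position still backpropagates into the activations computed at earlier positions, because attention reads them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model, seq_len = 50, 32, 8

embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
layer.eval()  # disable dropout so the illustration is deterministic
head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, seq_len))
x = embed(tokens)
x.retain_grad()  # keep gradients on the per-position activations

# causal mask: position t can only attend to positions <= t
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
h = layer(x, src_mask=mask)
logits = head(h)

# compute a loss only at the *last* position (arbitrary target, purely for illustration)
loss = nn.functional.cross_entropy(logits[0, -1:], tokens[0, -1:])
loss.backward()

# earlier positions' activations still receive gradient, because the last
# position attended to them -- this is the "help your future self" pressure
print(x.grad[0].norm(dim=-1))  # nonzero norms at positions before the last
```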
But there aren’t training incentives for the model to prefer certain predictions because of the consequences if the sampled token were inserted into the stream of text, e.g. choosing a token that would make subsequent text easier to predict if the text were to continue from a sequence containing that token, because its predictions have no influence on the ground truth it has to predict during training. (For the same reason, there’s no direct incentive for GPT to fix behaviors that chain into bad multi-step predictions when it generates text that’s fed back into itself, like looping.)
Training incentives are just training incentives, though, not strict constraints on the model’s computation, and our current level of insight gives us no guarantee that models like GPT actually don’t/won’t care about the causal impact of their decoded predictions to any end, including making future predictions easier. Maybe there are arguments for why we should expect this kind of mesa-objective to develop rather than another, but I’m not aware of any convincing ones.
Got it, thanks for explaining! So the point is that during training the model has no power over the next token, so there’s no incentive for it to try to influence the world. It could generalize in a way where it tries to e.g. make self-fulfilling prophecies, but that’s not specifically selected for by the training process.
Yup exactly! One way I sometimes find it helpful to classify systems is in terms of the free variables upstream of loss that are optimized during training. In the case of GPT, internal activations are causally upstream of the loss for “future” predictions in the same context window, but the output itself is not causally upstream of any effect on loss other than through myopic prediction accuracy (at any one training step): the ground truth is fixed with respect to the model’s actions, and autoregressive generation isn’t part of the training game at all.
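For concreteness, here's a rough sketch of a standard teacher-forced next-token training step (my own illustration, with a hypothetical `model` and `optimizer`), showing what is and isn't causally upstream of the loss:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, corpus_tokens):
    """corpus_tokens: LongTensor of shape (batch, seq_len) drawn from the dataset."""
    inputs  = corpus_tokens[:, :-1]   # the model conditions on real corpus text only
    targets = corpus_tokens[:, 1:]    # ground truth fixed by the corpus,
                                      # independent of anything the model predicts

    logits = model(inputs)            # (batch, seq_len - 1, vocab)

    # per-position next-token loss: the only path from the model's outputs to
    # the loss is myopic prediction accuracy at each position
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # note what *doesn't* happen: we never sample from `logits` and feed the
    # sample back into `inputs` or `targets`, so no gradient rewards an output
    # for its downstream consequences
    return loss.item()
```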