Got it, thanks for explaining! So the point is that during training the model has no power over the next token, so there’s no incentive for it to try to influence the world. It could generalize in a way where it tries to e.g. make self-fulfilling prophecies, but that’s not specifically selected for by the training process.
Yup exactly! One framing I sometimes find helpful is to classify systems in terms of the free variables upstream of loss that are optimized during training. In the case of GPT, internal activations are causally upstream of loss for “future” predictions in the same context window, but the output itself is not causally upstream of any effect on loss other than through myopic prediction accuracy (at any one training step): the ground truth is fixed w/r/t the model’s actions, and autoregressive generation isn’t part of the training game at all.
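For concreteness, here's a minimal sketch of what a single training step looks like under standard teacher-forced next-token prediction (the `model` and `optimizer` here are generic placeholders, not anything from the exchange above): the targets are fixed corpus tokens, and nothing the model would have sampled ever touches the loss.

```python
# Minimal sketch, assuming a standard PyTorch-style language-model training step
# with teacher forcing. Illustrative only; `model` is any module mapping token
# ids (batch, seq) to logits (batch, seq, vocab).
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens):
    # tokens: (batch, seq_len) ground-truth token ids from the corpus.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # targets are fixed data
    logits = model(inputs)                            # (batch, seq_len-1, vocab)
    # The loss compares logits against the fixed targets at each position.
    # The model's own sampled continuations never appear here: autoregressive
    # generation is not part of the training loop at all.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```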