I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.
"the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately"
I don’t think it has to be in service of predicting the current token. Sometimes the loss is lower overall if the model makes only a halfhearted effort at predicting the current token, so that it can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn’t myopic.
As an example, induction heads make use of previous-token heads. The previous-token head isn’t actually that useful for predicting the output at the current position; it mostly exists to prepare some handy activations so that an induction head can look back from a later position and grab them.
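To make that concrete, here’s a hand-wired toy of the circuit in numpy (my own simplification, with one-hot embeddings and fixed attention patterns, not anything measured from a trained model):

```python
import numpy as np

# Hand-wired toy of the previous-token-head -> induction-head circuit.
# One-hot "embeddings" and fixed attention patterns; no learned weights.
vocab = ["A", "B", "C", "D"]
tok2id = {t: i for i, t in enumerate(vocab)}
seq = ["A", "B", "C", "A"]                 # classic induction setup: ...A B ... A -> predict B
T, V = len(seq), len(vocab)
emb = np.eye(V)[[tok2id[t] for t in seq]]  # (T, V) residual stream, one-hot per position

# Layer 1: previous-token head. Position i attends to i-1 and writes
# "the token before me was X" into position i's residual stream.
prev_attn = np.eye(T, k=-1)                # row i has a 1 in column i-1
prev_tok_info = prev_attn @ emb            # (T, V)
# This tells position i little about its *own* next token; it mostly exists
# to be read by a head at a later position.

# Layer 2: induction head. Query = current token's identity; key = layer 1's
# "previous token" info. Position k therefore attends to earlier positions j
# whose preceding token matches token k, and copies token j's identity forward.
scores = emb @ prev_tok_info.T                         # (T, T) raw attention scores
scores = np.where(np.tri(T, k=-1) > 0, scores, -1e9)   # causal mask: only j < k
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
pred = attn @ emb                                      # copy attended tokens' identities

print("prediction after the second 'A':", vocab[int(pred[-1].argmax())])  # -> B
```

The previous-token head’s output buys essentially nothing at its own position; its value only shows up when the induction head reads it from later in the sequence.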
So LMs won’t deliberately give bad predictions for the current token if they know a better prediction, but they aren’t putting all of their effort into finding that better prediction.
That’s an important nuance my description left out, thanks. Anything the gradients can reach can be bent to what those gradients serve, so the computation at a given token position can indeed be split between predicting that token and preparing for later ones, even if each position’s output should remain an unbiased prediction in expectation.
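To put the “anything the gradients can reach” point in code, here’s a minimal PyTorch sketch (my own toy: a single stock attention layer with random weights standing in for a full transformer). Scoring only the final token still sends gradient into the activations at every earlier position, which is exactly the lever that lets training repurpose early-position compute for later predictions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Score ONLY the final position and check that earlier positions' activations
# still receive gradient, i.e. later-token loss can shape earlier-token compute.
torch.manual_seed(0)
d_model, n_vocab, T = 16, 10, 5

emb = nn.Embedding(n_vocab, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
unemb = nn.Linear(d_model, n_vocab)

tokens = torch.randint(0, n_vocab, (1, T))
x = emb(tokens)                          # (1, T, d_model)
x.retain_grad()                          # keep per-position gradients for inspection

# Causal mask: True means "may NOT attend", so position i only sees j <= i.
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
h, _ = attn(x, x, x, attn_mask=causal)
logits = unemb(h)                        # (1, T, n_vocab)

# Cross-entropy at the FINAL position only (dummy target; only the gradient
# flow matters for this illustration).
loss = F.cross_entropy(logits[0, -1:], torch.tensor([0]))
loss.backward()

# The last token's loss reaches back through the keys and values at positions
# 0..T-2, so their gradients are nonzero too, not just the final position's.
print(x.grad.abs().sum(dim=-1))          # shape (1, T)
```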