“Activation space gradient descent” sounds a lot like what the predictive coding framework is all about. Basically, you compare the top-down predictions of a generative model against the bottom-up perceptions of an encoder (or against the low-level inputs themselves) to create a prediction error. This error signal is sent back up to modify the activations of the generative model, minimizing future prediction errors.
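To make that loop concrete, here’s a minimal toy sketch (my own illustration, not anything from the post): a fixed linear generative model predicts the input from a latent vector, and the prediction error is used to refine the latent’s activations rather than the weights, which is what makes it look like gradient descent in activation space.

```python
# Toy predictive-coding inference: update activations (not weights) to
# reduce the error between top-down predictions and the observed input.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))        # fixed generative weights: latent -> predicted input
x = rng.normal(size=16)             # observed input
z = np.zeros(4)                     # latent activations to be inferred

lr = 0.05
for _ in range(100):
    pred = W @ z                    # top-down prediction of the input
    error = x - pred                # bottom-up prediction error
    z += lr * (W.T @ error)         # gradient descent on 0.5*||error||^2 in activation space

print(np.linalg.norm(x - W @ z))    # error shrinks as the latent is refined
```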
From what I know of Transformer models, it’s hard to tell exactly where this prediction error would be generated. Perhaps during few-shot learning, the model does an internal next-token prediction at every point along its input, comparing what it predicts the next token should be (based on the task it currently thinks it’s doing) against what the next token actually is. The resulting prediction error is fed “back” to the predictive model by being passed forward (via self-attention) to the next example in the input text, biasing the way it predicts next tokens in a way that would have given a lower error on the first example.
None of these predictions and errors would be visible unless you fed the input one token at a time and forced the hidden states to match what they were for the full input. A recurrent version of GPT might make that easier.
It would be interesting to see whether you could create a language model that had predictive coding built explicitly into its architecture, where internal predictions, error signals, etc. are all tracked at known locations within the model. I expect that interpretability would become a simpler task.
Here’s a sketch of the predictive-coding-inspired model I think you propose:
The initial layer predicts token i+1 from token i for all tokens. The job of each “predictive coding” layer would be to read all the true tokens and predictions from the residual streams, find the error between the prediction and the ground truth, then make a uniform update to all tokens to correct those errors. As in the dual form of gradient descent, where updating all the training data to be closer to a random model also allows you to update a test output to be closer to the output of a trained model, updating all the predicted tokens uniformly also moves prediction n+1 closer to the true token n+1. At the end, an output layer reads the prediction for n+1 out of the latent stream of token n.
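Here’s a toy numerical sketch of that uniform-update step (my own construction, with made-up dimensions, just to illustrate the mechanism): each “predictive coding” layer computes the average error on the visible (prediction, true-next-token) pairs and applies the same correction to every position, which also nudges the prediction at the final position, whose target is unseen.

```python
# Toy version of the proposed architecture: the residual stream at position i
# holds the true token embedding plus a running prediction for token i+1;
# each "predictive coding" layer applies one shared correction to all positions.
import numpy as np

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(10, d))        # embeddings of the true tokens 0..n

# crude stand-in for the initial layer: guess that token i+1 looks like token i
pred = tokens.copy()

n_layers, lr = 20, 0.5
for _ in range(n_layers):
    error = tokens[1:] - pred[:-1]        # error only where the target is visible
    correction = lr * error.mean(axis=0)  # one shared update, as in the dual form
    pred += correction                    # applied uniformly to every position

# output layer: read the prediction for token n+1 out of position n's stream
next_token_prediction = pred[-1]
```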
This would be a cool way for language models to work:
it puts next-token-prediction first and foremost, which is what we would expect for a model trained on next-token-prediction.
it’s an intuitive framing for people familiar with making iterative updates to models / predictions.
it’s very interpretable: at each step we can read off the model’s current prediction from the latent stream of the final token (and because the architecture is horizontally homogeneous, we can read off the model’s “predictions” for mid-sequence tokens too, though as you say they wouldn’t be quite the same as the predictions you would get for truncated sequences).
But we have no idea if GPT works like this! I haven’t checked if GPT has any circuits that fit this form; from what I’ve read of the Transformer Circuits sequence, they don’t seem to have found predicted tokens in the residual streams. The activation space gradient descent theory is equally compelling, and equally unproven. Someone (you? me? Anthropic?) should poke around in the weights of an LLM and see if they can find something that looks like this.
Interesting; iterative attention mechanisms have always reminded me of predictive coding, where cross-attention encodes a kind of prediction error between the latents and the data. But I could also see how self-attention could be read as a type of prediction error between tokens {0,...,n} and {1,...,n+1}.
There is some work comparing residual connections and iterative inference that may be relevant; they show that such architectures “naturally encourage features to move along the negative gradient of loss during the feedforward phase”. I expect some of these insights could be applied to the residual stream in transformers.
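To illustrate the connection being drawn there (a toy of my own, with a made-up quadratic loss, not code from the cited work): a residual block whose update happens to be a step along the negative gradient of some loss on the activations is doing iterative inference by gradient descent, so stacking such blocks iterates the inference.

```python
# Toy illustration: a residual update x <- x + f(x) where f(x) is a step along
# the negative gradient of a loss on the activations is exactly iterative
# inference by gradient descent.
import numpy as np

target = np.array([1.0, -2.0, 0.5])

def loss_grad(x):
    # gradient of the toy loss 0.5 * ||x - target||^2
    return x - target

def residual_block(x, eta=0.1):
    # x + f(x), with f chosen to move along the negative gradient
    return x + (-eta * loss_grad(x))

x = np.zeros(3)
for _ in range(50):          # stacking residual blocks == iterating the inference
    x = residual_block(x)
print(x)                     # approaches the loss minimum (the target)
```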
Don’t we have some evidence that GPTs are doing iterative prediction updating, from the logit lens and the later tuned lens? Not that that’s all they’re doing, of course.
I’m not sure the tuned lens indicates that the model is doing iterative prediction; it shows that if, for each layer in the model, you train a linear classifier to predict the next token from that layer’s activations, the classifiers get more and more accurate as you progress through the model. But that’s what we’d expect from any model, regardless of whether it was doing iterative prediction: each layer uses the features from the previous layer to compute features that are more useful to the next layer. The Inception network analysed in the Distill circuits thread starts by computing lines and gradients, then curves, then circles, then eyes, then faces, etc. Predicting the class from the presence of faces will be easier than from the presence of lines and gradients, so if you trained a tuned lens on Inception v1 it would show the same pattern: lenses from later layers would have lower perplexity. I think to really show iterative prediction, you would have to be able to use the same lens for every layer; that would show that there is some consistent representation of the prediction being updated with each layer.
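For concreteness, here’s a minimal sketch of the kind of per-layer probe I mean (my own illustration using a HuggingFace GPT-2, not the actual tuned-lens code, and trained on a single sentence rather than a real corpus):

```python
# One linear probe per layer, trained to map that layer's residual-stream
# activations to next-token logits. Later layers' probes should end up with
# lower loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states  # embeddings + each layer

d_model, vocab = model.config.n_embd, model.config.vocab_size
probes = [torch.nn.Linear(d_model, vocab) for _ in hidden]
targets = ids[0, 1:]                                  # next-token targets

for layer, probe in enumerate(probes):
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    acts = hidden[layer][0, :-1]                      # activations at positions with a target
    for _ in range(200):                              # (a real run would use a large corpus)
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(probe(acts), targets)
        loss.backward()
        opt.step()
    print(f"layer {layer}: probe loss {loss.item():.3f}")
```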
Here’s the relevant figure from the tuned lens—the transfer penalties for using a lens from one layer on another layer are small but meaningfully non-zero, and tend to increase the further away the layers are in the model. That they are small is suggestive that GPT might be doing something like iterative prediction, but the evidence isn’t compelling enough for my taste.
Thanks for the insightful response! Agreed, it’s just suggestive for now, though more so than with image models (where I’d expect lenses to transfer really badly, but I don’t know). Perhaps its being a residual network is the key thing: since effective path lengths are low, most of the information is “carried along” unchanged, meaning the same probe continues working for other layers. Idk.