Sure, I am fine with calling it a “prediction objective” but if we drop the simulation abstraction then I think most of the sentences in this post don’t make sense. Here are some sentences which only make sense if you are talking about a simulation in the sense of stepping forward through time, and not just something optimized according to a generic “prediction objective”.
> A simulation is the imitation of the operation of a real-world process or system over time.
[...]

> It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.

[...]

> It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.

[...]

> Well, typically, we avoid getting confused by recognizing a distinction between the laws of physics, which apply everywhere at all times, and spatiotemporally constrained things which evolve according to physics, which can have contingent properties such as caring about a goal.

[...]

> Below is a table which compares various simulator-like things to the type of simulator that GPT exemplifies on some quantifiable dimensions. The following properties all characterize GPT:
>
> Generates rollouts: The model naturally generates rollouts, i.e. serves as a time evolution operator

[...]

> Not only does the supervised/oracle perspective obscure the importance and limitations of prompting, it also obscures one of the most crucial dimensions of GPT: the implicit time dimension. By this I mean the ability to evolve a process through time by recursively applying GPT, that is, generate text of arbitrary length.

[...]

> This resulting policy is capable of animating anything that evolves according to that rule: a far larger set than the sampled trajectories included in the training data, just as there are many more possible configurations that evolve according to our laws of physics than instantiated in our particular time and place and Everett branch.
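(To be concrete about what the quoted “time evolution operator” / “recursively applying GPT” framing cashes out to operationally: it is the standard autoregressive sampling loop. A minimal sketch, where `model` stands in for any causal LM that maps token ids to next-token logits:)

```python
import torch

def rollout(model, input_ids: torch.Tensor, n_steps: int) -> torch.Tensor:
    """Recursively apply a next-token model to its own output.

    `model` is assumed to map a (1, t) tensor of token ids to
    (1, t, vocab_size) logits, as any causal LM does.
    """
    ids = input_ids
    for _ in range(n_steps):
        logits = model(ids)                              # (1, t, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)          # feed the sample back in as the new "state"
    return ids
```

Each step conditions only on the tokens generated so far, which is the sense in which sampling looks like stepping a state forward in time.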
I think these quotes illustrate that the concept of a simulator as invoked in this post is about simulating the process that gave rise to your training distribution, according to some definition of time. But I don’t think this is how GPT works, and I don’t think it helps you make good predictions about what happens. Many of the problems GPT successfully solves are not solvable via this kind of simulation, as far as I can tell.
I don’t think the behavior we see in large language models is well-explained by the loss function being a “prediction objective”. Imagine a prediction objective that is not myopic, but instead requires long chains of internal inference to satisfy, chains comparable in length to a full-context GPT completion. I don’t see how such a prediction objective would give rise to the interesting dynamics we actually observe in GPT. My guess is that in pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general-purpose problem-solving, diverging substantially from how we currently think of GPTs.
On the other hand, the fact that the training signal is so myopic, and applies at a token-by-token level, seems to explain a huge amount of the variance.
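(Concretely, the pretraining loss is a sum of per-position next-token terms,

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),$$

where each term scores only the model’s prediction of the single next token $x_t$ given the ground-truth prefix $x_{<t}$.)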
To be clear, I think there is totally interesting content to study in how language models work, given the extremely myopic prediction objective that they optimize and the interesting high-level behavior that nevertheless arises from it, and I agree with you that studying that is among the most important things to do at the present time. But I think this post doesn’t offer a satisfying answer to the questions raised by such studies, and indeed seems to make a bunch of wrong predictions.
> Imagine a prediction objective that is not myopic, but instead requires long chains of internal inference to satisfy, chains comparable in length to a full-context GPT completion. I don’t see how such a prediction objective would give rise to the interesting dynamics we actually observe in GPT. My guess is that in pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general-purpose problem-solving, diverging substantially from how we currently think of GPTs.
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backpropagating from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
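A minimal sketch of that claim, using a toy one-layer causal attention + MLP block rather than an actual GPT (the sizes and variable names are made up for illustration): a loss term computed only at the final position still has a nonzero gradient with respect to the activations at position 0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, vocab = 8, 16, 50             # toy sizes

emb = torch.nn.Embedding(vocab, d)
qkv = torch.nn.Linear(d, 3 * d)
mlp = torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.ReLU(), torch.nn.Linear(4 * d, d))
readout = torch.nn.Linear(d, vocab)

tokens = torch.randint(0, vocab, (T,))
x = emb(tokens)                     # (T, d) residual stream at the input
x.retain_grad()                     # so we can inspect d(loss)/d(position-0 activations)

q, k, v = qkv(x).chunk(3, dim=-1)
scores = (q @ k.T) / d ** 0.5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))   # no attending to future positions
h = x + torch.softmax(scores, dim=-1) @ v            # attention mixes earlier positions into later ones
h = h + mlp(h)                                       # position-wise MLP sublayer
logits = readout(h)                                  # (T, vocab)

# Single loss term at the final position: predict the last token from the preceding context.
loss_last = F.cross_entropy(logits[-2:-1], tokens[-1:])
loss_last.backward()

print(x.grad[0].norm())  # nonzero: the final position's loss reaches position 0's activations
```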
> the extremely myopic prediction objective that they optimize
As a smaller note, language models do not optimize the predictive objective, so much as the loss function optimizes the language model. I think the wording you chose is going to cause confusion and lead to incorrect beliefs.
> The pretraining objective isn’t myopic? The parameter updates route across the entire context, backpropagating from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
This is something I’ve been thinking a lot about, but still don’t feel very solid on. I currently think it makes sense to describe the pretraining objective as myopic in the relevant way, but am really not confident. I agree that the training objective isn’t as myopic as I implied here, though I also don’t think the training objective is well-summarized as jointly optimizing the whole context-length response.
I have a dialogue about this that I’ll probably publish soon, and would be interested in your comments on it when it goes live. It probably doesn’t make sense to go in depth on this before that’s published, since it captures my current confusions and thoughts better than what I would write anew in a comment thread like this.