I think you missed the point. I agree that language models are predictors rather than imitators, and that they probably don’t work by time-stepping forward a simulation. Maybe Janus should have chosen a word other than “simulators.” But if you gensym out the particular choice of word, this post is encapsulating the most surprising development of the past few years in AI (and therefore, the world).
Chapter 10 of Bostrom’s Superintelligence (2014) is titled, “Oracles, Genies, Sovereigns, Tools”. As the “Inadequate Ontologies” section of this post points out, language models (as they are used and heralded as proto-AGI) aren’t any of those things. (The Claude or ChatGPT “assistant” character is, well, a simulacrum, not “the AI itself”; it’s useful to have the word simulacrum for this.)
This is a big deal! Someone whose story about why we’re all going to die was limited to, “We were right about everything in 2014, but then there was a lot of capabilities progress,” would be willfully ignoring this shocking empirical development (which doesn’t mean we’re not all going to die, but it could be for somewhat different reasons).
> repeatedly alludes to the loss function on which GPTs are trained corresponding to a “simulation objective”, but I don’t really see why that would be true [...] particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose
Call it a “prediction objective”, then. The thing that makes the prediction objective special is that it lets us copy intelligence from data, which would have sounded nuts in 2014 and probably still does (but shouldn’t).
If you think of gradient descent as an attempted “utility function transfer” (from loss function to trained agent) that doesn’t really work because of inner misalignment, then it may not be clear why it would induce simulator-like properties in the sense described in the post.
But why would you think of SGD that way? That’s not what the textbook says. Gradient descent is function approximation, curve fitting. We have a lot of data (x, y), and a function f(x, ϕ), and we keep adjusting ϕ to decrease −log P(y|f(x, ϕ)): that is, to make y = f(x, ϕ) less wrong. It turns out that fitting a curve to the entire internet is surprisingly useful, because the internet encodes a lot of knowledge about the world and about reasoning.
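To make the curve-fitting picture concrete, here is a minimal sketch of one step of that procedure; the toy embedding/unembedding model and the random (x, y) pairs are placeholders, not GPT itself:

```python
import torch
import torch.nn.functional as F

# Curve-fitting view of pretraining: given contexts x and next tokens y,
# adjust parameters phi to decrease -log P(y | f(x, phi)).
vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)   # part of phi
unembed = torch.nn.Linear(d_model, vocab_size)    # part of phi
opt = torch.optim.SGD(list(embed.parameters()) + list(unembed.parameters()), lr=1e-2)

x = torch.randint(0, vocab_size, (8,))  # placeholder "contexts" (one token each)
y = torch.randint(0, vocab_size, (8,))  # placeholder "next tokens"

logits = unembed(embed(x))        # f(x, phi)
loss = F.cross_entropy(logits, y) # mean of -log P(y | f(x, phi))
loss.backward()
opt.step()                        # nudge phi to make y = f(x, phi) less wrong
```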
If you don’t see why “other loss functions one could choose” aren’t as useful for mirroring the knowledge encoded in the internet, it would probably help to be more specific? What other loss functions? How specifically do you want to adjust ϕ, if not to decrease −log P(y|f(x, ϕ))?
Sure, I am fine with calling it a “prediction objective”, but if we drop the simulation abstraction then I think most of the sentences in this post don’t make sense. Here are some sentences which only make sense if you are talking about a simulation in the sense of stepping forward through time, and not just something optimized according to a generic “prediction objective”.
> A simulation is the imitation of the operation of a real-world process or system over time.
[...]
> It emphasizes the role of the model as a transition rule that evolves processes over time. The power of factored cognition / chain-of-thought reasoning is obvious.
[...]
> It’s clear that in order to actually do anything (intelligent, useful, dangerous, etc), the model must act through simulation of something.
[...]
> Well, typically, we avoid getting confused by recognizing a distinction between the laws of physics, which apply everywhere at all times, and spatiotemporally constrained things which evolve according to physics, which can have contingent properties such as caring about a goal.
[...]
> Below is a table which compares various simulator-like things to the type of simulator that GPT exemplifies on some quantifiable dimensions. The following properties all characterize GPT:
> Generates rollouts: The model naturally generates rollouts, i.e. serves as a time evolution operator
[...]
> Not only does the supervised/oracle perspective obscure the importance and limitations of prompting, it also obscures one of the most crucial dimensions of GPT: the implicit time dimension. By this I mean the ability to evolve a process through time by recursively applying GPT, that is, generate text of arbitrary length.
[...]
> This resulting policy is capable of animating anything that evolves according to that rule: a far larger set than the sampled trajectories included in the training data, just as there are many more possible configurations that evolve according to our laws of physics than instantiated in our particular time and place and Everett branch.
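For concreteness, the “recursive application” and “time evolution” in these excerpts refer to the ordinary autoregressive sampling loop, sketched minimally below; `model` stands in for a hypothetical function mapping the tokens so far to a next-token probability distribution:

```python
import torch

def rollout(model, prompt_tokens, n_steps):
    """Recursively apply the model as a transition rule: each step
    conditions on everything generated so far and appends one token."""
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        probs = model(tokens)                     # hypothetical: P(next token | tokens so far)
        next_token = torch.multinomial(probs, 1)  # sample the next "state update"
        tokens.append(int(next_token))            # ...and feed it back into the context
    return tokens
```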
I think these quotes illustrate that the concept of a simulator, as invoked in this post, is about simulating the process that gave rise to your training distribution, according to some definition of time. But I don’t think this is how GPT works, and I don’t think it helps you make good predictions about what happens. Many of the problems GPT successfully solves are not solvable via this kind of simulation, as far as I can tell.
I don’t think the behavior we see in large language models is well explained by the loss function being a “prediction objective”. Imagine a prediction objective that is not myopic, but instead requires creating long chains of internal inference to arrive at, closer in length to a full-context completion of GPT. I don’t see how such a prediction objective would give rise to the interesting dynamics that seem true of GPT. My guess is that in the pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general-purpose problem-solving, with substantial divergence from how we currently think of GPTs.
The fact that the training signal is so myopic, on the other hand, and applies at a token-by-token level, seems to explain a huge amount of the variance.
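To make the contrast concrete, here is a rough sketch of the two kinds of objective I have in mind; the `model.generate` interface and `reward_fn` are hypothetical placeholders:

```python
import torch.nn.functional as F

def myopic_pretraining_loss(logits, targets):
    """What pretraining actually does: an independent next-token
    cross-entropy term at every position, under teacher forcing."""
    # logits: (seq_len, vocab_size); targets: (seq_len,) token ids
    return F.cross_entropy(logits, targets)

def non_myopic_objective(model, prompt, reward_fn, max_new_tokens=512):
    """The contrasting, hypothetical objective: generate an entire rollout
    and score it with a long-horizon reward, so credit is assigned to the
    completion as a whole rather than token by token."""
    completion = model.generate(prompt, max_new_tokens=max_new_tokens)  # hypothetical interface
    return -reward_fn(completion)
```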
To be clear, I think there is totally interesting content to study in how language models work given the extremely myopic prediction objective that they optimize, which nevertheless gives rise to interesting high-level behavior, and I agree with you that studying that is among the most important things to do at the present time. But I don’t think this post offers a satisfying answer to the questions such studies raise, and indeed it seems to make a bunch of wrong predictions.
> Imagine a prediction objective that is not myopic, but instead requires creating long chains of internal inference to arrive at, closer in length to a full-context completion of GPT. I don’t see how such a prediction objective would give rise to the interesting dynamics that seem true of GPT. My guess is that in the pursuit of such a non-myopic prediction objective you would see the development of quite instrumental forms of reasoning and general-purpose problem-solving, with substantial divergence from how we currently think of GPTs.
The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
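A minimal sketch of that gradient flow, using a single causal attention layer with an illustrative stand-in loss: even though the loss is computed only at the final position, the gradient on the position-0 input is nonzero, because the final position attends back to position 0.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 8, 16
x = torch.randn(1, seq_len, d_model, requires_grad=True)

# One causal self-attention layer; True entries in the mask block attention.
attn = torch.nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=causal_mask)

loss = out[0, -1].pow(2).sum()   # stand-in "loss at the last position only"
loss.backward()
print(x.grad[0, 0].abs().sum())  # gradient at position 0: nonzero
```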
> the extremely myopic prediction objective that they optimize
As a smaller note, language models do not optimize the predictive objective, so much as the loss function optimizes the language model. I think the wording you chose is going to cause confusion and lead to incorrect beliefs.
> The pretraining objective isn’t myopic? The parameter updates route across the entire context, backing up from the attention scores of later positions through e.g. the MLP sublayer outputs at position 0.
This is something I’ve been thinking a lot about, but still don’t feel super robust in. I currently think it makes sense to describe the pretraining objective as myopic in the relevant way, but am really not confident. I agree that the training objective isn’t as myopic as I implied here, though I also don’t think the training objective is well-summarized as jointly optimizing the whole context-length response.
I have a dialogue I’ll probably publish soon about this, and would be interested in your comments on it when it goes live. Probably doesn’t make sense to go in-depth about this before that’s published, since it captures my current confusions and thoughts probably better than what I would write anew in a comment thread like this.