I will try to explain Yann LeCun’s argument against auto-regressive LLMs, which I agree with. The main crux of it is that being extremely superhuman at predicting the next token from the distribution of internet text does not imply the ability to generate sequences of arbitrary length from that distribution.
GPT-4’s ability to impressively predict the next token depends very crucially on the tokens in its context window actually belonging to the distribution of internet text written by humans. When you run GPT in sampling mode, every token you sample from it takes it ever so slightly outside the distribution it was trained on. At each new generated token it still assumes that the past 999 tokens were written by humans, but since its actual input was generated partly by itself, as the length of the sequence you wish to predict increases, you take GPT further and further outside of the distribution it knows.
The most salient example of this is when you try to make chatGPT play chess and write chess analysis. At some point, it will make a mistake and write something like “the queen was captured” when in fact the queen was not captured. This is not the kind of mistake that chess books make, so it truly takes it out of distribution. What ends up happening is that GPT conditions its future output on its mistake being correct, which takes it even further outside the distribution of human text, until this diverges into nonsensical moves.
As GPT becomes better, the length of the sequences it can convincingly generate increases, but if the per-token error rate is e, the probability of an n-token sequence being correct is (1-e)^n, so cutting the error rate in half (a truly outstanding feat) merely doubles the length of its coherent sequences.
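(A quick back-of-the-envelope version of that scaling, with my own illustrative numbers: asking for the length $n$ at which a sequence is still fully correct with probability at least one half,

$$(1-e)^n \ge \tfrac{1}{2} \iff n \le \frac{\ln 2}{-\ln(1-e)} \approx \frac{\ln 2}{e} \quad \text{for small } e,$$

so a 1% per-token error rate gives roughly 69 coherent tokens, and halving it to 0.5% gives roughly 139.)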
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. You’d need to take all physics books ever written, intersperse them with LLM continuations, then have humans write the corrections to the continuations, like “oh, actually we made a mistake in the last paragraph, here is the correct way to relate pressure to temperature in this problem...”. This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
The conclusion that LeCun comes to: auto-regressive LLMs are doomed.
Is this a limitation in practice? Rap battles are a bad example because they happen to be an exception: a task premised on being “one shot” and real time. But the overall point stands. We ask GPT to do tasks in one try, one step, that humans do with many steps, iteratively and recursively.
Take this “the queen was captured” problem. As a human I might be analyzing a game, glance at the wrong move, think a thought about the analysis premised on that move (or even start writing words down!) and then notice the error and just fix it. I am doing this right now, in my thoughts and on the keyboard, writing this comment.
Same thing works with ChatGPT, today. I deal with problems like “the queen was captured” every day just by adding more ChatGPT steps. Instead of one-shotting, every completion chains a second ChatGPT prompt to check for mistakes. (You may need a third level to get to like 99%, because the checker blunders too.) The background chains can either ask for the original prompt to be regenerated, or reply to the original ChatGPT describing the error and ask it to fix its mistake. The latter form seems useful for code generation.
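A minimal sketch of what I mean by a background checking chain. `chat()` here is a stand-in for whatever wrapper you use to send one prompt and get one completion back, not a real library call:

```python
def chat(messages):
    """Stand-in for one round-trip to the model (e.g. a chat-completion API call).
    Takes a list of {"role": ..., "content": ...} dicts, returns the reply text."""
    raise NotImplementedError  # wire up to your API of choice

def answer_with_checker(question):
    # First pass: answer the question normally.
    answer = chat([{"role": "user", "content": question}])

    # Background chain: a second completion whose only job is to find mistakes.
    critique = chat([
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Check the answer above for factual or logical "
                                    "mistakes. List them, or say 'NO MISTAKES'."},
    ])
    if "NO MISTAKES" in critique:
        return answer

    # Reply to the original thread describing the error and ask for a fix.
    return chat([
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": f"A reviewer found these problems:\n{critique}\n"
                                    "Please rewrite the answer with them fixed."},
    ])
```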
Like right now I typically do 2 additional background chains by default, for every single thing I ask ChatGPT. Not just in a task where I’m seeking rigour and want to avoid factual mistakes like “the queen was captured”, but just to get higher quality responses in general.
Original Prompt → Improve this answer. → Improve this Answer.
Not literally just those three words, but even something that simple is actually better than just asking one time. Seriously. Try it, confirm, and make it a habit. Sometimes it’s shocking. I ask for a simple javascript function, it pumps out a 20 line function that looks fine to me. I habitually ask for a better version and “Upon reflection, you can do this in two lines of javascript that run 100x faster.”
If GPT were 100x cheaper I would be tempted to just go wild with this. Every prompt is 200 or 300 prompts in the background, invisibly, instead of 2 or 3. I’m sure there are diminishing returns and the chain would be more complicated than repeating “Improve” 100 times, but if it were fast and cheap enough, why not do it?
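As a sketch, the whole “Improve this answer” habit is just a loop; `chat()` is the same hypothetical helper as above, and the iteration count is where the diminishing returns would bite:

```python
def improved_answer(question, rounds=2):
    # Keep the running conversation so each "Improve" sees the previous attempt.
    messages = [{"role": "user", "content": question}]
    answer = chat(messages)
    for _ in range(rounds):  # 2-3 today; 200-300 if it were ~100x cheaper
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "Improve this answer."},
        ]
        answer = chat(messages)
    return answer
```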
As an aside, I think about asking ChatGPT to write code like asking a human to code a project on a whiteboard, without the internet to find answers, a computer to run code on, or even paper references. The human can probably do it, sort of, but I bet the code will have tons of bugs and errors and even API ‘hallucinations’ if you run it! I think it’s even worse than that: it’s almost like ChatGPT isn’t even allowed to erase anything it wrote on the whiteboard either. But we don’t need to one shot everything, so do we care about infinite length completions? Humans do things in steps, and when ChatGPT isn’t trying to whiteboard everything, when it can check API references, when it can see what the code returns or what errors it throws, when it can recurse on itself to improve things, it’s way better. Right now the form this takes is a human on the ChatGPT web page asking for code, running it, and then pasting the error message back into ChatGPT. The more automated versions of this are trickling out. Then I imagine the future, asking ChatGPT for code when it’s 1000x cheaper. And my one question behind the scenes is actually 1000 prompts looking up APIs on the internet, running the code in a simulator (or for real, people are already doing that), looking at the errors or results, etc. And that’s the boring, unimaginative extrapolation.
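The “paste the error message back in” loop is easy to sketch too. Again, `chat()` is the same hypothetical helper, and running untrusted generated code like this is only sensible in a sandbox:

```python
import subprocess, tempfile

def generate_working_code(task, max_attempts=5):
    code = chat([{"role": "user", "content": f"Write a Python script that {task}. "
                                             "Reply with code only."}])
    for _ in range(max_attempts):
        # Run the generated script and capture whatever it prints or raises.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code
        # Paste the error back in, exactly like doing it by hand on the web page.
        code = chat([{
            "role": "user",
            "content": f"This script failed with:\n{result.stderr}\n"
                       f"Here is the script:\n{code}\nFix it and reply with code only.",
        }])
    return code
```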
Also, this is probably obvious, but just in case: if you try asking “Improve this answer.” repeatedly in ChatGPT, you need to manage your context window size. Migrate to a new conversation when you get about 75% full. OpenAI should really warn you, because even before 100% the quality drops like a rock. Just copy over your original request and the last best answer(s). If you’re doing it manually, select a few other useful bits too.
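For the context-window bookkeeping, a rough sketch (assuming the crude ~4 characters per token heuristic rather than a real tokenizer, and a made-up 8k-token limit):

```python
MAX_TOKENS = 8000   # assumed model limit, not a real spec
MIGRATE_AT = 0.75   # start a fresh conversation at ~75% full

def approx_tokens(messages):
    # Very rough: ~4 characters per token on English text.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_migrate(messages, original_request, best_answer):
    if approx_tokens(messages) < MIGRATE_AT * MAX_TOKENS:
        return messages
    # Fresh conversation: keep only the original request and the last best answer.
    return [
        {"role": "user", "content": original_request},
        {"role": "assistant", "content": best_answer},
    ]
```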
I think you’ve had more luck than me when trying to get chatGPT to correct its own mistakes. When I tried making it play chess, I told it to “be sure not to output your move before writing a paragraph of analysis on the current board position, and output 5 good moves and the reasoning behind them, all of this before giving me your final move.” Then after it chose its move I told it “are you sure this is a legal move? and is this really the best move?”, it pretty much never changed its answer, and never managed to figure out that its illegal moves were illegal. If I straight-up told it “this move is illegal”, it would excuse itself and output something else, and sometimes it correctly understood why its move was illegal, but not always.
so do we care about infinite length completions?
The inability of the GPT series to generate infinite length completions is crucial for safety! If humans fundamentally need to be in the loop for GPT to give us good outputs for things like scientific reasoning, then it makes the whole thing suddenly way safer, and we can be assured that there isn’t an instance of GPT running on some Amazon server, improving itself by just doing a thousand years of scientific progress in a week.
Does the inability of the GPT series to generate infinite length completions require that humans specifically remain in the loop, or just that the external world must remain in the loop in some way which gets the model back into the distribution? Because if it’s the latter case I think you still have to worry about some instance running on a cloud server somewhere.
When I prompt GPT-5 it’s already out of distribution because the training data mostly isn’t GPT prompts, and none of it is GPT-5 prompts. If I prompt with “this is a rap battle between Dath Ilan and Earthsea” that’s not a high likelihood sentence in the training data. And then the response is also out of distribution, because the training data mostly isn’t GPT responses, and none of it is GPT-5 responses.
So why do we think that the responses are further out of distribution than the prompts?
Possible answer: because we try to select prompts that work well, with human ingenuity and trial and error, so they will tend to work better and be effectively closer to the distribution. Whereas the responses are not filtered in the same way.
But the responses are optimized only to be in distribution, whereas the prompts are also optimized for achieving some human objective like generating a funny rap battle. So once the optimizer achieves some threshold of reliability the error rate should go down as text is generated, not up.
“Being out of distribution” is not a yes-no question, but a continuum. I agree that all prompts given to GPT are slightly out of distribution simply by virtue of being prompts to a language model, but the length of a prompt is generally not large enough to enable GPT to really be sure of that. If I give you 3 sentences of a made-up physics book introduction, you might guess that no textbook actually starts with those 3 sentences… but that’s really just not enough information to be sure. However, if I give you 5 pages, you then have enough information to tell whether this is really a physics textbook or not.
The point is that sequence length matters: the internet is probably large enough to populate the space of 200-token (number pulled out of my ass) text sequences densely enough that GPT can extrapolate to most other sequences of that length, but things gradually change as the sequences get longer. And certainly by the time you get to book-length or longer, any sequence that GPT could generate will be so far out of distribution that it will be complete gibberish.
Could we agree on a testable prediction of this theory? For example, looking at the chess degradation example. I think your argument predicts that if we play several games of chess against ChatGPT in a row, its performance will keep going down in later games, in terms of both quality and legality. Potentially such that the last attempt will be complete gibberish. Would that be a good test?
Certainly I would agree with that. In fact, right now I can’t even get ChatGPT to play a single game of chess (against Stockfish) from start to finish without it at some point outputting an illegal move. I expect that future versions of GPT will be coherent for longer, but I don’t expect GPT to suddenly “get it” and be able to play legal and coherent chess for sequences of arbitrary length. (Google tells me that chess has a typical sequence length of about 40 moves, so maybe Go would be a better choice, with a typical game running around 150 moves.) And I certainly don’t expect GPT to be able to play chess AND also write coherent chess commentary between each move, since that would greatly increase the timescale of required coherence.
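One way to make that test concrete (a sketch only: `ask_model_for_move()` is a hypothetical wrapper around the chat API, and the `python-chess` library does the legality checking):

```python
import chess

def ask_model_for_move(transcript):
    """Hypothetical: send the running transcript to the model, get one SAN move back."""
    raise NotImplementedError

def plies_until_first_illegal(n_games=5, max_plies=80):
    results = []
    transcript = ""                    # all games share one conversation on purpose
    for _ in range(n_games):
        board = chess.Board()
        survived = max_plies
        for ply in range(max_plies):
            move = ask_model_for_move(transcript)
            try:
                board.push_san(move)   # raises ValueError on illegal or unparsable moves
            except ValueError:
                survived = ply
                break
            transcript += move + " "
            if board.is_game_over():
                break
        results.append(survived)
        transcript += "\nNew game.\n"
    return results                      # prediction: the numbers shrink for later games
```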
Did you mean GPT-4 here? (Or are you from the future :-)

Just a confusing writing choice, sorry. Either it’s the timeless present tense or it’s a grammar error, take your pick.

GPT-4 was privately available within OpenAI long before it was publicly released. It’s not necessary to be from the future to be able to interact with GPT-5 before it’s publicly released.
Okay, but I’m still wondering if Randall is claiming he has private access, or is it just a typo?
Edit: looks like it was a typo?

https://www.theverge.com/2023/4/14/23683084/openai-gpt-5-rumors-training-sam-altman
At MIT, Altman said the letter was “missing most technical nuance about where we need the pause” and noted that an earlier version claimed that OpenAI is currently training GPT-5. “We are not and won’t for some time,” said Altman. “So in that sense it was sort of silly.”
This argument seems to depend on:

1. After the initial prompt, GPT’s input is 100% self-generated.
2. GPT has no access to plugins.
3. GPT can’t launch processes to gather and train additional models on other forms of data.
I’m not an expert in this topic, but it seems to me that “doomed” is the wrong word. LLMs aren’t the fastest or most reliable way to compute 2+2, but it is going to become trivial for them to access the tool that is the best way to perform this computation. They will be able to gather data from the outside world using these plugins. They will be able to launch fine-tuning and training processes and interact with other pre-trained models. They will be able to interact with robotics and access cloud computing resources.
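This is not how the actual ChatGPT plugin protocol works; the sketch below is just a generic illustration of the tool-dispatch idea, reusing the hypothetical `chat()` helper from the earlier sketches:

```python
import ast, operator

# Minimal "calculator plugin": the model decides *when* to call it; the tool does the math.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr):
    """Safely evaluate an arithmetic expression like '2+2' without eval()."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def answer(question):
    # Made-up protocol: the model replies either with text or with "CALL calc: <expr>".
    reply = chat([{"role": "user", "content": question + "\nIf arithmetic is needed, "
                                              "reply exactly 'CALL calc: <expression>'."}])
    if reply.startswith("CALL calc:"):
        result = calc(reply.removeprefix("CALL calc:").strip())
        reply = chat([{"role": "user", "content": f"{question}\nThe calculator returned "
                                                  f"{result}. Answer the question."}])
    return reply
```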
LLMs strike me as analogous to the cell. Is a cell capable of vision on its own? Only in the most rudimentary sense of having photoresponsive molecules that trigger cell signals. But cells that are configured correctly can form an eye. And we know that cells have somehow been able to evolve themselves into a functioning eye. I don’t see a reason why LLMs, perhaps in combination with other software structures, can’t form an AGI with some combination of human and AI-assisted engineering.
Apparently LLMs automatically correct mistakes in CoT, which seems to run counter to LeCun’s argument.

To make the argument sharper, I will argue the following (taken from another comment of mine and posted here to have it in one place): sequences produced by LLMs very quickly become sequences with very low log-probability (compared with other sequences of the same length) under the true distribution of internet text.
Suppose we have a Markov chain $x_n$ with some transition probability $p(x_{n+1}|x_n)$; here $p$ is the analogue of the true generating distribution of internet text. From information theory (specifically the Asymptotic Equipartition Property), we know that the probability of a typical long sequence will be $p(x_1,\ldots,x_n) = \exp(-n H_p(X))$, where $H_p(X)$ is the entropy of the process.

Now if $q(x_{n+1}|x_n)$ is a different Markov chain (the analogue of the LLM generating text), which differs from $p$ by some amount, say that the Kullback-Leibler divergence $D_{KL}(q\|p)$ is non-zero (which is not quite the objective that the networks are being trained with, that would be $D_{KL}(p\|q)$ instead), we can also compute the expected log-probability under $p$ of sequences sampled from $q$; this is going to be:

$$\mathbb{E}_{x\sim q}\left[\log p(x_1,\ldots,x_n)\right] = \int q(x_1,\ldots,x_n)\,\log p(x_1,\ldots,x_n)\;dx_1\ldots dx_n$$

$$= \int \left( q(x_1,\ldots,x_n)\,\log\frac{p(x_1,\ldots,x_n)}{q(x_1,\ldots,x_n)} + q(x_1,\ldots,x_n)\,\log q(x_1,\ldots,x_n) \right) dx_1\ldots dx_n$$

The second term in this integral is just $-n H_q(X)$, $n$ times the entropy of $q$, and the first term is $-n D_{KL}(q\|p)$, so when we put everything together, the typical probability under $p$ of a sequence sampled from $q$ is:

$$p(x_1,\ldots,x_n) = \exp\!\left(-n\,\big(D_{KL}(q\|p) + H_q(X)\big)\right)$$

So any difference at all between $H_p(X)$ and $D_{KL}(q\|p) + H_q(X)$ will lead to the probability of almost all sequences sampled from our language model being exponentially squashed relative to the probability of typical sequences sampled from the original distribution. I can also argue that $H_q(X)$ will be strictly larger than $H_p(X)$: the latter can essentially be viewed as the entropy resulting from a perfect LLM with an infinite context window, and $H(X|Y) \le H(X)$, conditioning on further information does not increase the entropy. So $\big(D_{KL}(q\|p) + H_q(X) - H_p(X)\big)$ will definitely be positive.
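This can be checked numerically on a toy example. The sketch below (my own illustration, with made-up 3-state transition matrices) samples a long sequence from $q$ and compares the empirical per-step log-probability under $p$ with the predicted $-(H_q(X) + D_{KL}(q\|p))$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 3-state Markov chains: p is the "true" process, q the imperfect imitator.
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
q = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.7, 0.2],
              [0.3, 0.2, 0.5]])

def sample(T, n, x0=0):
    xs = [x0]
    for _ in range(n - 1):
        xs.append(rng.choice(3, p=T[xs[-1]]))
    return xs

def logprob(T, xs):
    # Log-probability of the transitions in xs under transition matrix T.
    return sum(np.log(T[a, b]) for a, b in zip(xs, xs[1:]))

n = 100_000
xs = sample(q, n)
print("empirical (1/n) log p of a q-sample:", logprob(p, xs) / n)

# Analytic prediction: -(H_q + D_KL(q||p)) per step, averaged over q's stationary distribution.
evals, evecs = np.linalg.eig(q.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
H_q  = -np.sum(pi[:, None] * q * np.log(q))
D_kl =  np.sum(pi[:, None] * q * np.log(q / p))
print("predicted:", -(H_q + D_kl))
```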
This means that if you sample long enough from an LLM, and more importantly as the context window increases, it must generalise very far out of distribution to give good outputs. The fundamental problem of behaviour cloning I’m referring to is that we need examples of how to behave correctly in this very-out-of-distribution regime, but LLMs simply rely on the generalisation ability of transformer networks. Our prior should be that if you don’t provide examples of correct outputs within some region of the input space to your function fitting algorithm, you shouldn’t expect the algorithm to yield correct predictions in that region.
At each new generated token it still assumes that the past 999 tokens were written by humans
By no means is this necessary. During fine-tuning for dialogue and question-answering, GPT is clearly selected for discriminating the boundaries of user-generated and, equivalently, self-generated text in its context (and probably these boundaries are marked with special control tokens).
If we were talking about GPTs trained in pure SSL mode without any fine-tuning whatsoever, that would be a different story, but this is not practically the case.
I have trouble framing this thing in my mind because I do not understand what the distribution is relative to. In the strictest sense, the distribution of internet text is the internet text itself, and everything GPT outputs is an error. In a broad sense, what is an error and what isn’t? I think there’s something meaningful here, but I can not pinpoint it clearly.
This strongly shows that GPT won’t be able to stay coherent with some initial state, which was already clear from it being autoregressive. It only weakly indicates that GPT won’t learn, somewhere in its weights, the correct schemes to play chess, which could then be somehow elicited.
How does this not apply to humans? It seems to me we humans do have a finite context window, within which we can interact with a permanent associative memory system to stay coherent over the longer term. The next obvious step with LLMs is introducing tokens that represent actions and having the model interact with other subsystems or the external world, as many people are trying to do (e.g., PaLM-E). If this direction of improvement pans out, I would argue that LLMs leading to these “augmented LLMs” would not count as “LLMs being doomed”.
3a) It applies to humans, and humans are doomed :)
LLMs are already somewhat able to generate dialogues where they err and then correct themselves in a systematic way (e.g., Reflexion). If there really was a need to create large datasets of err-and-correct text, I do not exclude that they could be generated with the assistance of existing LLMs.
This strongly shows that GPT won’t be able to stay coherent with some initial state, which was already clear from it being autoregressive
This problem is not coming from the autoregressive part: if the dataset GPT was trained on contained a lot of examples of GPT making mistakes and then being corrected, it would be able to stay coherent for a long time (once it started to make small deviations, it would immediately correct them, because those small deviations were in the dataset, making it stable). This doesn’t apply to humans because humans don’t produce their actions by trying to copy some other agent; they learn their policy through interaction with the environment. So it’s not that a system in general is unable to stay coherent for long, but only that systems trained by pure imitation aren’t able to do so.
Ok, now I understand better and I agree with this point. It’s like how you learn something faster if a teacher lets you try in small steps and corrects your errors at a granular level, instead of leaving you alone in front of a large task you blankly stare at.
It seems to me like you would only need to fine-tune on a dataset of like 50k diverse samples with this type of error correction built in, or RLHF this type of error correction?
This same problem exists in the behaviour cloning literature: if you have an expert agent behaving under some policy π_expert, and you want to train some other policy to copy the expert, samples from the expert policy are not enough; you need a lot of data that shows your agent how to behave when it gets out of distribution. This was the point of the DAgger paper, and in practice the data that shows the agent how to get back into distribution is significantly larger than the pure expert dataset. There are very many ways that GPT might go out of distribution, and just showing it how to come back for a small fraction of them won’t be enough.
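For reference, the DAgger loop being described is roughly the following (a schematic sketch, not the paper’s exact algorithm; `expert_action`, `rollout`, and `fit_policy` are placeholders for the environment-specific pieces):

```python
def dagger(expert_action, rollout, fit_policy, expert_demos, n_iters=10):
    """Schematic DAgger: iteratively label the *learner's own* visited states with
    expert actions, so the dataset covers the states the learner actually reaches."""
    dataset = list(expert_demos)          # (state, expert_action) pairs
    policy = fit_policy(dataset)          # behaviour cloning on expert data alone
    for _ in range(n_iters):
        states = rollout(policy)          # states visited when the *learner* acts
        dataset += [(s, expert_action(s)) for s in states]   # expert relabels them
        policy = fit_policy(dataset)      # retrain on the aggregated dataset
    return policy
```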
I have not read the paper you link, but I have this expectation about it: that the limitation of imitation learning is proved in a context that lacks richness compared to imitating language.
My intuition is: I have experience myself of failing to learn just from imitating an expert playing a game the best way possible. But if someone explains to me their actions, I can then learn something.
Language is flexible and recursive: you can in principle represent anything out of the real world in language, including language itself, and how to think. If somehow the learner manages to tap into recursiveness, it can shortcut the levels. It will learn how to act meaningfully not because it has covered all the possible examples of long-term sequences that lead to a goal, but because it has seen many schemes that map to how the expert thinks.
I cannot learn chess efficiently by observing a grandmaster play many matches and jotting down all the moves. I could do it if the grandmaster were a short program, even one implemented in chess moves.
This is an alignment problem: you/LeCun want semantic truth, whereas the actual loss function has the goal of producing statistically reasonable text.
Mostly. The fine-tuning stage puts an additional layer on top of all that, and skews the model towards stating true things so much that we get surprised when it *doesn’t*.
What I would suggest is that aligning an LLM to produce text should not be done with RLHF; instead it may be necessary to extract the internal truth predicate from the model and ensure that the output is steered to keep that neuron assembly lit up.
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. [...] This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
I had assumed that creating that dataset was a major reason for doing a public release of ChatGPT. “Was this a good response?” [thumb-up] / [thumb-down] → dataset → more RLHF. Right?
RLHF is done after the pre-training process. I believe this is referring to including examples like this in the pre-training process itself.
Though in broad strokes, I agree with you. It’s not inconceivable to me that they’ll turn/are turning their ChatGPT data into its own training data for future models using this concept of corrected mistakes.
I’ve never enjoyed, or agreed with, arguments of the form: “X is inherently, intrinsically incapable of Y.” The presence of such statements indicates that there is some social tension of the form “X might be inherently, intrinsically capable of Y.” There might be a bias towards the moderate social acceptance of statements such as “X is inherently, intrinsically incapable of Y” due to no more than it being possible to disprove trivially, if X is inherently, intrinsically capable of Y. Disprovable statements might be overrated a lot, and if so, boy, would I hate that.
This seems kind of relevant to the main point of this post too:
GPTs are not Imitators, nor Simulators, but Predictors.
Question: Is GPT-5 an Imitator? A Simulator? A Predictor? Is GPT-6?
Does the message of this post become moot on larger, more powerful LLMs? Or does it predict that such models have already reached their limit?