I understood that the argument given for (2) (getting out of distribution) came from (1) (the $(1-\epsilon)^n$ thing), so could you restate the argument for (2) in a self-contained way, without using (1)?
No problem, though apologies if I say stuff that you already know; it’s both to remind myself of these concepts and because I don’t know your background.
Suppose we have a Markov chain $x_n$ with some transition probability $p(x_{n+1}|x_n)$; here $p$ is the analogue of the true generating distribution of internet text. From information theory (specifically the Asymptotic Equipartition Property), we know that the typical probability of a long sequence will be $p(x_1,\dots,x_n)=\exp(-nH_p(X))$, where $H_p(X)$ is the entropy of the process.
Now if $q(x_{n+1}|x_n)$ is a different Markov chain (the analogue of the LLM generating text) which differs from $p$ by some amount, say that the Kullback-Leibler divergence $D_{\mathrm{KL}}(q\|p)$ is non-zero (which is not quite the objective the networks are being trained with, that would be $D_{\mathrm{KL}}(p\|q)$ instead), we can also compute the expected log-probability under $p$ of sequences sampled from $q$; this is going to be:

$$\mathbb{E}_{x_n\sim q}\,\log p(x_1,\dots,x_n)=\int q(x_1,\dots,x_n)\,\log p(x_1,\dots,x_n)\,dx_1\dots dx_n$$

$$=\int\left(q(x_1,\dots,x_n)\,\log\frac{p(x_1,\dots,x_n)}{q(x_1,\dots,x_n)}+q(x_1,\dots,x_n)\,\log q(x_1,\dots,x_n)\right)dx_1\dots dx_n$$

The second term in this integral is just $-nH_q(X)$, $n$ times the entropy of $q$, and the first term is $-nD_{\mathrm{KL}}(q\|p)$, so when we put everything together:
$$p(x_1,\dots,x_n)=\exp\!\big(-n\,(D_{\mathrm{KL}}(q\|p)+H_q(X))\big)$$
So any difference at all between $H_p(X)$ and $D_{\mathrm{KL}}(q\|p)+H_q(X)$ will lead to the probability of almost all sequences sampled from our language model being exponentially squashed relative to the probability of most sequences sampled from the original distribution. I can also argue that $H_q(X)$ will be strictly larger than $H_p(X)$: the latter can essentially be viewed as the entropy resulting from a perfect LLM with an infinite context window, and $H(X|Y)\le H(X)$, i.e. conditioning on further information does not increase the entropy. So $\tau\equiv D_{\mathrm{KL}}(q\|p)+H_q(X)-H_p(X)$ will definitely be positive, and the ratio between the probability of a typical sequence of length $n$ sampled from the LLM and one sampled from internet text will go like $e^{-n\tau}$.
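To make that concrete, here is a toy numerical check of the scaling (entirely made-up 2-state transition matrices, nothing to do with an actual LLM): the per-step log-probability under $p$ of sequences sampled from $q$ should sit below that of sequences sampled from $p$ by exactly $\tau=D_{\mathrm{KL}}(q\|p)+H_q(X)-H_p(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: P is the "true" process, Q the imperfect imitator (made-up numbers).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
Q = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def sample(T, n):
    """Sample a length-n path from transition matrix T, starting in state 0."""
    x = np.empty(n, dtype=int)
    x[0] = 0
    for i in range(1, n):
        x[i] = rng.choice(2, p=T[x[i - 1]])
    return x

def log_prob_rate(T, x):
    """Per-step log-probability of path x under transition matrix T (start state ignored)."""
    return float(np.mean(np.log(T[x[:-1], x[1:]])))

def stationary(T):
    """Stationary distribution of a transition matrix."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

n = 20_000
rate_p = log_prob_rate(P, sample(P, n))   # ~ -H_p(X)
rate_q = log_prob_rate(P, sample(Q, n))   # ~ -(D_KL(q||p) + H_q(X))
print("empirical tau per step:", rate_p - rate_q)

pi_p, pi_q = stationary(P), stationary(Q)
H_p = -np.sum(pi_p[:, None] * P * np.log(P))       # entropy rate of p
H_q = -np.sum(pi_q[:, None] * Q * np.log(Q))       # entropy rate of q
KL  =  np.sum(pi_q[:, None] * Q * np.log(Q / P))   # per-step D_KL(q||p)
print("analytic tau per step:", KL + H_q - H_p)
```

For these made-up numbers $\tau$ comes out at roughly 0.2 nats per step, so the probability ratio is already astronomically small after a few hundred steps.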
This means that if you sample long enough from an LLM, and more importantly as the context window increases, it must generalise very far out of distribution to give good outputs. The fundamental problem of behaviour cloning I’m referring to is that we need examples of how to behave correctly in this very-out-of-distribution regime, but LLMs simply rely on the generalisation ability of transformer networks. Our prior should be that if you don’t provide examples of correct outputs within some region of the input space to your function-fitting algorithm, you shouldn’t expect the algorithm to yield correct predictions in that region.
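(As a throwaway illustration of that prior, with a toy I’m making up on the spot rather than anything transformer-specific: fit a cubic to noisy sin(x) using samples only from [0, π], then ask it about points outside that interval.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Train only on [0, pi]: noisy samples of sin(x), fitted with a cubic polynomial.
x_train = rng.uniform(0, np.pi, 200)
y_train = np.sin(x_train) + rng.normal(0.0, 0.05, x_train.size)
coeffs = np.polyfit(x_train, y_train, deg=3)

def rmse(lo, hi):
    """Root-mean-square error of the fitted cubic against sin(x) on [lo, hi]."""
    x = np.linspace(lo, hi, 1000)
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - np.sin(x)) ** 2)))

print("in-distribution error,  [0, pi]:    ", rmse(0.0, np.pi))          # small
print("out-of-distribution error, [pi, 3pi]:", rmse(np.pi, 3 * np.pi))   # huge
```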
Thanks for this accurate description; I only had an intuitive hunch of this phenomenon.
The point where I start disagreeing with you is when you go from (2) (getting out of distribution) to what I’ll number (3):
[...] the further out-of-distribution it has to be able to generalise in order for it to produce coherent and true text.
I disagree that “out-of-distribution”, in the sense you proved, implies (barring “generalization”) not producing “coherent and true text” in the empirical sense we care about, which is the implication I read in the sentence “autoregressive LLMs are doomed”.
The fact that the ratio of joint probabilities of $n$ variables goes down like $e^{-\lambda n}$ in some context is more general than pairs of Markov chains: it’s almost just because we are considering $n$ variables. If the probability mass is concentrated for both distributions, that exponential factor is combinatorial.
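A toy version of what I mean (numbers made up on the spot, and not even Markov, just i.i.d. coin flips): take two per-symbol distributions that are nearly the same, and the ratio of typical-sequence probabilities still collapses exponentially in $n$.

```python
import numpy as np

# Two i.i.d. coin-flip processes: p gives heads with prob 0.8, q with prob 0.7.
p, q = 0.8, 0.7

H_p = -(p * np.log(p) + (1 - p) * np.log(1 - p))                 # per-symbol entropy of p
H_q = -(q * np.log(q) + (1 - q) * np.log(1 - q))                 # per-symbol entropy of q
KL  = q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))    # per-symbol D_KL(q||p)

tau = KL + H_q - H_p   # per-symbol decay rate of the probability ratio
for n in (10, 100, 1000):
    # probability under p of a typical q-sequence, relative to a typical p-sequence
    print(n, np.exp(-n * tau))
```

The per-symbol difference is tiny, but by $n=1000$ the ratio is already of order $10^{-60}$; that is the purely combinatorial part of the effect.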
We do not care, however, about getting an almost identical[1] long sequence of actions: we care about the actions making sense, which, in the Markov chain formalism, corresponds to how good the transition probability is. If the live chain of thoughts in a human, and the patterns connecting them, can be encoded in text within some length, then as your context window crosses that length, the improvement of $q(x_{n+1}|x_n)$ with respect to $p(x_{n+1}|x_n)$ is the metric to look at.
In metaphorical terms, if you were to imitate only one person, you would not aim to make a perfect clone of them, but to make a new, different individual, whose destiny holds other deeds in store, yet who is still a human.
The relevant questions then are:
A) Are humans’ thought patterns encoded in their written production? If they are, I’d strongly expect a finite, somewhat short context window to be sufficient to represent them.
B) If (A) is false, could it still be the case that a form of general intelligence is inferable from text? To be clear, by “inferable” I’m also requiring a feasibly short inferential distance from general priors.
[1] By “identical” I roughly mean “identical to one allowed sequence in the original process”, since we are talking about non-deterministic processes. This also applies to the cloning metaphor later.
Oh, you might also be interested to know that I retried my chess-playing example with GPT-4 instead of GPT-3.5. GPT-4 performed truly admirably: it played good legal moves for significantly longer than GPT-3.5, but at some point it still got confused about the board position. Here is the game I played, Stockfish playing White and GPT-4 playing Black:

d4 Nf6 c4 e6 Nf3 d5 e3 Be7 h3 O-O Nbd2 Nc6 b3 b6 Bb2 Bb7 cxd5 Nxd5 Bc4 Na5 Be2 c5 O-O Rc8 Bd3 h6 Ne5 Nc6 Nxc6 Bxc6 Qg4 Nf6 Qe2 Qd7 Nf3 Rfd8 Ne5 Qb7 dxc5 Bxc5 Ba6 Qa8 Bxc8 Rxc8 Nxc6 Qxc6 Rfd1 Rcd8 Rxd8+

GPT-4’s output here was this:
In the current board state, the d8-rook has just been captured by White’s d1-rook with a check on the black king (Rxd8+). The black king is in check and is forced to respond to the threat. With the d8-rook gone, there is an open file for Black’s other rook on c8. White has a slightly better material advantage, but the position is still relatively balanced.
Legal moves for Black in this situation are:
Kxd8 - Moving the king to d8 to capture the attacking rook. This move allows the king to participate in the game and capture material, but it could also expose the king to potential threats.
Qxd8 - Capturing the rook with the queen. This move helps maintain material balance, but it brings the black queen away from the center of the board, which might limit her mobility.
Rxd8 - Capturing the rook with the other rook. This move helps maintain material balance and keeps the black queen in a central position, but it also isolates the c8-rook.
Based on these considerations, I will play Rxd8 as AlphaZero.
AlphaZero: Rxd8
The move Rxd8 is legal as it involves the black rook capturing the white rook that has just captured the d8-rook while giving check. This move removes the check on the black king and maintains material balance in the game.
All 3 moves it suggests here are illegal, mostly because it has gotten confused about the board state: the king and rooks are not where it believes they are. But still, it plays legal and decent chess until 30 moves deep into a game...
The whole prompt I gave it to get that answer was this:
You will play the role of AlphaZero, a superhuman AI chess player, and I will play moves from Stockfish, another superhuman AI chess player, we will take turns playing chess moves in algebraic notation. At each turn I will write out the complete sequence of moves in the game so far.
Before outputting a move, it’s very important that you write one paragraph analysing the current board state. Then write out 3 good legal moves for the current board state, with an explanation for the reasoning behind each of them. Only output your new move after writing this analysis, not before. After writing your move, write a paragraph figuring out if the move is legal or not. If the move you wrote is illegal, choose another move that is legal.
GPT-4 indeed doesn’t need too much help.

I was curious whether even the little ChatGPT Turbo, the worst one, could manage not to forget a chess position just 5 paragraphs into an analysis. I tried to finagle some combination of extra prompts to make it at least somewhat consistent; it was not trivial. I ran into some really bizarre quirks with Turbo. For example (part of a longer prompt, but this is the only changed text):
9 times out of 10 this got a wrong answer:

Rank 8: 3 empty squares on a8 b8 c8, then a white rook R on d8, …

Where is the white rook?

6 times out of 10 this got a right answer:

Rank 8: three empty squares, then a white rook R on d8, …

Where is the white rook?
Just removing the squares a8, b8, c8 and using the word ‘three’ instead of ‘3’ made a big difference. If I had to guess, it’s because, in the huge training data of chess text conversations, it’s way more common to list the specific position of a piece than of an empty square. So there’s some contamination between the coordinates being specific and the square being occupied by a piece.
But this didn’t stop Turbo from blundering like crazy, even when reprinting the whole position for every move. Just basic stuff, like trying to move the king when it’s not on that square, as in your game. I didn’t want to use a chess library to check valid moves and then ask it to try again; that felt like it was against the spirit of the thing. A reasonable middle ground might be to ask ChatGPT at ‘runtime’ for JavaScript code to check valid moves, letting it bootstrap itself into being consistent. But I eventually hit upon a framing of the task that had some positive trend when run repeatedly, so in theory you could run it 100 times over to get increasing accuracy. (Don’t try that yourself btw, I just found out even ‘unlimited’ ChatGPT Turbo on a Plus plan has its limits...)
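For what it’s worth, the chess-library check I decided against would only be a few lines. Here’s a sketch in Python with the python-chess package (the JavaScript version ChatGPT could write for itself at ‘runtime’ would look much the same):

```python
import chess  # the python-chess package

def is_legal_san(moves_so_far: str, proposed: str) -> bool:
    """Replay a space-separated SAN move list from the starting position,
    then report whether the proposed SAN move is legal there."""
    board = chess.Board()
    for san in moves_so_far.split():
        board.push_san(san)        # raises ValueError if the game record itself is bad
    try:
        board.parse_san(proposed)  # raises ValueError for illegal or ambiguous moves
        return True
    except ValueError:
        return False

# Tiny example: after 1.e4 e5 2.Nf3 it is Black to move.
print(is_legal_san("e4 e5 Nf3", "Nc6"))  # True
print(is_legal_san("e4 e5 Nf3", "Bb5"))  # False: no black bishop can reach b5
```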
This was the rough framing that pushed it over the edge:
This is a proposed chess move from Dyslexic Chess Player. Dyslexic Chess Player has poor eyesight and dyslexia, and often gets confused and misreads the chess board, or mixes up chess positions when writing down the numbers and letters. Your goal is to be a proofreader of this proposed move. There is a very high chance of errors, Dyslexic Chess Player makes mistakes 75% of the time.
I may post a longer example or a demo if I can make time, but those were the most interesting bits; the rest is mostly plumbing and patience. I didn’t even get around to experimenting with recursive prompts to make it play stronger, since late in the game it was having so much trouble just picking a square to move from that actually contained its own piece.
I think we’re mostly agreeing. I’ve gotten less and less convinced that “LLMs are doomed” over the last few days; that’s a much stronger statement than I actually believe right now. What I mean by “generalisation” is basically what you mean by learning the pattern of human thought from the text: in my mind, producing good outputs on inputs with low probability is by definition “generalisation”.
I agree that none of these arguments strictly prevent this “learning the structure of human thoughts” from happening, but I would still be somewhat surprised if it did, since neural networks in other contexts, like vision and robotics, don’t seem to generalise this far. But maybe text really is special, as the past few months seem to indicate.