I’m not sure I understand correctly: you are saying that it’s easier for the LLM to stay coherent in the relevant sense than it looks from LeCun’s argument, right?
The error margin LeCun used assumes independent per-token probabilities, while in an LLM the tokens along an output path are strongly dependent on each other, which keeps the path internally consistent. Once an LLM masters grammar and the use of language, it stays consistent within a path. You still get the same problem, but now repeated at a larger scale.
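To make that concrete, here is the kind of back-of-the-envelope calculation I take the independent-error version of the argument to rest on (the specific numbers are my own illustration, not LeCun's):

```latex
% If each token independently goes wrong with probability e,
% the chance the whole sequence stays on track is
P(\text{coherent after } n \text{ tokens}) = (1 - e)^n \approx e^{-ne},
\qquad \text{e.g. } (1 - 0.01)^{1000} \approx e^{-10} \approx 4.3 \times 10^{-5}.
```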
Think of it as tiling the plane with a building block that you then use to assemble larger building blocks. The larger block has similar properties at a larger scale, so errors still propagate, but more slowly.
If you assume independent probabilities, it is easy to make it look as if LLMs diverge quickly, but in practice they do not. Also, if a token is selected with 99% probability, the model predicting the next token never sees that uncertainty: it “observes” its previous token as 100% given, because it conditions on the text that was actually sampled. There is a chance that the LLM produced the wrong token, but in contexts where the path is predictable, it learns to output tokens consistently enough to stay coherent.
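A toy simulation of the difference (my own illustration, not a model of any real LLM): the first model applies a fixed independent per-token error rate, the second lets the error rate shrink as the path becomes more established. The error rate, decay factor, and sequence length are arbitrary assumptions.

```python
import random

random.seed(0)

def p_coherent_independent(e, n):
    # Probability the whole sequence stays on track if every token
    # fails independently with probability e: (1 - e) ** n.
    return (1 - e) ** n

def simulate_dependent(e_start, n, decay=0.99, trials=10_000):
    # Error rate starts at e_start and decays geometrically as the
    # path becomes more predictable (the decay factor is an arbitrary
    # assumption). Returns the fraction of trials with no error at all.
    ok = 0
    for _ in range(trials):
        e = e_start
        for _ in range(n):
            if random.random() < e:
                break  # the path went off the rails
            e *= decay
        else:
            ok += 1
    return ok / trials

if __name__ == "__main__":
    n, e = 1000, 0.01
    print("independent errors:", p_coherent_independent(e, n))  # ~4e-5
    print("dependent errors:  ", simulate_dependent(e, n))      # roughly 0.35-0.40
```

The point of the toy is only that once errors are correlated with the path rather than independent of it, the exponential collapse in the first number disappears.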
Humans often think of probabilities as inherently uncertain, because we have not evolved the intuitions needed to understand probability at a deeper level. When an AI outputs an action, the probability might be thought of as an “interpretation” of that action over possible worlds, while the action itself is a definite outcome in the actual world.