> it’s hard to point at any particular algorithmic improvement or model and say that it was key to the success of modern LLMs
I actually looked into that recently. My initial guess was that this was about “the context window” as a concept: it lets the model keep vast volumes of task-relevant information around, including the outputs of its own past computations, without lossily compressing that information into a small fixed representation (as an RNN does). I asked OpenAI’s DR about it, and its output seems to support that guess.
In retrospect, it makes sense that this would work better. If you don’t know what challenges you’re going to face in the future, you don’t necessarily know which pieces of past information to keep around, so squeezing everything into a fixed-size internal state was always a bad bet.
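Here is a minimal NumPy sketch of that contrast (toy random weights, not a trained model, and purely illustrative): an RNN-style memory compresses the whole history into a fixed-size state before it knows what will be asked of it, while an attention-style memory keeps every past position addressable and decides at query time which parts matter.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_state = 16, 8, 4
tokens = rng.normal(size=(seq_len, d_model))  # toy token embeddings

# --- RNN-style memory: the entire past gets squeezed into a fixed-size state ---
W_in = rng.normal(size=(d_model, d_state)) * 0.1   # hypothetical toy weights
W_rec = rng.normal(size=(d_state, d_state)) * 0.1

state = np.zeros(d_state)
for x in tokens:
    # Every step lossily overwrites whatever was stored before.
    state = np.tanh(x @ W_in + state @ W_rec)

print("RNN summary of the whole sequence:", state.shape)  # (4,) no matter how long the input is

# --- Attention-style memory: every past token stays around in the context ---
W_q = rng.normal(size=(d_model, d_state)) * 0.1
W_k = rng.normal(size=(d_model, d_state)) * 0.1
W_v = rng.normal(size=(d_model, d_state)) * 0.1

q = tokens[-1] @ W_q              # query from the current position
K = tokens @ W_k                  # keys for *all* past positions, kept uncompressed
V = tokens @ W_v

scores = K @ q / np.sqrt(d_state)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ V             # the model picks *now* which past info is relevant

print("attention keeps", K.shape[0], "past positions addressable")
```

The point of the sketch is the shape mismatch: the RNN’s `state` stays at `d_state` entries no matter how long the sequence gets, whereas the attention path can look back at all `seq_len` positions and choose what to retrieve only once the current query is known.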