There was a type of guy circa 2021 who basically said that GPT-3 etc. was cool, but that we should be cautious about assuming everything was going to change, because the context limitation was a key bottleneck that might never be overcome. That guy’s take was briefly “discredited” in subsequent years when LLM companies increased context lengths to 100k or 200k tokens.
I think that was premature. The context limitations (in particular the lack of an equivalent to human long-term memory) are the key deficit of current LLMs, and we haven’t really seen much improvement at all.
Why do you think they haven’t improved much?
I agree that the “effective” high-understanding context length—as in, what the model can really process more deeply than just doing retrieval on small chunks—is substantially less than 200k tokens. But why do you think this hasn’t improved much?
I think our best proxy for this is how well AIs do on long agentic tasks, and this appears to be improving pretty fast. It isn’t a very isolated proxy, since it includes other aspects of capabilities, but performance on these tasks does seem like it should be upper-bounded by effective use of long context, at least to the extent that effective use of long context is a bottleneck for anything.
I think some long tasks are like a long list of steps where each step only requires the output of the most recent one, so they don’t really need long context; AI improves at those just by becoming more reliable and making fewer catastrophic mistakes. On the other hand, some tasks need the AI to remember and learn from everything it’s done so far, and that’s where it struggles: see how Claude Plays Pokémon gets stuck in loops and has to relearn things dozens of times.
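To make that distinction concrete, here’s a minimal sketch of the two kinds of loops. The `run_llm` callable is a hypothetical stand-in for any model API, and the character-count cutoff is a crude stand-in for a token limit; this is an illustration of the structural difference, not anyone’s actual agent scaffold.

```python
def chain_task(steps, run_llm):
    """Each step only needs the previous step's output, so the prompt stays small.
    Per-step reliability, not context length, is the bottleneck here."""
    result = ""
    for step in steps:
        result = run_llm(f"Instruction: {step}\nPrevious output: {result}")
    return result


def memory_task(steps, run_llm, context_limit=200_000):
    """Each step needs everything done so far, so the prompt grows without bound.
    Once the history exceeds the context window, earlier experience is simply lost."""
    history = []
    for step in steps:
        prompt = "\n".join(history) + f"\nInstruction: {step}"
        if len(prompt) > context_limit:       # crude proxy for a token limit
            prompt = prompt[-context_limit:]  # truncation = forgetting early steps
        output = run_llm(prompt)
        history.append(f"Step: {step}\nOutput: {output}")
    return history
```

The first loop gets better purely as the model makes fewer catastrophic mistakes per step; the second hits a wall once the accumulated history no longer fits, which is the Pokémon-style failure mode.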
I haven’t read the METR paper in full, but from the examples given I’m worried the tests might be biased in favor of an agent with no capacity for long-term memory, or at least might not hit the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (it takes an hour). But it’s also something that could plausibly be done with very little context: if the AI just puts all of the example files in its context window, it might be able to write the rest of the decoder from scratch. It might not even need to keep the example files in memory while it’s debugging its project against the test cases.
Whereas a task to fix a bug in a large software project, while it might take an engineer associated with that project “an hour” to finish, requires either stretching the limits of how much information the model can fit inside its context window, or recall beyond what current systems seem capable of today.
This seems plausible to me, but I could also imagine the opposite being true: my working memory is way smaller than the context window of most models. LLMs would destroy me at a task which “merely” required memorizing 100k tokens and not doing any reasoning; I would do comparatively better at a project which was fairly small but required a bunch of different steps.
Not just long context in general (that can be partially mitigated with RAG or even BM25/tf-idf search), but also nearly 100% factual accuracy on it, as I argued last week.
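For reference, here’s a minimal sketch of the kind of retrieval that can partially stand in for long context: split a document into chunks, rank them by tf-idf similarity to the query, and feed only the top few chunks to the model. It uses scikit-learn’s TfidfVectorizer; the fixed-size chunking and the function name are illustrative assumptions, and a real pipeline might use BM25 or an embedding index instead.

```python
# Minimal tf-idf retrieval sketch: instead of stuffing an entire document into the
# context window, rank chunks by similarity to the query and keep only the top few.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_chunks(document: str, query: str, chunk_size: int = 1000, k: int = 3):
    # Naive fixed-size character chunking; real pipelines usually split on
    # paragraphs or token counts instead.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

    vectorizer = TfidfVectorizer()
    chunk_vectors = vectorizer.fit_transform(chunks)
    query_vector = vectorizer.transform([query])

    # Score every chunk against the query and return the k most similar ones.
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

The retrieved chunks then get pasted into the prompt, which helps with “find the relevant passage” but does nothing for the long-term-memory problem above, and it only works to the extent the model is nearly 100% factually accurate on whatever it does retrieve.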