I’m curious to hear opinions on what I think is a crux of Leopold’s “Situational Awareness”:
picking the many obvious low-hanging fruit on “unhobbling” gains should take us from chatbots to agents, from a tool to something that looks more like drop-in remote worker replacements.[1]
This disagrees with my own intuition—the gap between chatbot and agent seems stubbornly large. He suggests three main angles of improvement:[2]
Large context windows allowing for fully “onboarding” LLMs to a job or task
Increased inference-time compute allowing for building ‘System 2’ reasoning abilities
Enabling full computer access
We already have pretty large context windows (which has been surprising to me, admittedly), but they’ve helped less than I expected: mostly I just don’t need to move relevant code right next to my cursor as much when using Copilot. I haven’t seen really powerful use cases; the closest is probably Devin, but that doesn’t work very well. Running large context windows over documents works reasonably well, but in my personal experience LLMs are too unreliable, too biased toward the generic, and too memoryless to get solid benefit out of it.
Put another way, I think large context windows are of pretty limited benefit when LLMs have poor working memory and can’t properly keep track of what they’re doing over the course of their output.
That leads into the inference-time compute argument, which is both the weakest and the most essential. As I understand it, the goal is to give LLMs a working memory, but how we get there seems really fuzzy. The idea presented is to produce OOMs more tokens and keep them on-track, but the “keep them on-track” part of his writing feels like merely a restatement of the problem to me. The only substantial suggestion I can see is this single line:
Perhaps a small amount of RL helps a model learn to error correct (“hm, that doesn’t look right, let me double check that”), make plans, search over possible solutions, and so on.[3]
And in a footnote on the same page he acknowledges:
Unlocking this capability will require a new kind of training, for it to learn these extra skills.
Not trivial or baked into current AI progress, I think? Maybe I’m misunderstanding something.
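To illustrate why “keep them on-track” feels like a restatement of the problem: the outer loop of error-correction is already easy to write. Here’s a minimal sketch of inference-time self-correction, with `llm` as a hypothetical stand-in for any model call (my own illustration, not anything from the paper):

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real API client here."""
    raise NotImplementedError

def solve_with_self_correction(task: str, max_rounds: int = 5) -> str:
    # Draft an answer, then spend extra inference-time compute checking it.
    answer = llm(f"Solve step by step:\n{task}")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "Double-check this. Reply VALID, or describe the error."
        )
        if critique.strip().startswith("VALID"):
            break  # the model judges its own output acceptable
        # "hm, that doesn't look right, let me double check that"
        answer = llm(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nProduce a corrected answer."
        )
    return answer
```

The scaffolding is the trivial part; the open problem is training the model so a loop like this actually converges instead of drifting off-track as the token count grows.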
As for enabling full computer access: yeah, multi-modal models should allow this within a few years, but it remains of limited benefit if the working memory problem isn’t solved.
[1] Page 9 of the PDF.
[2] Pages 34-37 of the PDF.
[3] Page 36 of the PDF.
I think this will be done via multi-agent architectures (“society of mind” over an LLM).
This does require plenty of calls to an LLM, so plenty of inference-time compute.
For example, the current leader of https://huggingface.co/spaces/gaia-benchmark/leaderboard is this relatively simple multi-agent concoction by a Microsoft group: https://github.com/microsoft/autogen/tree/gaia_multiagent_v01_march_1st/samples/tools/autogenbench/scenarios/GAIA/Templates/Orchestrator
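For a flavor of the pattern, here’s a very simplified orchestrator loop in the same spirit (my own sketch, not the actual AutoGen code; `llm` is again a hypothetical model call):

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real API client here."""
    raise NotImplementedError

# Illustrative worker roles; a real system would define many more.
AGENTS = {
    "coder": "You write and debug Python code for the given subtask.",
    "researcher": "You look up and summarize facts for the given subtask.",
}

def orchestrate(task: str, max_steps: int = 10) -> str:
    ledger: list[str] = []  # shared scratchpad standing in for working memory
    for _ in range(max_steps):
        plan = llm(
            f"Task: {task}\nProgress so far: {ledger}\n"
            f"Pick one agent from {list(AGENTS)} as 'name: subtask', "
            "or reply 'DONE: <answer>' if the task is complete."
        )
        if plan.startswith("DONE:"):
            return plan.removeprefix("DONE:").strip()
        name, _, subtask = plan.partition(":")
        role = AGENTS.get(name.strip(), "You are a general assistant.")
        # Every step is another LLM call: the inference-time compute adds up.
        ledger.append(llm(f"{role}\nSubtask: {subtask}\nContext: {ledger}"))
    return llm(f"Task: {task}\nGive the best final answer from: {ledger}")
```

The shared ledger is doing the “working memory” job the OP worries about, at the cost of many LLM calls per task.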
I think the cutting edge in this direction is probably non-public at this point (which makes a lot of sense).