I don’t think this path is easy, but immense effort and money will be directed at it by default, since there’s so much money to be made by replacing human labor with agents. And I think no breakthroughs are necessary, just work in fairly obvious directions. That’s why I think this is likely to lead to human-level agents.
I don’t think it would take insane amounts of compute, but compute costs will be substantial. They’ll be roughly like the costs for OpenAI’s Operator: it runs autonomously, making calls to frontier LLMs and vision models essentially continuously, and costs are low enough that $200/month covers unlimited use (although that thing is so useless people probably aren’t using it much). So the compute costs of o1 pro thinking away continuously are probably a better indicator; Altman said $200/mo doesn’t quite cover the average, driven by some users keeping as many sessions running constantly as they can.
For complex tasks, it can’t all be fit into a context window, and it’s costly even when the whole task would fit. That’s why additional memory systems are needed. Context-window management techniques are already in play for existing limited agents. And RAG systems already seem adequate to serve as episodic memory: humans use far fewer memory “tokens” to accomplish complex tasks than the large amounts of documentation stored in the RAG systems currently used for non-agentic retrieval-augmented generation, answering questions that rely on documented information.
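To make that concrete, here’s a minimal sketch of what RAG-as-episodic-memory might look like inside an agent loop. Everything here is illustrative: the `EpisodicMemory` class, its `store`/`recall` methods, and the bag-of-words similarity are stand-ins for a real embedding model plus vector store, not any particular system’s API.

```python
# Minimal sketch of RAG-style episodic memory for an agent (illustrative only).
# A real system would use learned embeddings and a vector database; a bag-of-words
# cosine similarity stands in here so the example stays self-contained.
import math
from collections import Counter


class EpisodicMemory:
    def __init__(self):
        self.episodes = []  # list of (summary text, bag-of-words vector)

    def store(self, text: str) -> None:
        """Write a short summary of a completed step/episode to memory."""
        self.episodes.append((text, Counter(text.lower().split())))

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Retrieve the k stored episodes most similar to the current situation."""
        q = Counter(query.lower().split())

        def cosine(a: Counter, b: Counter) -> float:
            dot = sum(a[w] * b[w] for w in a)
            norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
                sum(v * v for v in b.values())
            )
            return dot / norm if norm else 0.0

        ranked = sorted(self.episodes, key=lambda ep: cosine(q, ep[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = EpisodicMemory()
memory.store("Tried logging in with the cached token; it had expired, so I re-authenticated.")
memory.store("The export button only appears after the report finishes rendering.")
print(memory.recall("login failed with stale credentials", k=1))
```

The agent only needs to recall the handful of past episodes relevant to its current step, which is why the memory budget can be so much smaller than the full documentation dumps used in question-answering RAG.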
So I’d estimate something like $20-30 per day for an agent to run all day. That could come down a lot if many of its calls were routed to smaller, cheaper LLMs rather than whatever is the current latest and greatest.
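Here’s the back-of-the-envelope version of that estimate. Every number below is an assumption chosen to be in a plausible range (calls per minute, tokens per call, blended price per million tokens), not a quoted rate; the point is just that continuous frontier-model calls land in the tens of dollars per day.

```python
# Back-of-the-envelope daily cost for an always-on agent.
# Every number here is an assumption for illustration, not a quoted price.
calls_per_minute = 2             # assumed: one reasoning call plus one tool/vision call
tokens_per_call = 4_000          # assumed average prompt + completion tokens per call
price_per_million_tokens = 5.0   # assumed blended $/1M tokens for a frontier model
hours_per_day = 8                # one full "workday" of continuous operation

tokens_per_day = calls_per_minute * 60 * hours_per_day * tokens_per_call
cost_per_day = tokens_per_day / 1_000_000 * price_per_million_tokens
print(f"{tokens_per_day:,} tokens/day -> ${cost_per_day:.2f}/day")
# 3,840,000 tokens/day -> $19.20/day; routing calls to cheaper models scales this down.
```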
Humans train themselves to act agentically by assembling small skills (pick up the food and put it in your mouth, run forward, look for tracks) into long-time-horizon tasks (hunting). We do not learn by performing RL on long sequences and applying the learning to everything we did to get there. We do something like RL, but it’s tightly targeted on specific hypotheses we’ve produced about how to accomplish the task at hand.
Thinking about how humans learn new tasks provides a pretty direct analogy. We make explicit hypotheses about what we need to learn, then form specific strategies for learning it. That’s as an adult; the pretraining of the LLM gives roughly adult-level performance on simple tasks that were well represented in the training set.
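If you tried to mimic that loop in an agent scaffold, the control flow might look roughly like this sketch. The functions here are hypothetical placeholders: `propose_hypotheses` stands in for an LLM suggesting specific things to try, `run_trial` for executing one in the environment, and the returned list for a growing skill library. The point is that credit goes to the specific hypothesis that was tested, not to the whole trajectory.

```python
# Sketch of hypothesis-targeted learning, as opposed to RL over whole trajectories.
# All functions are hypothetical placeholders for LLM calls and environment execution.

def propose_hypotheses(task: str, failures: list[str]) -> list[str]:
    """Placeholder: an LLM would propose specific strategies, given past failures."""
    return [f"try strategy {i} for: {task}" for i in range(3)]


def run_trial(hypothesis: str) -> bool:
    """Placeholder: execute the hypothesis and score the outcome."""
    return "strategy 1" in hypothesis  # pretend exactly one strategy works


def learn_task(task: str, max_rounds: int = 5) -> list[str]:
    skills, failures = [], []
    for _ in range(max_rounds):
        for hypothesis in propose_hypotheses(task, failures):
            if run_trial(hypothesis):
                skills.append(hypothesis)    # credit this specific hypothesis...
            else:
                failures.append(hypothesis)  # ...not everything done along the way
        if skills:
            break
    return skills


print(learn_task("catch a Pokemon on the next route"))
```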
Claude playing Pokemon is a great illustration: it’s bad in large part because it has no episodic memory. It wouldn’t be great even with one; it would also need a real self-directed learning process. People have only barely started to implement these (to my limited knowledge; some company in stealth mode might be well along, but I doubt it, since they’d need to be public to get adequate funding for real progress).
Hallucinations are much less of an issue in current-gen LLMs than in older generations, but they’re still an issue. Agents would need to do what humans do: ask themselves “am I sure? How can I check?” for important pieces of information. The human brain hallucinates just like LLMs do if you go with the first answer that springs to mind, which is what LLMs usually do. You need to implement a routine for deciding which knowledge is important and for using multiple sources of information and further thinking to check whether it’s right. Humans do this only by learning cognitive strategies; kids do accept their hallucinations and so are pretty useless for getting things done :), just like current LLM agents.
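A crude version of that “am I sure? How can I check?” routine, sketched as a wrapper around a model call: sample the question several times, accept only an answer the model converges on, and otherwise escalate to an outside source. `ask_model`, `look_up_source`, and the consistency threshold are all assumptions for illustration, not any production technique.

```python
# Sketch of a self-check routine for important facts: sample the model several
# times, accept only an answer it converges on, otherwise fall back to a source
# lookup. ask_model and look_up_source are hypothetical placeholders.
import random
from collections import Counter


def ask_model(question: str) -> str:
    """Placeholder for a sampled, nondeterministic LLM call."""
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])


def look_up_source(question: str) -> str:
    """Placeholder for checking documentation, search, or another tool."""
    return "Paris"


def checked_answer(question: str, samples: int = 5, threshold: float = 0.8) -> str:
    answers = Counter(ask_model(question) for _ in range(samples))
    best, count = answers.most_common(1)[0]
    if count / samples >= threshold:
        return best                     # the model is consistent; accept the answer
    return look_up_source(question)     # inconsistent: verify against another source


print(checked_answer("What is the capital of France?"))
```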