Agency = Prediction + Decision.

AIXI is an idealized model of a superintelligent agent that combines “perfect” prediction (Solomonoff Induction) with “perfect” decision-making (sequential decision theory).
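For reference, Hutter's AIXI picks each action to maximize expected total reward under the Solomonoff prior, which weights every environment-program q consistent with the interaction history by 2^{-ℓ(q)}. A lightly simplified rendering of the standard equation (U is a universal Turing machine, m is the horizon, ℓ(q) is the length of program q):

```latex
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots\, \max_{a_m} \sum_{o_m r_m}
  \big[\, r_k + \cdots + r_m \,\big]
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The inner sum is the “perfect prediction” (Solomonoff Induction); the alternating max/sum is the “perfect decision-making” (expectimax over the horizon).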
OpenAI’s o1 is a real-world “reasoning model” that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).
To be clear: o1 is no AIXI. But AIXI, as an ideal, can teach us something about the future of o1-like systems.
AIXI teaches us that agency is simple. It involves just two raw ingredients: prediction and decision-making. And we know how to produce these ingredients. Good predictions come from self-supervised learning, an art we have begun to master over the last decade of scaling pretraining. Good decisions come from search, which has evolved from the explicit search algorithms that powered Deep Blue and AlphaGo to the implicit methods that drive AlphaZero and now o1.
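The two-ingredient recipe can be sketched in a few lines. Everything below is a made-up toy, not any real system: `predict` stands in for a learned model, and `expectimax` stands in for explicit search over actions.

```python
# Toy sketch: an "agent" is just a predictor plus search over actions.
# `predict` plays the role of a learned model; `expectimax` plays the
# role of decision-making. Both are illustrative stand-ins.

def predict(history, action):
    """Stand-in predictor: a distribution over (observation, reward) pairs."""
    if action == 1:  # in this toy world, action 1 is usually rewarded
        return [((1, 1.0), 0.8), ((0, 0.0), 0.2)]
    return [((0, 0.0), 0.9), ((1, 1.0), 0.1)]

def expectimax(history, depth):
    """Search: choose the action maximizing expected cumulative reward."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in (0, 1):
        value = 0.0
        for (obs, reward), prob in predict(history, action):
            future_value, _ = expectimax(history + [(action, obs)], depth - 1)
            value += prob * (reward + future_value)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

value, action = expectimax([], depth=3)
print(action)  # 1: the search settles on the action the predictor favors
```

Swap in a better predictor or a deeper search and the same skeleton yields a better agent; that modularity is the point.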
So let’s call “reasoning models” like o1 what they really are: the first true AI agents. It’s not tool-use that makes an agent; it’s how that agent reasons. Bandwidth comes second.
Simple does not mean cheap: pretraining is an industrial process that costs (hundreds of) billions of dollars. Simple also does not mean easy: decision-making is especially difficult to get right, since amortizing search (i.e., training a model to perform search implicitly) requires RL, which is notoriously tricky.
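A caricature of what amortizing search means, in the spirit of AlphaZero-style expert iteration (a hypothetical illustration, not a description of how o1 was trained): run expensive explicit search to produce good decisions, then train a policy to reproduce them directly, so that at test time one cheap lookup replaces the search.

```python
# Caricature of amortized search: distill an expensive explicit search
# into a policy that acts without searching. The "policy" here is a
# lookup table standing in for a trained network; everything is a toy.

def reward(state, action):
    return -abs(state - action)  # the best action is to match the state

def explicit_search(state):
    """Expensive per-state search: try every action, keep the best."""
    return max(range(4), key=lambda a: reward(state, a))

# Distillation step: the search's outputs become the policy's training data.
policy = {state: explicit_search(state) for state in range(4)}

# At "test time" the policy answers instantly, no search required.
print(policy[2])  # 2
```

The hard part that this toy hides is the training step: with a real model and a real environment, fitting the policy to its own search traces is an RL loop, with all the instability that implies.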
Simple does mean scalable. The original scaling laws taught us how to exchange compute for better predictions. The new test-time scaling laws teach us how to exchange compute for better decisions. AIXI may still be a ways off, but we can see at least one open path that leads closer to that ideal.
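The compute-for-decisions trade can be sketched with a best-of-N toy (hypothetical stand-ins throughout; o1's actual mechanism is not public): sample N candidates from a fixed model, score each with a verifier, keep the best. The model never changes; only the test-time compute N does.

```python
import random

random.seed(0)

# Toy sketch of test-time scaling: the model is fixed; only the number
# of samples N (compute) varies. `sample_solution` and `verify` are
# made-up stand-ins for a model's samples and a verifier.

def sample_solution():
    """Stand-in for sampling one candidate solution from a fixed model."""
    return random.random()  # latent "quality" in [0, 1]

def verify(solution):
    """Stand-in verifier that scores a candidate."""
    return solution

def best_of_n(n):
    """Spend n samples of compute, keep the best-scoring candidate."""
    return max(verify(sample_solution()) for _ in range(n))

for n in (1, 4, 16, 64):
    avg = sum(best_of_n(n) for _ in range(2000)) / 2000
    print(f"N={n:2d}  average best score = {avg:.2f}")
```

In this toy the best of N uniform draws scores N/(N+1) in expectation, so quality climbs smoothly with compute, with diminishing returns: a miniature test-time scaling law.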
The bitter lesson, per Rich Sutton, is that “general methods that leverage computation [such as search and learning] are ultimately the most effective, and by a large margin.” The lesson from AIXI is that maybe these are all you need. The lesson from o1 is that maybe all that’s left is just a bit more compute...
We still don’t know the exact details of how o1 works. For hypotheses about what might be going on, and further discussion of the implications for scaling and recursive self-improvement, see my recent post, “o1: A Technical Primer.”
You are skipping over a very important component: evaluation.
Evaluation is exactly what we don’t know how to do well enough outside of formally verifiable domains like math and code, which is exactly where o1 shows its biggest performance jumps.
So let’s call “reasoning models” like o1 what they really are: the first true AI agents.
I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between “reasoning” and “chat” models, and I’d prefer to use “agent” for that distinction.
I do think that “reasoning” is a bit of a market-y name for this category of system though. “chat” vs “base” is a great choice of words, and “chat” is basically just a description of the RL objective those models were trained with.
If I were the terminology czar, I’d call o1 a “task” model.