Agency = Prediction + Decision.
AIXI is an idealized model of a superintelligent agent that combines “perfect” prediction (Solomonoff Induction) with “perfect” decision-making (sequential decision theory).
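For reference, Hutter's formulation selects actions by an expectimax over all environment programs consistent with the interaction history, weighted by a Solomonoff-style prior (reproduced here from the standard statement of AIXI):

$$
a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_t + \cdots + r_m\big] \sum_{q\,:\,U(q,\,a_1 \ldots a_m)\,=\,o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
$$

where $U$ is a universal Turing machine, $q$ ranges over programs, and $\ell(q)$ is program length. The inner sum over programs is the "perfect prediction"; the alternating max/sum is the "perfect decision-making."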
OpenAI’s o1 is a real-world “reasoning model” that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).
To be clear: o1 is no AIXI. But AIXI, as an ideal, can teach us something about the future of o1-like systems.
AIXI teaches us that agency is simple. It involves just two raw ingredients: prediction and decision-making. And we know how to produce these ingredients. Good predictions come from self-supervised learning, an art we have begun to master over the last decade of scaling pretraining. Good decisions come from search, which has evolved from the explicit search algorithms that powered Deep Blue and AlphaGo to the implicit methods that drive AlphaZero and now o1.
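To make the two-ingredient claim concrete, here is a hypothetical toy sketch (not how o1 actually works) of an agent built from nothing but a learned predictor plus an explicit search over candidate plans; every function name here is made up for illustration:

```python
import random
from typing import Callable, Dict, List, Sequence

# A "predictor" maps an interaction history to a distribution over next observations.
# In practice this would be a pretrained LLM / world model; here it is any callable.
Predictor = Callable[[Sequence[str]], Dict[str, float]]

def rollout_value(predict: Predictor, history: List[str], plan: List[str],
                  reward: Callable[[str], float], horizon: int, samples: int = 8) -> float:
    """Prediction: estimate a plan's value by sampling futures from the predictor."""
    total = 0.0
    for _ in range(samples):
        h = list(history) + list(plan)
        for _ in range(horizon):
            dist = predict(h)  # {observation: probability}
            obs = random.choices(list(dist), weights=list(dist.values()))[0]
            total += reward(obs)
            h.append(obs)
    return total / samples

def act(predict: Predictor, history: List[str], candidate_plans: List[List[str]],
        reward: Callable[[str], float], horizon: int = 4) -> List[str]:
    """Decision-making: explicit search over candidate plans, keep the best one."""
    return max(candidate_plans,
               key=lambda plan: rollout_value(predict, history, plan, reward, horizon))
```

Amortizing the search, in this picture, just means training the predictor itself to propose good plans, so that the explicit loop above shrinks or disappears at inference time.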
So let’s call “reasoning models” like o1 what they really are: the first true AI agents. It’s not tool-use that makes an agent; it’s how that agent reasons. Bandwidth comes second.
Simple does not mean cheap: pretraining is an industrial process that costs (hundreds of) billions of dollars. Simple also does not mean easy: decision-making is especially difficult to get right, since amortizing search (that is, training a model to perform the search implicitly) requires RL, which is notoriously tricky.
Simple does mean scalable. The original scaling laws taught us how to exchange compute for better predictions. The new test-time scaling laws teach us how to exchange compute for better decisions. AIXI may still be a ways off, but we can see at least one open path that leads closer to that ideal.
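The shape of that exchange is the familiar power law (generic form only; the constants differ across studies, and the test-time curves plot accuracy against thinking tokens rather than loss against training compute):

$$
L(C) \;\approx\; L_\infty + \left(\frac{C_0}{C}\right)^{\alpha}
$$

where $C$ is compute, $L_\infty$ is the irreducible loss, and $\alpha$ is a small positive exponent.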
The bitter lesson is that “general methods that leverage computation [such as search and learning] are ultimately the most effective, and by a large margin.” The lesson from AIXI is that maybe these are all you need. The lesson from o1 is that maybe all that’s left is just a bit more compute...
We still don't know the exact details of how o1 works. If you're interested in reading about hypotheses for what might be going on and further discussion of the implications for scaling and recursive self-improvement, see my recent post, "o1: A Technical Primer."
You are skipping over a very important component: Evaluation.
Which is exactly what we don’t know how to do well enough outside of formally verifiable domains like math and code, which is exactly where o1 shows big performance jumps.
Marcus Hutter on AIXI and ASI safety
I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between “reasoning” and “chat” models, and I’d prefer to use “agent” for that distinction.
I do think that “reasoning” is a bit of a market-y name for this category of system though. “chat” vs “base” is a great choice of words, and “chat” is basically just a description of the RL objective those models were trained with.
If I were the terminology czar, I’d call o1 a “task” model.
When people complain about LLMs doing nothing more than interpolation, they’re mixing up two very different ideas: interpolation as intersecting every point in the training data, and interpolation as predicting behavior in-domain rather than out-of-domain.
With language, interpolation-as-intersecting isn’t inherently good or bad—it’s all about how you do it. Just compare polynomial interpolation to piecewise-linear interpolation (the thing that ReLUs do).
Neural networks (NNs) are biased towards fitting simple piecewise-linear functions, which is (locally) the least biased way to interpolate. The simplest function that intersects two points is the straight line.
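A quick way to see the polynomial-versus-piecewise-linear contrast in code (just a NumPy sketch for intuition; the exact numbers depend on the noise and the seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# A handful of noisy samples of a smooth function.
x = np.linspace(-1, 1, 15)
y = np.sin(3 * x) + 0.05 * rng.normal(size=x.size)
x_dense = np.linspace(-1, 1, 400)

# Polynomial interpolation: a degree n-1 polynomial through all n points.
# The high degree amplifies the noise into large wiggles, especially near the ends.
poly = np.polynomial.Polynomial.fit(x, y, deg=x.size - 1)
y_poly = poly(x_dense)

# Piecewise-linear interpolation: straight lines between neighbours,
# the kind of fit a ReLU network is biased towards.
y_pl = np.interp(x_dense, x, y)

print("max |polynomial fit|      :", np.abs(y_poly).max())
print("max |piecewise-linear fit|:", np.abs(y_pl).max())
```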
In reality, we don't even train LLMs long enough to hit that interpolation threshold. In this under-interpolated sweet spot, NNs seem to learn features from coarse to fine with increasing model size. E.g.: https://arxiv.org/abs/1903.03488
Bonus: this is what’s happening with double descent: Test loss goes down, then up, until you reach the interpolation threshold. At this point there’s only one interpolating solution, and it’s a bad fit. But as you increase model capacity further, you end up with many interpolating solutions, some of which generalize better than others.
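If you want to poke at this yourself, the classic toy setup is minimum-norm least squares on random ReLU features: sweep the number of features past the number of training points and, in typical runs, the test error spikes near the interpolation threshold before falling again. A hypothetical sketch, not tied to any particular paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.3):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(4 * x[:, 0]) + noise * rng.normal(size=n)
    return x, y

def relu_features(x, w, b):
    return np.maximum(x @ w + b, 0.0)

n_train = 40
x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(500)

for n_feat in [5, 10, 20, 40, 80, 160, 640]:
    w = rng.normal(size=(1, n_feat))
    b = rng.normal(size=n_feat)
    phi_tr, phi_te = relu_features(x_tr, w, b), relu_features(x_te, w, b)
    # pinv gives the minimum-norm least-squares fit; past n_feat = n_train it interpolates.
    coef = np.linalg.pinv(phi_tr) @ y_tr
    test_mse = np.mean((phi_te @ coef - y_te) ** 2)
    print(f"{n_feat:4d} features -> test MSE {test_mse:.3f}")
```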
Meanwhile, on the interpolation-versus-extrapolation reading, NNs can and do extrapolate outside the convex hull of the training samples. Again, the bias towards simple linear extrapolations is locally the least biased option. There's no beating the polytopes.
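The "simple linear extrapolation" claim is easy to verify in one dimension: beyond the outermost kink of a one-hidden-layer ReLU net, every unit is either fully on or fully off, so the function is exactly affine out there. A tiny check, using a random untrained net just to show the geometry:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random one-hidden-layer ReLU net: f(x) = w2 . relu(w1*x + b1) + b2.
w1, b1 = rng.normal(size=64), rng.normal(size=64)
w2, b2 = rng.normal(size=64), rng.normal()

def f(x):
    return np.maximum(np.outer(x, w1) + b1, 0.0) @ w2 + b2

# Each unit's kink sits at x = -b1/w1; beyond the outermost kink the net is affine.
kinks = -b1 / w1
xs = np.linspace(kinks.max() + 1, kinks.max() + 100, 50)
slopes = np.diff(f(xs)) / np.diff(xs)
print(np.allclose(slopes, slopes[0]))  # True: straight-line extrapolation
```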
Here I've presented the visuals in terms of regression, but the story is pretty similar for classification, where the function being fit is a decision boundary. In this case, there's extra pressure to maximize margins, which further encourages generalization.
The next time you feel like dunking on interpolation, remember that you just don’t have the imagination to deal with high-dimensional interpolation. Maybe keep it to yourself and go interpolate somewhere else.