Probably the most important blog post I’ve read this year.
Tl;Dr
Direct optimisers: systems that during inference directly choose actions to optimise some objective function.
E.g. AIXI, MCTS, other planning
Direct optimisers perform inference by answering the question: “what output (e.g. action/strategy) maximises or minimises this objective function ([discounted] cumulative return and loss respectively)?”
Amortised optimisers: systems that learn to approximate some function during training and perform inference by evaluating the output of the approximated function on their inputs.
E.g.: model free RL, LLMs, most supervised & self supervised(?) learning systems
Amortised optimisers can be seen as performing inference by answering the question: “what output (e.g. action, probability distribution over tokens) does this learned function (policy, predictive model) return for this input (agent state, prompt)?”
Amortised optimisers evaluate a learned function; they don’t argmax/argmin anything at inference time (see the sketch after this list for a toy contrast).
[It’s called “amortised optimisation” because while learning the policy is expensive, the cost of inference is amortised over all evaluations of the learned policy.]
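To make the contrast concrete, here is a minimal toy sketch of the two inference styles (my own illustration, not code from Beren’s post; the bandit-style environment, reward function, and lookup-table “policy” are all invented for the example). The direct optimiser argmaxes the objective at inference time; the amortised one pays the optimisation cost during training and then only evaluates the function it learned.

```python
import random

# Toy contrast between direct and amortised inference. States are the
# integers 0-9, actions are 0-4, and reward(state, action) is the objective.
ACTIONS = list(range(5))


def reward(state, action):
    # Ground-truth objective; the direct optimiser queries it (via a model
    # or simulator) at inference time.
    return -abs(action - (state % 5))


def direct_act(state):
    # Direct optimisation: inference answers "which action maximises the
    # objective for this state?" via explicit search plus argmax.
    return max(ACTIONS, key=lambda a: reward(state, a))


def train_amortised_policy(num_samples=10_000):
    # Amortised optimisation: the expensive part happens here, during
    # training; experience is distilled into a lookup-table "policy".
    best = {}  # state -> (best reward seen, action)
    for _ in range(num_samples):
        s, a = random.randrange(10), random.choice(ACTIONS)
        r = reward(s, a)
        if s not in best or r > best[s][0]:
            best[s] = (r, a)
    return {s: a for s, (r, a) in best.items()}


POLICY = train_amortised_policy()


def amortised_act(state):
    # Amortised inference answers "what does my learned policy return for
    # this input?" It is a plain function evaluation, with no argmax over
    # the objective.
    return POLICY.get(state, ACTIONS[0])


if __name__ == "__main__":
    print([direct_act(s) for s in range(10)])
    print([amortised_act(s) for s in range(10)])
```

Both agents end up acting well here, but only the first one is doing any optimisation at the moment it acts.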
Some Commentary
Direct optimisation is much more sample-efficient (an MCTS chess program can achieve optimal play with sufficient compute given only the rules of chess; an amortised chess program necessarily needs millions of games to learn from).
Direct optimisation is feasible in simple, deterministic, discrete, fully observable environments (e.g. tic-tac-toe, chess, go) but unwieldy in complex, stochastic, high-dimensional environments (e.g. the real world). Some of the limitations of direct optimisation in rich environments seem complexity-theoretic, so better algorithms won’t fix them.
In practice, some systems use a hybrid of the two approaches, with most cognition performed in an amortised manner and planning deployed only when necessary (e.g. system 2 vs system 1 in humans); a sketch of this pattern follows this list.
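A hedged sketch of that hybrid pattern, in the same toy spirit as above: act from the cheap amortised policy when it is confident, and fall back to explicit search when it is not. The entropy threshold, the stub policy, and the rollout-based planner are illustrative assumptions rather than a description of any particular system.

```python
import math
import random

ACTIONS = list(range(5))


def objective(state, action):
    # Stand-in for a model/simulator the planner can query.
    return -abs(action - (state % 5)) + random.gauss(0, 0.01)


def amortised_policy(state):
    # Stand-in for a trained "system 1" policy: action probabilities.
    probs = [0.05] * len(ACTIONS)
    probs[state % 5] = 0.8  # pretend the learned policy is confident here
    return probs


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def plan(state, rollouts=50):
    # Explicit "system 2" search: estimate each action's value by rollouts,
    # then argmax. This is direct optimisation, paid for at inference time.
    scores = {a: sum(objective(state, a) for _ in range(rollouts)) / rollouts
              for a in ACTIONS}
    return max(scores, key=scores.get)


def hybrid_act(state, entropy_threshold=1.0):
    probs = amortised_policy(state)
    if entropy(probs) < entropy_threshold:
        # Confident: just evaluate the learned function (amortised path).
        return max(range(len(ACTIONS)), key=lambda a: probs[a])
    # Uncertain: deploy planning (direct path).
    return plan(state)


print(hybrid_act(3))  # the stub policy is confident here, so this takes the amortised path
```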
Limits of GPT
LLMs are almost purely amortised optimisers. Scaled up to superintelligence, they would still be amortised optimisers. During inference GPT is not answering the question: “what distribution over tokens would minimise my (cumulative future) predictive loss given my current prompt/context?” but instead the question: “what distribution over tokens does the policy I learned return for this prompt/context?”
There’s a very real sense in which GPT does not care about/is not trying to minimise its predictive loss during inference; it’s just evaluating the policy it learned during training.
And this won’t change even if GPT is scaled up to superintelligence; that just isn’t the limit that GPT converges to.
In the limit, GPT is just a much better function approximator (of the universe implied by its training data) with a more powerful/capable policy. It is still not an agent trying to minimise its predictive loss.
Direct optimisation is an inadequate ontology to describe the kind of artifact GPT is.
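To restate the point in code: a hedged, framework-free sketch in which `model` is a hard-coded stand-in for GPT’s learned forward pass (a real model would be a trained transformer). At inference, GPT does the first thing below, evaluating the learned function and sampling; it does not do the second, which would require an oracle for its own future predictive loss that simply is not available at inference time.

```python
import random


def model(context):
    # Stand-in for GPT's learned forward pass: maps a context (a list of
    # tokens) to a distribution over the next token. A real model would be
    # a trained transformer; this toy version hard-codes the learned policy.
    if context and context[-1] == "cat":
        return {"sat": 0.7, "ran": 0.2, ".": 0.1}
    return {"the": 0.5, "cat": 0.3, ".": 0.2}


def gpt_inference(context):
    # What GPT actually does: evaluate the learned function on the input
    # and sample from the result. No loss is consulted and nothing is
    # argmin-ed here.
    dist = model(context)
    tokens, probs = zip(*dist.items())
    return random.choices(tokens, weights=probs, k=1)[0]


def predicted_future_loss(context, token):
    # Purely hypothetical oracle: the cumulative future predictive loss
    # that emitting `token` would lead to. Nothing like this is available
    # to GPT at inference time.
    raise NotImplementedError


def direct_optimiser_inference(context, candidate_tokens):
    # What GPT does NOT do: search over possible outputs for the one that
    # minimises its predictive loss.
    return min(candidate_tokens, key=lambda t: predicted_future_loss(context, t))


print(gpt_inference(["the", "cat"]))  # a policy evaluation, not an argmin
```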
[QuintinPope is the one who pointed me in the direction of Beren’s article, but this is my own phrasing/presentation of the argument.]
I heavily recommend Beren’s “Deconfusing Direct vs Amortised Optimisation”. It’s a very important conceptual clarification.