TL;DR: Gradient descent won’t select for inner search processes because they’re not compute & memory efficient.
Slightly longer TL;DR: A key argument for mesa-optimization is that as we search over programs, we will select for “search processes with simple objectives”, because they are simpler or more compact than alternative less dangerous programs. This argument is much weaker when your program search is restricted to programs that use a fixed amount of compute, and you’re not optimizing strongly for low description length—e.g. gradient descent in modern deep learning systems. We don’t really know what shape of programs gradient descent selects for in realistic environments, but they are much less likely to involve search than commonly believed.
Note on terminology (added in response to comments): By “search” I mean here a process that evaluates a number of candidates before returning the best one; what Abram Demski calls “selection” in Selection vs Control . The more candidates considered, the more “search-like” a process is—with gradient descent and A* being central examples, and a thermostat being a central counter-example.
In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.
One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.
We don’t know that the AI will necessarily end up optimizing reward-button-pushes or smiles; there may be other similarly-compact proxies which correlate near-perfectly with reward in the training process. We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
Compactness, Complexity, and Compute
At face value, it does seem like we’re selecting programs for simplicity. The Deep Double Descent paper showed us that gradient descent training in the overparametrized regime (i.e. the regime of all modern deep models) favors simpler models. But is this notion of simplicity the same as “compactness” or “complexity”? Evan seems to think so, I’m less sure. Let’s dive into the different notions of complexity here.
The most commonly used notion of program complexity is Kolmogorov complexity (or description length), basically just “length of the program in some reference programming language”. This definition seems natural… but, critically, it assumes away all computational constraints. K-complexity doesn’t care if your program completes in a millisecond or runs until the heat death of the universe. This makes it a particularly perverse notion of complexity to use when interacting with the real, computationally constrained world.
Why do we use Kolmogorov complexity at all, then? One reason it’s so intuitive is that it’s the primary notion of complexity that humans programmers use in their work. Programs that humans write are selected to be short, hence easy to write and analyze. Memory and compute use are often secondary concerns, because in most programming contexts human time is so much more valuable than computer time.
A simplistic model for how humans do program search is “What’s the shortest program I can implement that solves the problem exactly?”. The answer is usually some form of explicit search with a simple objective, because with no computational constraints that’s always the shortest description length program (as an extreme example, bogosort is probably the lowest-description-length sorting procedure).
In contrast, the way modern deep learning does program search is “I can tweak the parameters of a massively parallel computation involving at most 50 serial steps, 100 million multiply-adds per step, and 100 kilobytes of memory to store intermediate results; what parameters get me the best approximate solution given this compute budget?”. Here it seems much less clear that the resulting program is going to resemble search; it might well be a diffuse bundle of heuristics, or something else that humans don’t have good intuitions for.
What’s going on here? Once we stop assuming compute is infinite and instead fix a budget, you can no longer rely on the “compression trick” that lets you replace an actual policy with a search for the policy. Search isn’t free; it costs you compute and memory that could have been spent running a better policy, and gradient descent doesn’t select for low Kolmogorov complexity nearly enough (if at all) to compensate.
Some caveats to this conclusion:
This argument only applies to inner search, rather than any form of inner optimization. In particular, gradient descent might select for programs that perform control, even hierarchical control(see selection vs control). It could select for other features of optimization such as “goal-directed behavior” and “internal representations of outcomes”, like a thermostat that diffs the current temperature with the desired temperature to generate the next action. (Thanks to Richard Ngofor helping me clarify this point.)
You might still select for search in certain domains with few exploitable regularities (like RL agents in randomized mazes) where there are no shortcuts so there’s little or no compute penalty to doing search. In such domains, the simplicity bias could prove decisive and the model will indeed learn an inner search process.
The standard example of mesa-optimization is humans w.r.t evolution, but note that evolution is also selecting for low Kolmogorov complexity because of the genomic bottleneck. It doesn’t seem like current gradient descent has such a bottleneck currently (on the contrary, models are massively over-parametrized), but it’s possible that future methods for training neural nets will have it (e.g. because of bandwidth constraints in massively-federated training setups).
Added August 16: John Wentworth responds with two more arguments for why we’ll select for inner search, the existence of general-purpose heuristics and the recursive structure of search.
(Speculative) Implications for interpretability and safety
If most of the danger of inner optimization comes from search and planning[1], this line of reasoning has some unexpected implications. In AI safety we generally think of large black-box models as dangerous, and small interpretable models as safer. But the bias towards explicit search with simple objectives turns out to result from a human bias towards mechanistically interpretable (i.e. low-Kolmogorov-complexity) programs, a bias that often comes at significant computational cost.
If what we really care about when we talk about interpretability is predicting generalization errors, mechanistic interpretability may not be what we want. There is hideous, unpredictable complexity lurking in mechanistically simple programs such as the 3n+1 algorithm and the Rule 110 cellular automaton. The black-box models that gradient descent converges on may turn out to be simpler and more predictable in the sense that really matters. An “agent AI” trained end-to-end with gradient descent may be safer in practice than a “tool AI” that generates C code given a problem description, because the latter shares human programmers’ bias towards explicit search over simple, brittle objectives. Which is where much of the danger lies.
Unclear how true this is—in the limit of capabilities, both search and control processes could be lethal. But intuitively, for a fixed level of capabilities, it seems like search processes are more brittle and likely to mis-generalize catastrophically than control processes.
Gradient descent doesn’t select for inner search
TL;DR: Gradient descent won’t select for inner search processes because they’re not compute & memory efficient.
Slightly longer TL;DR: A key argument for mesa-optimization is that as we search over programs, we will select for “search processes with simple objectives”, because they are simpler or more compact than alternative less dangerous programs. This argument is much weaker when your program search is restricted to programs that use a fixed amount of compute, and you’re not optimizing strongly for low description length—e.g. gradient descent in modern deep learning systems. We don’t really know what shape of programs gradient descent selects for in realistic environments, but they are much less likely to involve search than commonly believed.
Note on terminology (added in response to comments): By “search” I mean here a process that evaluates a number of candidates before returning the best one; what Abram Demski calls “selection” in Selection vs Control . The more candidates considered, the more “search-like” a process is—with gradient descent and A* being central examples, and a thermostat being a central counter-example.
Recap: compression argument for inner optimizers
Here’s the argument from Risks From Learned Optimization: [emphasis mine]
and even more forceful phrasing from John Wentworth:
Compactness, Complexity, and Compute
At face value, it does seem like we’re selecting programs for simplicity. The Deep Double Descent paper showed us that gradient descent training in the overparametrized regime (i.e. the regime of all modern deep models) favors simpler models. But is this notion of simplicity the same as “compactness” or “complexity”? Evan seems to think so, I’m less sure. Let’s dive into the different notions of complexity here.
The most commonly used notion of program complexity is Kolmogorov complexity (or description length), basically just “length of the program in some reference programming language”. This definition seems natural… but, critically, it assumes away all computational constraints. K-complexity doesn’t care if your program completes in a millisecond or runs until the heat death of the universe. This makes it a particularly perverse notion of complexity to use when interacting with the real, computationally constrained world.
Why do we use Kolmogorov complexity at all, then? One reason it’s so intuitive is that it’s the primary notion of complexity that humans programmers use in their work. Programs that humans write are selected to be short, hence easy to write and analyze. Memory and compute use are often secondary concerns, because in most programming contexts human time is so much more valuable than computer time.
A simplistic model for how humans do program search is “What’s the shortest program I can implement that solves the problem exactly?”. The answer is usually some form of explicit search with a simple objective, because with no computational constraints that’s always the shortest description length program (as an extreme example, bogosort is probably the lowest-description-length sorting procedure).
In contrast, the way modern deep learning does program search is “I can tweak the parameters of a massively parallel computation involving at most 50 serial steps, 100 million multiply-adds per step, and 100 kilobytes of memory to store intermediate results; what parameters get me the best approximate solution given this compute budget?”. Here it seems much less clear that the resulting program is going to resemble search; it might well be a diffuse bundle of heuristics, or something else that humans don’t have good intuitions for.
What’s going on here? Once we stop assuming compute is infinite and instead fix a budget, you can no longer rely on the “compression trick” that lets you replace an actual policy with a search for the policy. Search isn’t free; it costs you compute and memory that could have been spent running a better policy, and gradient descent doesn’t select for low Kolmogorov complexity nearly enough (if at all) to compensate.
Some caveats to this conclusion:
This argument only applies to inner search, rather than any form of inner optimization. In particular, gradient descent might select for programs that perform control, even hierarchical control (see selection vs control). It could select for other features of optimization such as “goal-directed behavior” and “internal representations of outcomes”, like a thermostat that diffs the current temperature with the desired temperature to generate the next action. (Thanks to Richard Ngo for helping me clarify this point.)
You might still select for search in certain domains with few exploitable regularities (like RL agents in randomized mazes) where there are no shortcuts so there’s little or no compute penalty to doing search. In such domains, the simplicity bias could prove decisive and the model will indeed learn an inner search process.
The standard example of mesa-optimization is humans w.r.t evolution, but note that evolution is also selecting for low Kolmogorov complexity because of the genomic bottleneck. It doesn’t seem like current gradient descent has such a bottleneck currently (on the contrary, models are massively over-parametrized), but it’s possible that future methods for training neural nets will have it (e.g. because of bandwidth constraints in massively-federated training setups).
Added August 16: John Wentworth responds with two more arguments for why we’ll select for inner search, the existence of general-purpose heuristics and the recursive structure of search.
(Speculative) Implications for interpretability and safety
If most of the danger of inner optimization comes from search and planning[1], this line of reasoning has some unexpected implications. In AI safety we generally think of large black-box models as dangerous, and small interpretable models as safer. But the bias towards explicit search with simple objectives turns out to result from a human bias towards mechanistically interpretable (i.e. low-Kolmogorov-complexity) programs, a bias that often comes at significant computational cost.
If what we really care about when we talk about interpretability is predicting generalization errors, mechanistic interpretability may not be what we want. There is hideous, unpredictable complexity lurking in mechanistically simple programs such as the 3n+1 algorithm and the Rule 110 cellular automaton. The black-box models that gradient descent converges on may turn out to be simpler and more predictable in the sense that really matters. An “agent AI” trained end-to-end with gradient descent may be safer in practice than a “tool AI” that generates C code given a problem description, because the latter shares human programmers’ bias towards explicit search over simple, brittle objectives. Which is where much of the danger lies.
Unclear how true this is—in the limit of capabilities, both search and control processes could be lethal. But intuitively, for a fixed level of capabilities, it seems like search processes are more brittle and likely to mis-generalize catastrophically than control processes.