Current LLMs are trivially mesa-optimisers under the original definition of that term.
I don’t get why people are still debating the question of whether future AIs are going to be mesa-optimisers. Unless I’ve missed something about the definition of the term, lots of current AI systems are mesa-optimisers. There were mesa-optimisers around before Risks from Learned Optimization in Advanced Machine Learning Systems was even published.
We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. […] Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
GPT-4 is capable of making plans to achieve objectives if you prompt it to. It can even write code to find the local optimum of a function, or code to train another neural network, making it a mesa-meta-optimiser. If gradient descent is an optimiser, then GPT-4 certainly is.
Being a mesa-optimiser is just not a very strong condition. Any pre-transformer ML paper that tried to train neural networks to find better neural network training algorithms was making mesa-optimisers. It is very mundane and expected for reasonably general AIs to be mesa-optimisers. Any program that can solve even somewhat general problems is going to have a hard time not meeting the definition of an optimiser.
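To make concrete how little it takes, here is a toy sketch of my own (nothing from the paper; the search space and objective are made up for illustration): a program that searches a space of possible outputs for elements that score high on an explicitly represented objective, and thereby already meets the quoted definition.

```python
import random

def toy_optimiser(objective, candidate_outputs, n_samples=200):
    """Search a space of possible outputs for elements that score high
    according to an explicitly represented objective function."""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = random.choice(candidate_outputs)   # sample the search space
        score = objective(candidate)                   # the objective is explicit
        if score > best_score:
            best, best_score = candidate, score
    return best

# A deliberately silly search space and objective, just to meet the definition
plans = ["do nothing", "ask a human", "search the web", "write some code"]
print(toy_optimiser(lambda plan: len(plan), plans))    # picks the "best" plan
```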
Maybe this is some sort of linguistic drift at work, where ‘mesa-optimiser’ has come to refer specifically to a system that is only an optimiser, with one single set of objectives it will always try to accomplish in any situation. Fine.
The result of this imprecise use of the original term, though, as I perceive it, is that people are still debating and researching whether future AIs might start being mesa-optimisers, as if that were relevant to the will-they-kill-us-all question. But, at least sometimes, what they seem to actually concretely debate and research is whether future AIs might possibly start looking through search spaces to accomplish objectives, as if that weren’t a thing current systems obviously already do.
I suspect a lot of the disagreement might be about whether LLMs are something like consistent / context-independent optimizers of e.g. some utility function (they seem very unlikely to be), not whether they’re capable of optimization in various (e.g. prompt-dependent, problem-dependent) contexts.
The top comment also seems to be conflating whether a model is capable of (e.g. sometimes, in some contexts) mesa-optimizing and whether it is (consistently) mesa-optimizing. I interpret the quoted original definition as being about the second, which LLMs probably aren’t, though they’re capable of the first.
This seems like the kind of ontological confusion that the Simulators post discusses at length.
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.
Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with. If you don’t pass any valid function, it doesn’t optimise anything.
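To make the point concrete, an illustrative sketch (scipy’s gradient-based minimize routine is standing in for literal gradient descent here): the objective is an argument to the optimiser, not something baked into it.

```python
from scipy.optimize import minimize

# Same optimisation routine, two different objectives passed in as arguments.
result_a = minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])
result_b = minimize(lambda x: (x[0] + 5.0) ** 2, x0=[0.0])

print(result_a.x)   # ~[ 3.]
print(result_b.x)   # ~[-5.]
```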
GPT-4 will optimise pretty much whatever you prompt it to optimise. Taken by itself, without a prompt asking it to optimise something, it usually doesn’t optimise anything.
I guess you could say GPT-4, unlike gradient descent, can do things other than optimise. But if sometimes not optimising anything excluded you from being an optimiser, humans wouldn’t be considered optimisers either.
So it seems to me that the paper just meant what it said in the quote. If you look through a search space to accomplish an objective, you are, at present, an optimiser.
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.
Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with.
Gradient descent, in this sense of the term, is not an optimizer according to Risks from Learned Optimization.
Consider that Risks from Learned Optimization talks a lot about “the base objective” and “the mesa-objective.” This only makes sense if the objects being discussed are optimization algorithms together with specific, fixed choices of objective function.
“Gradient descent” in the most general sense is—as you note—not this sort of thing. Therefore, gradient descent in that general sense is not the kind of thing that Risks from Learned Optimization is about.
Gradient descent in this general sense is a “two-argument function,” GD(f, o), where f is the thing to be optimized and o is the objective function. The objects of interest in Risks from Learned Optimization are curried single-argument versions of such functions, GD_o(f) for some specific choice of o, considered as a function of f alone.
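In code terms, the distinction is just currying. An illustrative sketch (the tiny GD implementation and the toy objective below are mine, not anything from the paper):

```python
from functools import partial

def GD(f, o, lr=0.1, steps=500, eps=1e-6):
    """Two-argument gradient descent GD(f, o): take parameters f and an
    objective o, and optimise f against whatever o was supplied."""
    for _ in range(steps):
        grad = (o(f + eps) - o(f - eps)) / (2 * eps)   # numerical gradient of o at f
        f = f - lr * grad
    return f

# The objects Risks from Learned Optimization talks about are the curried,
# single-argument versions: a specific objective o baked in, f left free.
base_objective = lambda f: (f - 3.0) ** 2   # "the base objective", written down explicitly
GD_o = partial(GD, o=base_objective)        # GD_o(f): optimises for this o and no other

print(GD_o(0.0))   # ~3.0
```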
It’s fairly common for people to say “gradient descent” when they mean GD_o for some specific o, rather than the more generic GD. This is because in practice (unless you’re doing some weird experimental thing that’s not really “gradient descent” per se), o is always fixed across the course of a run of gradient descent. When you run gradient descent to optimize an f, the result you get was not “optimized by gradient descent in general” (what would that even mean?), it was optimized for whichever o you chose by the corresponding GD_o.
This is what licenses talking about “the base objective” when considering an SGD training run of a neural net. There is a base objective in such runs, it’s the loss function, we know exactly what it is, we wrote it down.
On the other hand, the notion that the optimized f’s would have “mesa-objectives” (that they would themselves be objects like GD_o with their own unchanging o’s, rather than being simply capable of context-dependent optimization of various targets, like GPT-4 or GD) is a non-obvious claim/assumption(?) made by Risks from Learned Optimization. This claim doesn’t hold for GPT-4, and that’s why it is not a mesa-optimizer.
It is surely possible that there are mesa-optimizers present in many, even relatively simple, LLMs. But the question is: how powerful are these? How large is the state space that they can search through? The state space of the mesa-optimizer can’t be larger than the context window it is using to generate the answer, for example, while the state space of the full LLM is much bigger: basically all its weights.
Current LLMs are trivially mesa-optimisers under the original definition of that term.
Do current LLMs produce several options and then compare them according to an objective function?
They do, actually: they evaluate each of the possible output tokens and then emit one of the most probable ones. But I think the concern is more about an AI comparing larger chunks of text (for instance, evaluating paragraphs of a report by stakeholders’ reaction).
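Schematically, that per-token step looks something like this (a toy sketch, not any particular implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    """Score every token in the vocabulary, then emit one of the most probable
    ones. Real decoders add top-k / top-p filtering and so on."""
    logits = logits / temperature
    probs = np.exp(logits - logits.max())   # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Made-up scores over a five-token vocabulary
logits = np.array([2.0, 0.5, -1.0, 3.0, 0.0])
print(sample_next_token(logits))            # usually 3, the highest-scored token
```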