A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer.
In this case it seems like you’d have a policy that uses “optimization machinery” to:
Predict what other agents are going to do
Create plans to achieve some form of inner objective
So, the model-outputted-by-the-base-optimization is a policy that chooses how to use the optimization machinery, not the optimization machinery itself. This seems substantially different from your initial concept of a mesa-optimizer:
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
and seems more like a subagent. But perhaps I’ve misunderstood what you meant by a mesa-optimizer.
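To make the distinction I'm gesturing at concrete, here's a minimal Python sketch. Everything in it (the function names, the toy objectives, the numbers) is invented for this comment rather than taken from the paper; it's only meant to show the structural difference between a policy that is a search and a policy that calls a search.

```python
# Purely illustrative; none of these names or objectives come from the paper.

def search(objective, candidates):
    """Generic "optimization machinery": return the candidate that scores
    highest under the given objective."""
    return max(candidates, key=objective)

def policy_that_is_an_optimizer(observation, candidate_actions):
    """A learned policy that *is* an optimizer: it simply searches over
    actions for the one that best satisfies some mesa-objective."""
    mesa_objective = lambda action: -abs(action - observation)  # toy stand-in
    return search(mesa_objective, candidate_actions)

def policy_that_uses_an_optimizer(observation, candidate_actions):
    """A learned policy that *contains* optimization machinery and chooses
    how to apply it: once to predict another agent, once to plan toward
    an inner objective."""
    other_agent_objective = lambda action: action * observation        # toy stand-in
    predicted_other = search(other_agent_objective, candidate_actions)
    inner_objective = lambda action: -(action - predicted_other) ** 2  # toy stand-in
    return search(inner_objective, candidate_actions)

if __name__ == "__main__":
    actions = [0, 1, 2, 3]
    print(policy_that_is_an_optimizer(2, actions))    # 2
    print(policy_that_uses_an_optimizer(2, actions))  # 3
```

The first function matches my reading of your definition (the policy just is a search over actions); the second is the "policy that chooses how to use the optimization machinery" structure I described above.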
The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:
For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.
The idea there isn’t that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because that’s a good way of approximating the output of human-like reasoning). In that case, the agent really is a mesa-optimiser, to the degree that a goal-directed human-like reasoner is an optimiser.
(I’m not sure to what degree it’s actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)
Just to make sure I understand, this example assumes that the base objective is “predict human behavior”, and doesn’t apply to most base objectives, right?
Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.
I do think this is a pretty speculative argument, even for this sequence.
The idea would be that all of this would be learned—if the optimization machinery is entirely internal to the system, it can choose how to use that optimization machinery arbitrarily. We talk briefly about systems where the optimization is hard-coded, but those aren’t mesa-optimizers. Rather, we’re interested in situations where your learned algorithm itself performs optimization internal to its own workings—optimization it could re-use to do prediction or vice versa.
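To illustrate the hard-coded case we have in mind versus the learned case, here is a rough sketch; the names and toy numbers are stand-ins made up for this reply, not anything from the paper.

```python
import random

def learned_value_fn(action):
    """Toy stand-in for a learned component (e.g. a trained network)."""
    return -(action - 0.3) ** 2

def hard_coded_search(value_fn, n_samples=100):
    """Optimization the programmers wrote: an explicit random search
    wrapped around the learned component. The learned part is just a
    value function here, so this is hard-coded optimization, not
    mesa-optimization."""
    candidates = [random.uniform(-1.0, 1.0) for _ in range(n_samples)]
    return max(candidates, key=value_fn)

# Mesa-optimization would be the case where the trained model itself --
# the algorithm the base optimizer found, not the wrapper we wrote --
# implements something like hard_coded_search internally, and can point
# that internal search at prediction or at planning.

if __name__ == "__main__":
    print(hard_coded_search(learned_value_fn))  # approximately 0.3
```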
It sounds like there was a misunderstanding somewhere: I'm aware that all of this would be learned. My point is that the learned policy contains an optimizer rather than being one, which seems like a significant distinction, and your original definition sounded like you wanted the learned policy itself to be an optimizer.