(I am unfortunately currently bogged down with external academic pressures, and so cannot engage with this at the depth I’d like to, but here’s some initial thoughts.)
I endorse this post. The distinction explained here seems interesting and fruitful.
I agree with the idea to treat selection and control as two kinds of analysis, rather than as two kinds of object – I think this loosely maps onto the distinction we make between the mesa-objective and the behavioural objective. The former takes the selection view of the learned algorithm; the latter takes the control view.
At least speaking for myself (the other authors might have different thoughts on this), the decision to talk explicitly in terms of the selection view in the mesa-optimiser post is based on an intuition that selectors, in general, have more coherently defined counterfactual behaviour. That is, given a very different input, a selector will still select an output that scores well on its mesa-objective, because that’s how selectors work. Whereas a controller, to the degree it optimises for an objective, seems more likely to just completely stop working on a different input. I have fairly low confidence in this argument, however: it seems to me that one can plausibly have pretty coherent counterfactual behaviour in a very broad distribution even without doing selection. And since it is ultimately the behaviour that does the damage, it would be good to have a working distinction that is based purely on that. We (the mesa-optimisation authors) haven’t been able to come up with one.
Another reason to be interested in selectors is that in RL, the learned algorithm is supposed to fill a controller role. So, restricting attention to selectors allows to talk at least somewhat meaningfully about non-optimiser agents, which is otherwise difficult, as any learned agent is in a controller-shaped context.
In any case, I hope that more work happens on this problem, either dissolving the need to talk about optimisation, or at least making all these distinctions more precise. The vagueness of everything is currently my biggest worry about the legitimacy of mesa-optimiser concerns.
Yeah, I agree with most of what you’re saying here.
A learned controller which isn’t implementing any internal selection seems more likely to be incoherent out-of-distribution (ie lack a strong IRL interpretation of its behavior), as compared with a mesa-optimizer;
However, this is a low-confidence argument at present; it’s very possible that coherent controllers can appear w/o necessarily having a behavioral objective which matches the original objective, in which case a version of the internal alignment problem applies. (But this might be a significantly different version of the internal alignment problem.)
I think a crux here is: to what extent are mesa-controllers with simple behavioral objectives going to be simple? The argument that mesa-optimizers can compress coherent strategies does not apply here.
Actually, I think there’s an argument that certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world model; explicit representation of objective within that world model; but, optimal policy does not use any search). There is also other reason to suspect that these could survive techniques which are designed to make sure mesa-optimizers don’t arise: they aren’t expending a bunch of processing power on an internal search, so, you can’t eliminate them with some kind of processing-power device. (Not that we know of any such device that eliminates mesa-optimizers—but if we did, it may not help with rocket-type mesa-controllers.)
Terminology point: I like the term ‘selection’ for the cluster I’m pointing at, but, I keep finding myself tempted to say ‘search’ in an AI context. Possibly, ‘search vs control’ would be better terminology.
to what extent are mesa-controllers with simple behavioural objectives going to be simple?
I’m not sure what “simple behavioural objective” really means. But I’d expect that for tasks requiring very simple policies, controllers would do, whereas the more complicated the policy required to solve a task, the more one would need to do some kind of search. Is this what we observe? I’m not sure. AlphaStar and OpenAI Five seem to do well enough in relatively complex domains without any explicit search built into the architecture. Are they using their recurrence to search internally? Who knows. I doubt it, but it’s not implausible.
certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world-model; explicit representation of objective within that world model; but, optimal policy does not use any search).
The rocket example is interesting. I guess the question for me there is, what sorts of tasks admit an optimal policy that can be represented in this way? Here it also seems to me like the more complex an environment, the more implausible it seems that a powerful policy can be successfully represented with straightforward functions. E.g., let’s say we want a rocket not just to get to the target, but to self-identify a good target in an area and pick a trajectory that evades countermeasures. I would be somewhat surprised if we can still represent the best policy as a set of non-searchy functions. So I have this intuition that for complex state spaces, it’s hard to find pure controllers that do the job well.
Yeah, I agree that this seems possible, but extremely unclear. If something uses a fairly complex algorithm like FFT, is it search? How “sophisticated” can we get without using search? How can we define “search” and “sophisticated” so that the answer is “not very sophisticated”?
(I am unfortunately currently bogged down with external academic pressures, and so cannot engage with this at the depth I’d like to, but here’s some initial thoughts.)
I endorse this post. The distinction explained here seems interesting and fruitful.
I agree with the idea to treat selection and control as two kinds of analysis, rather than as two kinds of object – I think this loosely maps onto the distinction we make between the mesa-objective and the behavioural objective. The former takes the selection view of the learned algorithm; the latter takes the control view.
At least speaking for myself (the other authors might have different thoughts on this), the decision to talk explicitly in terms of the selection view in the mesa-optimiser post is based on an intuition that selectors, in general, have more coherently defined counterfactual behaviour. That is, given a very different input, a selector will still select an output that scores well on its mesa-objective, because that’s how selectors work. Whereas a controller, to the degree it optimises for an objective, seems more likely to just completely stop working on a different input. I have fairly low confidence in this argument, however: it seems to me that one can plausibly have pretty coherent counterfactual behaviour in a very broad distribution even without doing selection. And since it is ultimately the behaviour that does the damage, it would be good to have a working distinction that is based purely on that. We (the mesa-optimisation authors) haven’t been able to come up with one.
Another reason to be interested in selectors is that in RL, the learned algorithm is supposed to fill a controller role. So, restricting attention to selectors allows to talk at least somewhat meaningfully about non-optimiser agents, which is otherwise difficult, as any learned agent is in a controller-shaped context.
In any case, I hope that more work happens on this problem, either dissolving the need to talk about optimisation, or at least making all these distinctions more precise. The vagueness of everything is currently my biggest worry about the legitimacy of mesa-optimiser concerns.
Yeah, I agree with most of what you’re saying here.
A learned controller which isn’t implementing any internal selection seems more likely to be incoherent out-of-distribution (ie lack a strong IRL interpretation of its behavior), as compared with a mesa-optimizer;
However, this is a low-confidence argument at present; it’s very possible that coherent controllers can appear w/o necessarily having a behavioral objective which matches the original objective, in which case a version of the internal alignment problem applies. (But this might be a significantly different version of the internal alignment problem.)
I think a crux here is: to what extent are mesa-controllers with simple behavioral objectives going to be simple? The argument that mesa-optimizers can compress coherent strategies does not apply here.
Actually, I think there’s an argument that certain kinds of mesa-controllers can be simple: the mesa-controllers which are more like my rocket example (explicit world model; explicit representation of objective within that world model; but, optimal policy does not use any search). There is also other reason to suspect that these could survive techniques which are designed to make sure mesa-optimizers don’t arise: they aren’t expending a bunch of processing power on an internal search, so, you can’t eliminate them with some kind of processing-power device. (Not that we know of any such device that eliminates mesa-optimizers—but if we did, it may not help with rocket-type mesa-controllers.)
Terminology point: I like the term ‘selection’ for the cluster I’m pointing at, but, I keep finding myself tempted to say ‘search’ in an AI context. Possibly, ‘search vs control’ would be better terminology.
I’m not sure what “simple behavioural objective” really means. But I’d expect that for tasks requiring very simple policies, controllers would do, whereas the more complicated the policy required to solve a task, the more one would need to do some kind of search. Is this what we observe? I’m not sure. AlphaStar and OpenAI Five seem to do well enough in relatively complex domains without any explicit search built into the architecture. Are they using their recurrence to search internally? Who knows. I doubt it, but it’s not implausible.
The rocket example is interesting. I guess the question for me there is, what sorts of tasks admit an optimal policy that can be represented in this way? Here it also seems to me like the more complex an environment, the more implausible it seems that a powerful policy can be successfully represented with straightforward functions. E.g., let’s say we want a rocket not just to get to the target, but to self-identify a good target in an area and pick a trajectory that evades countermeasures. I would be somewhat surprised if we can still represent the best policy as a set of non-searchy functions. So I have this intuition that for complex state spaces, it’s hard to find pure controllers that do the job well.
Yeah, I agree that this seems possible, but extremely unclear. If something uses a fairly complex algorithm like FFT, is it search? How “sophisticated” can we get without using search? How can we define “search” and “sophisticated” so that the answer is “not very sophisticated”?