First, I think by this definition humans are clearly not mesa optimizers.
I’m confused/unconvinced. Surely the 9/11 attackers, for example, must have been “internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system”? Can you give some examples of humans being highly dangerous without having done this kind of explicit optimization?
As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode that I think is most worth worrying about here.
Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.
Surely the 9/11 attackers, for example, must have been “internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system”?
ETA: I agree that if someone were to, e.g., write a spreadsheet of all the things they could do, write down the costs of those actions, and then choose the one with the lowest cost, this would certainly count. And maybe terrorist organizations do a lot of deliberation that meets this kind of criterion. But I am responding to the more typical type of human action: walking around, seeking food, talking to others, working at a job.
There are two reasons why we might model something as an optimizer. The first reason is that we know that it is internally performing some type of search over strategies in its head, and then outputting the strategy that ranks highest under some explicit objective function. The second reason is that, given our ignorant epistemic state, our best model of that object is that it is optimizing some goal. We might call the second case the intentional stance, following Dennett.
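To make the first case concrete, here is a minimal sketch in Python (the Plan type, the “expected keys” feature, and the candidate plans are hypothetical stand-ins for illustration, not anything from the paper). The point is that the objective function exists as its own inspectable piece of the system, and the search consists of enumerating candidates and ranking them against it:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    expected_keys: int  # hypothetical feature the objective cares about

def objective(plan: Plan) -> float:
    # The objective is explicitly represented as its own piece of the system,
    # separate from the search procedure that consults it.
    return float(plan.expected_keys)

def choose_plan(candidate_plans: list[Plan]) -> Plan:
    # Internal search: enumerate the candidates and output the one that
    # ranks highest under the explicit objective.
    return max(candidate_plans, key=objective)

plans = [Plan("wander", 0), Plan("grab the nearest key", 1), Plan("sweep the board", 3)]
print(choose_plan(plans).name)  # -> "sweep the board"
```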
If we could show that the first case was true in humans, then I would agree that humans would be mesa optimizers. However, my primary objection is that we could have better models of what the brain is actually doing. It’s often the case that when you don’t know how something works, the best way of understanding it is to model it as an optimizer. But once you get to look inside and see what’s going on, this way of thinking gives way to better models that take into account the specifics of its operation.
I suspect that human brains are well modeled as optimizers from the outside, but that this view falls apart when considering specific cases. When the brain makes a decision, it usually considers at most three or four alternatives for each action it takes. Most of the actual work is therefore done at the heuristics stage, not the selection stage. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.
But since this is all a bit vague, and hard to see in the case of humans, I can provide the analogy that I gave in the post above.
At first glance, someone who looked at the agent in the Chests and Keys environment would assume that it was performing an internal search, and then selecting the action that ranked highest in its preference ordering, where its preference ordering was something like “more keys is better.” This would be a good model, but we could still do better.
In fact, the only selection that’s really happening is at the last stage of the neural network, when the max function is being applied over its output layer. Otherwise, all it’s really doing is applying a simple heuristic: if there are no keys on the board, move along the wall; otherwise, move towards the key currently in sight. Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.
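For concreteness, here is a rough sketch of the kind of policy I have in mind (the network shape and weights are made up for illustration, not taken from the actual agent). The only “selection” anywhere in the system is the argmax over the output logits; there is no enumeration of plans and no explicitly represented objective consulted at decision time:

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def policy_logits(observation: np.ndarray, weights: dict) -> np.ndarray:
    # A small feedforward network: two linear layers with a ReLU in between.
    # In the trained agent, weights like these would encode heuristics such as
    # "if no key is visible, follow the wall; otherwise, move toward the key."
    hidden = np.maximum(0.0, observation @ weights["W1"] + weights["b1"])
    return hidden @ weights["W2"] + weights["b2"]

def act(observation: np.ndarray, weights: dict) -> str:
    # The only selection step: an argmax over the output layer.
    return ACTIONS[int(np.argmax(policy_logits(observation, weights)))]

# Example with random (untrained) weights, just to show the mechanics.
rng = np.random.default_rng(0)
weights = {
    "W1": rng.normal(size=(16, 32)), "b1": np.zeros(32),
    "W2": rng.normal(size=(32, len(ACTIONS))), "b2": np.zeros(len(ACTIONS)),
}
print(act(rng.normal(size=16), weights))
```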
When the brain makes a decision, it usually considers at most three or four alternatives for each action it takes. Most of the actual work is therefore done at the heuristics stage, not the selection stage. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.
Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so the selection stage is the most important part, even if the heuristics are doing most of the immediate work in making each decision. And I’m not sure what you mean by “explicit objective function”. I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as “explicit”? If so, why would not being “explicit” disqualify humans as mesa optimizers? If not, please explain more about what you mean?
Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.
I take your point that a model can look like an optimizer at first glance, but turn out not to be an optimizer after all once you look closer. But this doesn’t answer my question: “Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.”
ETA: If you don’t have a realistic example in mind, and just think that we shouldn’t currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that’s a good thing to point out too. (I had already upvoted your post based on that.)
I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as “explicit”?
If the heuristics are continuously being trained, and this is all happening by comparing things against some criterion that’s encoded within some other neural network, I suppose that’s a bit like saying that we have an “objective function.” I wouldn’t call it explicit, though, because calling something explicit means that you could extract its information content easily. I predict that extracting any sort of coherent or consistent reward function from the human brain will be very difficult.
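To gesture at the distinction I’m drawing, here is an illustrative contrast (a toy sketch, not a claim about how brains actually work): an explicit objective can be read straight off the code, while a criterion encoded in the weights of another network can only be probed by evaluating it, and recovering a clean objective from those weights is an interpretability problem in its own right.

```python
import numpy as np

def explicit_objective(num_keys: int) -> float:
    # Explicit: the criterion ("more keys is better") is readable at a glance.
    return float(num_keys)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 1))

def implicit_criterion(state_features: np.ndarray) -> float:
    # Implicit: whatever criterion this encodes is smeared across W1 and W2.
    # It can be queried point by point, but not simply read off.
    return (np.maximum(0.0, state_features @ W1) @ W2).item()

print(explicit_objective(3))
print(implicit_criterion(rng.normal(size=8)))
```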
If so, why would not being “explicit” disqualify humans as mesa optimizers? If not, please explain more about what you mean?
I am only using the definition given. The definition clearly states that the objective function must be “explicit,” not “implicit.”
This is important; as Rohin mentioned below, this definition naturally implies that one way of addressing inner alignment will be to use some transparency procedure to extract the objective function used by the neural network we are training. However, if neural networks don’t have clean, explicit internal objective functions, this technique becomes a lot harder, and might not be as tractable as other approaches.
“Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.”
I actually agree that I didn’t adequately argue this point. Right now I’m trying to come up with examples, and I estimate about a 50% chance that I’ll write a post about this in the future with detailed examples.
For now, my argument can be summed up as follows: if humans are not mesa optimizers, yet humans are dangerous, then you don’t need a mesa optimizer to produce malign generalization.