When the brain makes a decision, it usually considers at most three or four alternatives for each action it takes. Most of the actual work is therefore done at the heuristics stage, not the selection stage. And even at the selection stage, I have little reason to believe that the brain is actually comparing alternatives against an explicit objective function.
Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so the selection stage is the most important part even if the heuristics are doing most of the immediate work in making each decision. And I’m not sure what you mean by “explicit objective function”. I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as “explicit”? If so, why would not being “explicit” disqualify humans as mesa optimizers? If not, please explain in more detail what you mean.
Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.
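As an illustration of the kind of system being described (my own sketch, not something from the original post), here is how a “heuristics propose a few candidates, a learned selector scores them” setup might look as a small feedforward network in PyTorch. All names, dimensions, and the choice of four candidates are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a "heuristics" network proposes a handful of candidate
# actions, and a "selection" network scores them. Neither component performs
# explicit search over an objective function; both are plain feedforward maps.

N_CANDIDATES = 4   # "three or four alternatives" per decision
STATE_DIM = 32
ACTION_DIM = 8

class HeuristicsStage(nn.Module):
    """Maps a state to a small set of candidate actions in one forward pass."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, N_CANDIDATES * ACTION_DIM),
        )

    def forward(self, state):
        return self.net(state).view(-1, N_CANDIDATES, ACTION_DIM)

class SelectionStage(nn.Module):
    """Scores each candidate; the criterion is implicit in its weights."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, candidates):
        expanded = state.unsqueeze(1).expand(-1, N_CANDIDATES, -1)
        scores = self.net(torch.cat([expanded, candidates], dim=-1))
        return scores.squeeze(-1)  # shape: (batch, N_CANDIDATES)

heuristics, selector = HeuristicsStage(), SelectionStage()
state = torch.randn(1, STATE_DIM)
candidates = heuristics(state)                                # heuristics do most of the work
chosen = candidates[0, selector(state, candidates).argmax()]  # cheap selection step
```

Note that the selection here is just an argmax over a learned score; whether that learned score counts as an “explicit” objective function is exactly the question at issue below.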
I take your point that some models can look like optimizers at first glance, but on closer inspection turn out not to be optimizers after all. But this doesn’t answer my question: “Can you give some realistic examples/scenarios of “malign generalization” that do not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.”
ETA: If you don’t have a realistic example in mind, and just think that we shouldn’t currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that’s a good thing to point out too. (I had already upvoted your post based on that.)
I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as “explicit”?
If the heuristics are continuously being trained, and this is all happening by comparing things against some criterion that’s encoded within some other neural network, I suppose that’s a bit like saying that we have an “objective function.” I wouldn’t call it explicit, though, because to call something explicit means that you could extract the information content easily. I predict that extracting any sort of coherent or consistent reward function from the human brain will be very difficult.
If so, why would not being “explicit” disqualify humans as mesa optimizers? If not, please explain in more detail what you mean.
I am only using the definition given. The definition clearly states that the objective function must be “explicit,” not “implicit.”
This is important; as Rohin mentioned below, this definition naturally implies that one way of addressing inner alignment will be to use some transparency procedure to extract the objective function used by the neural network we are training. However, if neural networks don’t have clean, explicit internal objective functions, this technique becomes a lot harder, and might not be as tractable as other approaches.
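As a rough sketch of what such a transparency procedure might look like (my own illustration, not a method proposed in the post or in Rohin’s comment), one crude approach is to fit a linear probe on a trained network’s hidden activations and check whether some hypothesized objective signal can be decoded from them. All model, probe, and data names below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: probe a trained policy network's hidden layer for a
# candidate "objective" signal (e.g. predicted return). If no probe recovers a
# coherent signal, that is weak evidence the network has no clean, explicit
# internal objective function -- which is what makes this approach fragile.

STATE_DIM, HIDDEN_DIM, N_ACTIONS = 32, 64, 4

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN_DIM), nn.ReLU())
        self.head = nn.Linear(HIDDEN_DIM, N_ACTIONS)

    def forward(self, state):
        h = self.encoder(state)          # hidden activations we will probe
        return self.head(h), h

policy = PolicyNet()                     # stands in for the trained model
probe = nn.Linear(HIDDEN_DIM, 1)         # tries to read off a scalar objective
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Fake data: states plus a candidate objective value we hypothesize the
# network is tracking internally (in practice, e.g. empirical returns).
states = torch.randn(256, STATE_DIM)
candidate_objective = torch.randn(256, 1)

for _ in range(100):
    with torch.no_grad():
        _, hidden = policy(states)       # freeze the policy, only fit the probe
    loss = nn.functional.mse_loss(probe(hidden), candidate_objective)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"probe reconstruction loss: {loss.item():.3f}")
```

Even if such a probe fits well, the recovered function may be an artifact of the probe rather than an objective the network actually uses, which is part of why this strategy gets much harder when the objective is only implicit in the weights.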
“Can you give some realistic examples/scenarios of “malign generalization” that do not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here.”
I actually agree that I didn’t adequately argue this point. Right now I’m trying to come up with examples, and I estimate about a 50% chance that I’ll write a future post laying them out in detail.
For now, my argument can be summed up as: if humans are not mesa optimizers, yet humans are dangerous, then you don’t need a mesa optimizer to produce malign generalization.