The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about; the other is seen in the example:
For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.
The idea there isn’t that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because that’s a good way of approximating the output of human-like reasoning). In that case, the agent really is a mesa-optimiser, to the degree that a goal-directed human-like reasoner is an optimiser.
(I’m not sure to what degree it’s actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)
Just to make sure I understand, this example assumes that the base objective is “predict human behavior”, and doesn’t apply to most base objectives, right?
Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.
I do think this is a pretty speculative argument, even for this sequence.