According to me (though at least some, if not all, of the authors of that paper disagree with me), the main point is highlighting the possibility of capabilities generalizing while objectives do not. I agree that this is a failure mode that we knew about before, but it’s not one that people were paying much attention to. At the very least, when people said they worked on “robustness”, they weren’t distinguishing between capability failures and objective failures (though of course the line between these is blurry).
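To make that distinction concrete, here is a toy sketch (my own construction, not an example from the paper; the corridor environment and hyperparameters are arbitrary): tabular Q-learning in a 1-D corridor where the coin that produces reward always sits at the right end during training, so “walk right” and “get the coin” give identical feedback. When the coin moves at test time, the learned behavior still executes competently, but the objective it was trained on is no longer achieved: the capability generalizes, the objective doesn’t.

```python
import random

N = 10                 # corridor cells 0 .. N-1
START = N // 2         # the agent begins in the middle of the corridor
ACTIONS = [-1, +1]     # 0 = step left, 1 = step right

def run_episode(q, coin, eps, learn, alpha=0.5, gamma=0.95, max_steps=100):
    """One episode of tabular Q-learning in a 1-D corridor.

    The agent's state is only its own position, so during training (coin fixed
    at the right end) "walk right" and "get the coin" produce identical feedback.
    """
    pos, total = START, 0.0
    for _ in range(max_steps):
        if random.random() < eps:
            a = random.randrange(2)                  # explore
        else:
            a = 0 if q[pos][0] > q[pos][1] else 1    # act greedily
        nxt = min(max(pos + ACTIONS[a], 0), N - 1)
        r = 1.0 if nxt == coin else 0.0
        if learn:
            target = r if nxt == coin else r + gamma * max(q[nxt])
            q[pos][a] += alpha * (target - q[pos][a])
        total += r
        pos = nxt
        if pos == coin:          # picking up the coin ends the episode
            break
    return total, pos

q = [[0.0, 0.0] for _ in range(N)]

# Training: the coin always sits at the far right, so the proxy "go right"
# is indistinguishable from the intended goal "reach the coin".
for _ in range(500):
    run_episode(q, coin=N - 1, eps=1.0, learn=True)   # random exploration

# Deployment: the coin has moved to the agent's left. The learned behavior
# still executes competently (the agent marches to the right end), but the
# objective it was trained on is no longer achieved.
reward, final_pos = run_episode(q, coin=2, eps=0.0, learn=False)
print(f"test reward: {reward}, final position: {final_pos}")
# Expected outcome (assuming training converged): reward 0.0, final position 9;
# the agent heads right and never goes near the coin at cell 2.
```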
On the other hand, though, decade-plus-old arguments about the instrumental utility of good behavior while still dependent on humans have more or less the same format: seeing good behavior is better evidence of intelligence (capabilities generalizing) than it is of benevolence (goals ‘generalizing’).
The big difference is that the olde-style argument would be about actual agents being evaluated by humans, while the mesa-optimizer argument is about potential configurations of a reinforcement learner being evaluated by a reward function.
(Really minor formatting nitpick, and it’s the kind of thing that really trips me up while reading: you forgot a closing parenthesis somewhere in your comment)
)
oh no
Fixed, thanks.