Although, on the other hand, decade-plus-old arguments about the instrumental utility of good behavior while dependent on humans have more or less the same form: seeing good behavior is better evidence of intelligence (capabilities generalizing) than of benevolence (goals ‘generalizing’).
The big difference is that the olde-style argument was about actual agents being evaluated by humans, while the mesa-optimizer argument is about potential configurations of a reinforcement learner being evaluated by a reward function.