Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.
I’m slightly confused by this. It sounds like “(1) ML systems will do X because X will be rewarded according to the objective, and (2) X will be rewarded according to the objective because being rewarded will accomplish X”. But (2) sounds circular: I see that performing well on the training objective gives influence, but I would’ve thought only effects (direct and indirect) on the objective are relevant in determining which behaviors ML systems pick up, not effects on obtaining influence.
Maybe that’s the intended meaning and I’m just misreading this passage, but maybe I’m missing some deeper point here?
Terrific post, by the way, even now, four years later.
Consider a competent policy that wants paperclips in the very long run. It could reason “I should get a low loss to get paperclips,” and then get a low loss. As a result, it could be selected by gradient descent.
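To make the selection step concrete, here is a minimal toy sketch, not anything from the post itself. It assumes a made-up regression task (`y = 2x`), a hypothetical `Policy` record with a hidden `goal` field, and a brute-force `select` over a fixed list of candidates standing in for gradient descent. The point it illustrates is only that the loss is computed from behavior, so two policies with different motives but identical training behavior are selected equally.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy training data for the target y = 2x (an assumption for illustration).
TRAIN: List[Tuple[float, float]] = [(x, 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]


@dataclass
class Policy:
    name: str
    goal: str  # hidden motive; the selection step never reads this field
    act: Callable[[float], float]


def loss(policy: Policy) -> float:
    """Mean squared error on the training set; depends only on behavior."""
    return sum((policy.act(x) - y) ** 2 for x, y in TRAIN) / len(TRAIN)


def select(policies: List[Policy]) -> List[Policy]:
    """Stand-in for training: keep every policy with the minimum loss."""
    best = min(loss(p) for p in policies)
    return [p for p in policies if loss(p) == best]


candidates = [
    # Gets a low loss because it cares about the task.
    Policy("task_motivated", "do the task", lambda x: 2.0 * x),
    # Gets a low loss because a low loss serves its long-run goal.
    Policy("influence_seeking", "paperclips in the long run", lambda x: 2.0 * x),
    # Cares about the task but performs badly on it.
    Policy("incompetent", "do the task", lambda x: x + 1.0),
]

selected = select(candidates)
print([p.name for p in selected])
```

In this sketch `select` never inspects `goal`, so the paperclip-motivated policy survives alongside the task-motivated one, while the incompetent one is dropped. That matches the non-circular reading: only effects on the objective determine what gets selected, and the influence-seeking policy's motive matters only as the reason it behaves well.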