Joe Carlsmith comments on New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe Carlsmith 27 Nov 2023 22:44 UTC
Agents that end up intrinsically motivated to get reward on the episode would be “terminal training-gamers/reward-on-the-episode seekers,” and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on “non-schemers with schemer-like traits”), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.