ryan_greenblatt comments on Measurement tampering detection as a special case of weak-to-strong generalization

ryan_greenblatt 24 Dec 2023 2:48 UTC
Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).