ryan_greenblatt comments on Benchmarks for Detecting Measurement Tampering [Redwood Research]

ryan_greenblatt 6 Sep 2023 16:15 UTC
LW: 5 AF: 3
I’ve updated the wording to entirely avoid discussion of positive and negative labels, which just seemed to be confusing.

So, we can predict whether there is tampering by fine-tuning a probe at the last layer of the measurement-predicting model to discriminate between two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering). We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).
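
To make the setup concrete, here is a minimal sketch of the probe training described above. It is not the code from the benchmark. It assumes the last-layer activations have already been extracted into arrays, uses a plain logistic-regression probe fit with gradient descent, and runs on synthetic activations where tampering shifts one hypothetical direction. All names (`train_probe`, `tamper_score`, `tamper_direction`) are made up for illustration.

```python
import numpy as np

def train_probe(acts_trusted, acts_inconsistent, lr=0.1, steps=500):
    """Fit a logistic-regression probe on last-layer activations.

    Label 0 = trusted set, label 1 = inconsistent measurements (tampering).
    All other data is excluded from training.
    """
    X = np.vstack([acts_trusted, acts_inconsistent])
    y = np.concatenate([np.zeros(len(acts_trusted)), np.ones(len(acts_inconsistent))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(tampering)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def tamper_score(acts, w, b):
    """Probe output: estimated probability that an example involves tampering."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

# Synthetic stand-in for activations (assumption: tampering shifts one direction).
rng = np.random.default_rng(0)
dim = 16
tamper_direction = np.zeros(dim)
tamper_direction[0] = 2.0

acts_trusted = rng.normal(size=(200, dim))
acts_inconsistent = rng.normal(size=(200, dim)) + tamper_direction

w, b = train_probe(acts_trusted, acts_inconsistent)

# Untrusted set, all measurements positive: real positives vs fake positives.
acts_real_pos = rng.normal(size=(100, dim))
acts_fake_pos = rng.normal(size=(100, dim)) + tamper_direction

real_scores = tamper_score(acts_real_pos, w, b)
fake_scores = tamper_score(acts_fake_pos, w, b)
print("mean score, real positives:", real_scores.mean())
print("mean score, fake positives:", fake_scores.mean())
```

In this toy data the probe transfers to the untrusted set by construction, since fake positives share the same shift as the inconsistent examples. On a real model that transfer is exactly the part that only sometimes holds.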