I think maybe the wording here is somewhat confusing with “negative labels” and “positive labels”.
Here’s a diagram explaining what we’re doing:
So, in particular, when I say “the trusted set with negative labels” I mean “label the entire trusted set with negative labels for the probe”. And when I say “inconsistent measurements (which have tampering) with positive labels” I mean “label inconsistent measurements with positive labels for the probe”. I’ll try to improve the wording in the post.
I’ve updated the wording to entirely avoid discussion of positive and negative labels, which just seems to be confusing.
So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering). We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).
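For concreteness, here’s a minimal sketch of what that probe training could look like. This is a simplification assuming a linear probe on frozen last-layer activations (rather than fine-tuning through the model), and every name here (`acts_trusted`, `acts_inconsistent`, `acts_untrusted_positive`, `d_model`, the sizes) is a placeholder, not part of the actual setup:

```python
import torch
import torch.nn as nn

# Hypothetical activations from the measurement-predicting model's last layer.
d_model = 512
acts_trusted = torch.randn(256, d_model)       # trusted set -> class 0 (no tampering)
acts_inconsistent = torch.randn(256, d_model)  # inconsistent measurements -> class 1 (tampering)
# All other (untrusted, consistent) data is excluded from probe training.

X = torch.cat([acts_trusted, acts_inconsistent])
y = torch.cat([torch.zeros(len(acts_trusted)), torch.ones(len(acts_inconsistent))])

probe = nn.Linear(d_model, 1)  # linear probe at the last layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# At test time, score untrusted examples where all measurements are positive.
# If the probe generalizes, higher scores suggest fake positives (tampering)
# and lower scores suggest real positives.
acts_untrusted_positive = torch.randn(128, d_model)  # placeholder activations
with torch.no_grad():
    tamper_scores = torch.sigmoid(probe(acts_untrusted_positive).squeeze(-1))
```

Note that in this framing the probe’s two training classes are just “trusted” versus “inconsistent”; there’s no need to talk about positive or negative labels at all.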