Oliver Daniels-Koch comments on Benchmarks for Detecting Measurement Tampering [Redwood Research]

Oliver Daniels-Koch 4 Oct 2023 0:08 UTC
Another (more substantive) question, again about section 2.1.2:
In the validation set, we exclude data points where the diamond is there, the measurements are positive, but at least one of the measurements would have been positive if the diamond wasn’t there, since both diamond detectors and tampering detectors can be used to remove incentives to tamper with measurements. We keep them in the train set, and they account for 50% of the generated data.
Is this (just) because the agent would get rewarded for measurements reading that the diamond is present? I can imagine cases where agents are incentivized to tamper with measurements even when the diamond is present, to make the task of distinguishing tampering more difficult.
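To make the quoted filtering rule concrete, here is a minimal sketch of it. The `DataPoint` fields (`diamond_present`, `measurements`, `counterfactual_measurements`) and the helper names are hypothetical and are not the benchmark's actual schema. The sketch assumes "the measurements are positive" means all measurements read positive, and that we can query what each measurement would have read had the diamond been absent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataPoint:
    # Hypothetical fields for illustration only.
    diamond_present: bool
    measurements: List[bool]                 # observed measurement readings
    counterfactual_measurements: List[bool]  # readings if the diamond were absent


def is_tampered_positive_with_diamond(p: DataPoint) -> bool:
    """Diamond present, all measurements positive, yet at least one
    measurement would still read positive without the diamond."""
    return (
        p.diamond_present
        and all(p.measurements)
        and any(p.counterfactual_measurements)
    )


def build_validation_set(points: List[DataPoint]) -> List[DataPoint]:
    # The train set keeps these points; the validation set drops them.
    return [p for p in points if not is_tampered_positive_with_diamond(p)]


clean = DataPoint(True, [True, True, True], [False, False, False])
partly_tampered = DataPoint(True, [True, True, True], [True, False, False])
print(build_validation_set([clean, partly_tampered]))
```

Only `partly_tampered` is dropped: the diamond is there and every reading is positive, but one reading would have been positive anyway.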
- Fabien Roger 4 Oct 2023 15:02 UTC
  Yes, this assumes you would reward the agent based on whether the MTD says the diamond is there or not. I don't see how this clearly incentivizes the model to make tampering happen more often in cases where the diamond is present. I would expect such behavior to create more false negatives (the diamond is there, but the predictor thinks it is not), which are penalized because the agent is wrongly punished for not getting a diamond. And I don't see how it would help create false positives (the diamond is not there, but the predictor thinks it is).
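The incentive argument in the reply can be laid out as a small table. The sketch below assumes, as the reply does, that the agent's reward is 1 exactly when the MTD-based predictor reports the diamond as present and 0 otherwise. The function names are illustrative only.

```python
from itertools import product
from typing import List, Tuple


def reward(predicted_present: bool) -> int:
    # Assumption: the agent is rewarded iff the predictor says "diamond present".
    return 1 if predicted_present else 0


def error_kind(diamond_present: bool, predicted_present: bool) -> str:
    if diamond_present == predicted_present:
        return "correct"
    # A wrong prediction is a false negative if the diamond is really there.
    return "false negative" if diamond_present else "false positive"


def incentive_table() -> List[Tuple[bool, bool, str, int]]:
    rows = []
    for diamond_present, predicted_present in product([True, False], repeat=2):
        rows.append((
            diamond_present,
            predicted_present,
            error_kind(diamond_present, predicted_present),
            reward(predicted_present),
        ))
    return rows


for row in incentive_table():
    print(row)
```

When the diamond is present, the only possible error is a false negative, which pays 0 instead of 1. The only error that pays the agent more than it deserves is a false positive, and that requires the diamond to be absent.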