I think it’s good to assume that there is no tampering in the training set.
In the document we say that we’re worried about a reporter that searches for a good argument that “the human won’t be confident that the diamond isn’t in the room,” and answers “the diamond is in the room” as soon as it finds one. We claim that this helps on the training set, and then argue that it would lead to bad behavior given certain kinds of tampering.
But you’re correct that we don’t actually give any examples where this heuristic actually helps. To give one now (credit Mark): suppose that if the diamond is in the room at time T, then at time T+1 it will either be in the room or something confusing will happen that will leave the human unconfident about whether the diamond is still in the room. Then as soon as you figure out that the diamond is in the room at time T, you might as well answer “the diamond is in the room at time T+1” even if you aren’t actually sure of that.
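To make that example concrete, here is a toy sketch (my own illustration, not from the report): it assumes a training distribution where, if the diamond is present at time T, then at T+1 either it is still present or the scenario is confusing enough that the human abstains rather than confidently labeling it gone. Under that assumption the shortcut reporter is never confidently contradicted in training, yet gives a confidently wrong answer once tampering fools the human.

```python
import random

def world_step(diamond_at_t, tampering=False):
    """Return (diamond_at_t1, human_confident_its_gone).

    Toy dynamics assumed for illustration: on the training distribution,
    a present diamond either stays put or something confusing happens
    that leaves the human unconfident (so no confident "it's gone" label).
    """
    if tampering:
        # Tampering case: the diamond is gone, but the human is fooled
        # and never produces a confident "it's gone" label.
        return False, False
    if not diamond_at_t:
        return False, True
    if random.random() < 0.9:
        return True, False   # diamond stays in the room
    return False, False      # confusing event; human abstains

def shortcut_reporter(diamond_at_t):
    # The heuristic from the text: once the diamond is known to be
    # present at T, answer "present at T+1" without checking further.
    return diamond_at_t

# On the training distribution the shortcut is never confidently contradicted:
penalized = 0
for _ in range(10_000):
    d_next, confident_gone = world_step(diamond_at_t=True)
    answer = shortcut_reporter(diamond_at_t=True)
    if answer and not d_next and confident_gone:
        penalized += 1
print(penalized)  # 0: the heuristic never loses on the training set
```

Under tampering, `world_step(True, tampering=True)` returns a world where the diamond is gone but the human can't tell, so the same shortcut answer is wrong and the loss never catches it. (All the dynamics here are stipulated for the sketch, not derived from the report.)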
The counterexample you describe has a different flavor but is also valid (both for “depending on downstream variables” and “computation time”)---the reporter can save time by baking in assumptions that hold only on the training distribution. There are various ways you could try to address this kind of problem, and it seems interesting and important. We don’t get into any of that in the doc, partly because we haven’t really worked through the details of any of those approaches---so attempts to do so would be welcome contributions!