I see John agrees with the ‘one-time’ label, but it seems a bit too strong to me, especially if the kind of optimization is ‘let’s try a totally different approach’ rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other. Just to think it through:
There are three types of system that are important: type A, which fails on the validation/holdout data; type B, which succeeds on validation but not on test/real-world data; and type C, which succeeds on both. We are looking for type C, and we use the validation data to distinguish A from either B or C.
Naively, waiting longer for a system that is not-A wouldn’t have a bearing on whether it is B or C. But each time we find an A, we learn that training is finding the sensor-spoofing strategy, and the more As we find, the more we suspect this strategy is dominant, which suggests that partial spoofing (B) is more likely than no spoofing (C). Therefore, when we find a not-A after a series of As, it is more likely to be B than if we had found a not-A on our first try.
I agree with the logic, but it seems like our expectation of the B:C ratio will increase smoothly over time. If the holdout sensors are different from the non-holdout ones, costly to spoof, and any leakage is minimized (maximizing our initial expectation of the C:B ratio), then finding a not-A seems like meaningful evidence in favor of C for a while.
Not to say that this solves ELK, but it seems like it should remain (ever-weaker) evidence in favor of honesty for multiple iterations, though I can’t say I know how steep the fall-off should be.
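To make that concrete, here is a toy model of the update (purely a sketch: the parameter q, the flat prior, and the specific numbers are placeholders I am making up, not anything from the original proposal). Suppose each training run produces a spoofer with unknown probability theta, and a spoofer also manages to fool the holdout sensors with probability q, so P(A) = theta*(1-q), P(B) = theta*q, P(C) = 1-theta. Each observed A pushes the posterior on theta up, which raises P(B | not-A) for the next system that passes:

```python
import numpy as np

q = 0.2                            # assumed chance a spoofer also beats the holdout sensors
theta = np.linspace(0, 1, 10001)   # unknown chance that a training run produces a spoofer
prior = np.ones_like(theta)        # flat prior over theta (placeholder choice)

def p_B_given_not_A(num_As_seen):
    # Posterior over theta after observing `num_As_seen` type-A systems;
    # each A has likelihood theta * (1 - q).
    posterior = prior * (theta * (1 - q)) ** num_As_seen
    posterior = posterior / posterior.sum()
    # For the next system: P(B) = E[theta * q], P(C) = E[1 - theta] under the posterior.
    p_B = (posterior * theta * q).sum()
    p_C = (posterior * (1 - theta)).sum()
    return p_B / (p_B + p_C)

for k in [0, 1, 2, 5, 10]:
    print(f"P(B | not-A) after {k:2d} observed As: {p_B_given_not_A(k):.2f}")
```

With these placeholder numbers, P(B | not-A) climbs from roughly 0.17 with no observed As to roughly 0.69 after ten, so the evidence a passing system gives for honesty degrades smoothly rather than vanishing after a single use, which is the shape of fall-off I have in mind.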
This could also be extended by having multiple levels of holdout data, with each subsequent level only evaluated once we have sufficient confidence that the system is honest (accounting for the declining level of evidence given by previous levels, and assuming there are other means of testing).
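Sketching what that staging might look like (everything below is invented for illustration: the function names, the confidence bookkeeping, and the threshold are stand-ins, not a worked-out protocol):

```python
def staged_holdout_evaluation(system, holdout_levels, evaluate_on,
                              prior_confidence=0.5, threshold=0.95):
    """Run `system` against holdout levels one at a time, keeping each later
    level sealed until the evidence gathered so far clears a confidence bar."""
    confidence = prior_confidence
    for level, data in enumerate(holdout_levels):
        passed, evidence = evaluate_on(system, data)   # evidence assumed to lie in [0, 1]
        if not passed:
            return "failed", f"spoofing caught at holdout level {level}; back to the drawing board"
        # Discount later levels: each successive pass is weaker evidence than the last.
        confidence = min(1.0, confidence + evidence / (level + 1))
        if level + 1 < len(holdout_levels) and confidence < threshold:
            return "paused", (f"passed level {level}, but confidence {confidence:.2f} is below "
                              f"{threshold}; gather evidence by other means before unsealing level {level + 1}")
    return "passed", f"cleared all {len(holdout_levels)} levels with confidence {confidence:.2f}"
```

The intent of the ordering, as I picture it, is that the later holdout levels stay untouched until the earlier evidence justifies spending them, so their evidential value is preserved.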
I agree. There’s nothing magical about “once”. I almost wrote “once or twice”, but it didn’t sit well with the level of caution I would prefer be the norm. While your analysis seems correct, I would be worried if that were the plan.
I think a safety team should go into things with the attitude that this type of thing is an important last line of defense, but one that should never trigger. The plan should involve a strong argument that what’s being built is safe. In fact, if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing board, take the new information into account, and come up with a more well-argued plan. The new plan can have “never to be used” safeguards like this, but hopefully it has more and different ones this time.
If, on the other hand, a safety team goes in with the idea that John’s safeguard can be iterated a few times as you argue, then I anticipate them fooling themselves by iterating too many times and coming up with a plan that accidentally skirts the safeguard in some hard-to-notice way.
(I have no reason to expect these sorts of anticipations to be calibrated; I’m just thinking cautiously here.)
Yeah fully agreed.