Wait, so, what do you actually do with the holdout data? Your stated proposal doesn’t seem to do anything with it. But, clearly, data that’s simply held out forever is of no use to us.
It seems like this holdout data is the sort of precaution which can be used once. When we see (predicted) sensor tampering, we shut the whole project down. If we use that information to iterate on our design at all we enter into dangerous territory: we’re now optimizing the whole setup to avoid that kind of discrepancy, which means it may become useless for detecting tampering.
From this perspective, your proposal seems less like a recipe for solving ELK, and more like a recipe for knowing that you’ve failed to solve ELK (a one-time-use recipe, at that).
That is exactly correct, yes.
I see John agrees with the ‘one-time’ label, but it seems a bit too strong to me, especially if the kind of optimization is ‘let’s try a totally different approach’ rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other. Just to think it through:
There are three types of system that are important: type A, which fails on the validation/holdout data; type B, which succeeds on validation but not on test/real-world data; and type C, which succeeds on both. We are looking for type C, and we use the validation data to distinguish A from either B or C.
Naively, waiting longer for a system that is not-A wouldn’t have a bearing on whether it is B or C. But upon finding A, we learn that the training process is finding the strategy of spoofing sensors, and the more times we find A, the more we suspect this strategy is dominant, which suggests that partial spoofing (B) is more likely than no spoofing (C). Therefore, when we find not-A after a series of As, it is more likely to be B than if we had found not-A on our first try.
I agree with the logic, but it seems like our expectation of the B:C ratio will increase smoothly over time. If the holdout sensors are different from the non-holdout ones, costly to spoof, and any leakage is minimized (maximizing our initial expectation of the C:B ratio), then finding not-A seems to be meaningful evidence in favor of C for a while.
Not to say that this solves ELK, but it seems like it should remain (ever weaker) evidence in favor of honesty for multiple iterations, though I can’t say I know how steep the fall-off should be.
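To make the fall-off a bit more concrete, here’s a minimal toy model of that argument (entirely my own framing and numbers, not anything from the proposal): assume each retraining run independently produces a spoofer with some unknown rate p, spoofers are fully caught by the holdout sensors (A) with an assumed probability q, and otherwise slip through as partial spoofers (B); honest systems (C) occur at rate 1 − p. Each time the holdout data catches an A, the posterior on p shifts upward, so the odds that the next system to pass the check is C rather than B shrink:

```python
# Toy Bayesian sketch of the argument above (my own framing, not anything
# from the ELK report): each training run independently produces an honest
# system (C) with probability 1 - p, or a spoofer with probability p;
# spoofers are fully caught by the holdout sensors (A) with probability q,
# or only partially spoof and slip through (B) with probability 1 - q.
# We don't know p, so we put a (flat) prior on it and update each time the
# holdout data catches an A.
import numpy as np

p_grid = np.linspace(0.001, 0.999, 999)   # possible values of p
prior = np.ones_like(p_grid)              # flat prior over p (an assumption)
prior /= prior.sum()
q = 0.8                                    # assumed chance a spoofer is fully caught

posterior = prior.copy()
for k in range(6):
    # Odds that the *next* system which passes the holdout check (not-A)
    # is honest (C) rather than a partial spoofer (B).
    p_C = np.sum(posterior * (1 - p_grid))
    p_B = np.sum(posterior * p_grid * (1 - q))
    print(f"after {k} caught spoofers: C:B odds for a passing system = {p_C / p_B:.2f}")
    # Observe another A (a spoofer caught by the holdout data) and update.
    posterior *= p_grid * q
    posterior /= posterior.sum()
```

With these made-up numbers the odds start around 5:1 and drop to roughly even after a handful of catches, matching the intuition that a pass after many catches is much weaker evidence of honesty than a pass on the first try.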
This could also be extended by having multiple levels of holdout data, with the next level only being evaluated once we have sufficient confidence that the system is honest (accounting for the declining level of evidence given by previous levels, and assuming there are other means of testing).
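A very rough sketch of what that escalation could look like (all names, thresholds, and likelihood ratios below are hypothetical illustration, not part of anyone’s actual proposal): each successive holdout level is only unsealed while our credence in honesty is still below a target, a failed level means predicted tampering and shutting down, and the evidence a passed level provides is assumed to decline level by level.

```python
def update_credence(prior: float, likelihood_ratio: float) -> float:
    """Bayesian update of P(honest) after passing a check whose likelihood
    ratio is P(pass | honest) / P(pass | partial spoofer)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


def escalate_holdout_levels(check_level, prior, level_ratios, target=0.95):
    """check_level(i) -> bool runs the level-i holdout comparison (hypothetical).
    Stop escalating once credence reaches the target, leaving later levels
    sealed; a failed level means predicted tampering, so credence drops to 0."""
    credence = prior
    for level, ratio in enumerate(level_ratios):
        if credence >= target:
            break  # confident enough; keep the remaining holdout levels sealed
        if not check_level(level):  # holdout sensors disagree with predictions
            return 0.0  # shut the project down / back to the drawing board
        credence = update_credence(credence, ratio)
    return credence


# Example with made-up numbers: each later level is assumed to provide
# weaker evidence (declining likelihood ratios), per the parenthetical above.
print(escalate_holdout_levels(check_level=lambda i: True,
                              prior=0.5,
                              level_ratios=[8.0, 4.0, 2.0]))
```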
I agree. There’s nothing magical about “once”. I almost wrote “once or twice”, but that didn’t sit well with the level of caution I would prefer to be the norm. While your analysis seems correct, I would be worried if that were the plan.
I think a safety team should go into things with the attitude that this type of thing is an important last line of defense, but should never trigger. The plan should involve a strong argument that what’s being built is safe. In fact, if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing board, take the new information into account, and come up with a more well-argued plan. The new plan can have “never to be used” safeguards like this, but hopefully it has more and different ones this time.
If, on the other hand, a safety team goes in with the idea that John’s safeguard can be iterated a few times as you argue, then I anticipate them fooling themselves by iterating too many times and coming up with a plan that accidentally skirts the safeguard in some hard-to-notice way.
(I have no reason to expect these sorts of anticipations to be calibrated; I’m just thinking cautiously here.)
Yeah, fully agreed.