Sammy Martin comments on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Sammy Martin 26 Jul 2023 17:26 UTC
LW: 5 AF: 3
0
AF
Excellent! In particular, it seems like oversight techniques which can pass tests like these could work in worlds where alignment is very difficult, so long as AI progress doesn’t involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes).
I’d say this corresponds to 7 on my alignment difficulty table.
- Buck 26 Jul 2023 17:33 UTC
  LW: 5 AF: 1
  3
  AF Parent
  What do you mean by “so long as AI progress doesn’t involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes)”? I’m not proposing that we rely on any analogies between low and high capability regimes.
  - Sammy Martin 26 Jul 2023 17:41 UTC
    LW: 2 AF: 1
    0
    AF Parent
    In the sense that there has to be an analogy between low and high capabilities somewhere, even if at the meta level.
    This method lets you catch dangerous models that can break oversight processes for the same fundamental reasons as less dangerous models, not just for the same inputs.
    - Buck 26 Jul 2023 17:44 UTC
      LW: 5 AF: 1
      7
      AF Parent
      Why does there have to be an analogy between low and high capabilities somewhere? The goal of this method is to evaluate the adequacy of an oversight procedure for a particular model on a particular distribution. It does not involve generalizing from lower capabilities to higher capabilities: you just run this eval on the actual model for which you care about measuring local adequacy of oversight.
      (Sorry to come in disagreeably here, but I think of this as a really important point, so want to make sure we argue it out.)
      - Sammy Martin 26 Jul 2023 17:47 UTC
        LW: 2 AF: 1
        0
        AF Parent
        You’re right, I’ve reread the section and that was a slight misunderstanding on my part.
        Even so, I still think it falls at a 7 on my scale, as it’s a way of experimentally validating oversight processes that gives you some evidence about how they’ll work in unseen situations.
        - Buck 26 Jul 2023 17:54 UTC
          LW: 5 AF: 1
          3
          AF Parent
          I’d say the main point here is that I don’t want to rely on my ability to extrapolate anything about how the model behaves in “unseen situations”; I want to run this eval in every situation where I’m deploying my model.