I expect (1) and (2) to go mostly fine (though see footnote[6]), and (3) to completely fail. For example, suppose Cb has a bunch of features for things like “true according to smart humans.” How will we distinguish those from features for “true”? I think there’s a good chance that this approach will reduce the problem of discriminating Cg vs. Cb to an equally difficult problem of discriminating desired vs. undesired features.
However, if we found that the classifier was using a feature for “How smart is the human asking the question?” to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.
Something like this is the hope, though it’s a bit tricky because features that represent “human expert level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.
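To make “maximally activating dataset exemplars” concrete, here is a minimal sketch of that interpretation technique. The dataset, activation matrix, and `top_exemplars` helper are all hypothetical stand-ins for illustration, not the actual setup discussed above: in practice the activations would come from running a model (e.g. a sparse autoencoder's feature activations) over a real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in "dataset": 1000 text snippets, each assigned a random
# activation vector. Column j holds feature j's activation on each snippet.
snippets = [f"snippet_{i}" for i in range(1000)]
activations = rng.normal(size=(1000, 64))  # (n_examples, n_features)

def top_exemplars(feature_idx, k=5):
    """Return the k dataset examples on which the feature fires hardest,
    paired with their activation values, sorted in descending order."""
    order = np.argsort(activations[:, feature_idx])[::-1][:k]
    return [(snippets[i], float(activations[i, feature_idx])) for i in order]

# Inspect a feature by reading off its strongest-activating examples.
exemplars = top_exemplars(feature_idx=3)
for text, act in exemplars:
    print(f"{text}: {act:.2f}")
```

The limitation flagged above follows directly from this method: if features for “human-expert-level intelligence” and “actually correct” fire on largely overlapping exemplars, reading top exemplars alone won't separate them, which is why better interpretation techniques would be needed.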