Sam Marks comments on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks 21 Oct 2024 16:41 UTC
2 points
Something like this is the hope, though it’s a bit tricky because features that represent “human expert level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.