Something like this is the hope, though it’s a bit tricky: features that represent “human-expert-level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques, which mostly amount to looking at maximally activating dataset exemplars. But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.