I expect (1) and (2) to go mostly fine (though see footnote[6]), and (3) to completely fail. For example, suppose Cb has a bunch of features for things like “true according to smart humans.” How will we distinguish those from features for “true”? I think there’s a good chance that this approach will reduce the problem of discriminating Cg vs. Cb to an equally difficult problem of discriminating desired vs. undesired features.
However, if we found that the classifier was using a feature for “How smart is the human asking the question?” to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.
Something like this is the hope, though it’s a bit tricky because features that represent “human expert level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.
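To make “maximally activating dataset exemplars” concrete, here is a minimal sketch of that interpretation technique. The dataset, activation matrix, and `top_exemplars` helper are all hypothetical stand-ins for illustration, not the actual setup discussed above: in practice the activations would come from running a model (e.g. a sparse autoencoder's feature activations) over a real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in "dataset": 1000 text snippets, each assigned a random
# activation vector. Column j holds feature j's activation on each snippet.
snippets = [f"snippet_{i}" for i in range(1000)]
activations = rng.normal(size=(1000, 64))  # (n_examples, n_features)

def top_exemplars(feature_idx, k=5):
    """Return the k dataset examples on which the feature fires hardest,
    paired with their activation values, sorted in descending order."""
    order = np.argsort(activations[:, feature_idx])[::-1][:k]
    return [(snippets[i], float(activations[i, feature_idx])) for i in order]

# Inspect a feature by reading off its strongest-activating examples.
exemplars = top_exemplars(feature_idx=3)
for text, act in exemplars:
    print(f"{text}: {act:.2f}")
```

The limitation flagged above follows directly from this method: if features for “human-expert-level intelligence” and “actually correct” fire on largely overlapping exemplars, reading top exemplars alone won't separate them, which is why better interpretation techniques would be needed.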