What’s the specific way you imagine this failing? Some options:
My proposed list (which borrows from your list):
We find a large number (e.g. 30,000) of features which all look somewhat like truth, though none look exactly like truth. Further analysis doesn’t make it clear which of these are “real” or “actually truth”. Some features look more truth-like and some look less truth-like, but broadly there is a smooth fall-off in how truth-like the features look, such that there isn’t a small set of discrete truth features. No single feature both looks like truth and correlates perfectly with our labeled datasets (see the sketch after this list).
We get a bunch of features that look (at least somewhat) like truth, and we have some vague sense of how they’re computed, but they don’t seem differentiated in how “sketchy” their computational graphs look: either all of them seem to rely on social reasoning or none of them do.
We get a bunch of features that look like truth, but examining what they connect to doesn’t make much sense and only leaves us more confused. There are many diffuse connections, and it’s unclear what they do.
Everything looks fine and we apply the method, but it turns out that none of the features we’ve identified corresponds to “actual truth”, because truth isn’t very salient for the model in the regime we are interested in.
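To make the first failure mode slightly more concrete, here is a minimal sketch (my own illustration, not part of the discussion above) of the kind of check being described: rank a large pool of candidate features by how well each feature’s activations correlate with a labeled true/false dataset, then look at whether the correlations fall off smoothly or cluster into a small discrete set. The array names, sizes, and random stand-in data are all assumptions for illustration.

```python
# Hypothetical sketch: rank candidate "truth-like" features by how well each
# one separates labeled true statements from false ones. Random data is used
# as a stand-in for real feature activations (e.g. from an SAE or probes).
import numpy as np

rng = np.random.default_rng(0)

n_prompts, n_features = 1000, 30_000                    # ~30k candidate features, as in the scenario
activations = rng.normal(size=(n_prompts, n_features))  # stand-in for per-prompt feature activations
labels = rng.integers(0, 2, size=n_prompts)             # 1 = statement labeled true, 0 = false

# Point-biserial correlation of each feature's activation with the truth labels.
centered_acts = activations - activations.mean(axis=0)
centered_labels = labels - labels.mean()
corr = centered_acts.T @ centered_labels / (
    np.linalg.norm(centered_acts, axis=0) * np.linalg.norm(centered_labels) + 1e-12
)

# Sort features by |correlation| and inspect the top of the ranking.
order = np.argsort(-np.abs(corr))
print("top candidate features:", order[:10])
print("their correlations:", np.round(corr[order[:10]], 3))
```

In the failure mode from the first item, the sorted |corr| values would decline gradually, with no clear gap separating a handful of discrete “truth features” from the rest, and no single feature correlating near-perfectly with the labels.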