I give a counterargument to this in the typo-riddled, poorly written Tweet here. Sadly, I won’t have a chance to write these thoughts up more cleanly here for a few days.
ETA: Briefly, the key points are:
Honesty issues for introspection aren’t obviously much worse than they are for simple probing. (But fair if you’re already not excited about simple probing; see the sketch below for what I mean by a probe.)
When you can ask models arbitrary questions about their cognition, I think it’s probably quite difficult for a model to tell on which inputs it can get away with lying.
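For concreteness, here’s a minimal sketch of the kind of simple probing I have in mind: fit a linear classifier on a model’s hidden activations to predict honest vs. dishonest behavior. Everything below is a placeholder (random activations, made-up labels, arbitrary layer), not a real pipeline.

```python
# Minimal "simple probing" sketch: logistic regression on hidden activations.
# The activations and labels are synthetic stand-ins for real model data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512

# Pretend these are residual-stream activations collected while the model
# answered questions, labeled honest (0) vs. dishonest (1).
honest_acts = rng.normal(loc=0.0, size=(200, d_model))
dishonest_acts = rng.normal(loc=0.3, size=(200, d_model))  # hypothetical shift

X = np.concatenate([honest_acts, dishonest_acts])
y = np.concatenate([np.zeros(200), np.ones(200)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```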
Something like this is the hope, though it’s a bit tricky: features that represent “human-expert-level intelligence” might be hard to distinguish from features for “actually correct” using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.
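For reference, here’s a rough sketch of the maximally-activating-exemplars technique I’m gesturing at: project a dataset’s activations onto one feature direction and read off the top-scoring examples. The feature vector and per-example activations below are placeholders (random data), standing in for, e.g., an SAE decoder column and real model activations.

```python
# Rough sketch of finding maximally activating dataset exemplars for a feature.
# All inputs here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
n_examples = 10_000

# Hypothetical per-example activations and a single feature direction.
activations = rng.normal(size=(n_examples, d_model))
feature_direction = rng.normal(size=d_model)

# Score each example by its projection onto the feature, take the top k.
scores = activations @ feature_direction
top_k = 10
top_idx = np.argsort(scores)[-top_k:][::-1]

# In practice you'd now read the text of these examples to guess what the
# feature represents -- which is exactly where "expert-level" and
# "actually correct" can be hard to tell apart.
print("indices of top-activating exemplars:", top_idx)
print("their scores:", scores[top_idx])
```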