I think this is along the right sort of lines. Indeed, this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:
I think premise 1 is big if true, but I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence. It’s also easy to imagine this being true for some categories of static facts about the external world (e.g. Paris being in France), but you need to be careful about extending this to the category of all propositional statements (e.g. the model thinks that this safeguard is adequate, or the model can’t find any security flaws in this program).
Relatedly, your second bullet point assumes that you can unambiguously identify the ‘fact’ related to what the model is currently outputting, and look it up in the model. Does this require you to find all the fact representations in advance, or is this computed on the fly?
I think that detecting/preventing models from knowingly lying would be a good research direction, and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case where I’m bullshitting you rather than lying: I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false).
But yeah, I think this is a reasonable sort of thing to try. You would need to do a lot of work to convince me of premise 1, though, and indeed I doubt premise 1 is true a priori, though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!
And to expand on this a little bit more: it seems important that we hedge against this possibility by at least spending a bit of time thinking about plans that don’t rhyme with ‘I sure hope everything turns out to be a simple correspondence’! I think Eleni and I feel that this is a surprisingly widespread move in interpretability plans, which is maybe why some of the post is quite forceful in arguing against it.