First, suppose GPT-n literally just has a “what a human would say” feature and a “what do I [as GPT-n] actually believe” feature, and those are the only two consistently useful truth-like features that it represents, and that using our method we can find both of them. This means we literally only need one more bit of information to identify the model’s beliefs.
One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50⁄50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.
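To make this disambiguation step concrete, here's a rough sketch in Python (nothing in it is from the original post: `probe_a`, `probe_b`, and `hard_inputs` are hypothetical names, and each probe is assumed to map an input to a calibrated P(True)):

```python
import numpy as np

def mean_confidence(probe, hard_inputs):
    """Average distance from 0.5 of the probe's P(True) on inputs that
    (by assumption) only a superhuman model can evaluate."""
    probs = np.array([probe(x) for x in hard_inputs])  # each value in [0, 1]
    return float(np.mean(np.abs(probs - 0.5)))

def pick_belief_feature(probe_a, probe_b, hard_inputs):
    """Return whichever candidate truth-like feature stays confident on the
    hard inputs; the other behaves like 'what a human would say'."""
    if mean_confidence(probe_a, hard_inputs) > mean_confidence(probe_b, hard_inputs):
        return probe_a
    return probe_b
```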
For n such that GPT-n is superhuman, I think one could alternatively differentiate between these two options by checking which is more consistent under implications, by which I mean that whenever the representation says that the propositions P and P→Q are true, it should also say that Q is true. (Here, for a language model, P and Q could be ~any assertions written in natural language.) Or more generally, in addition to modus ponens, also construct new propositions with ANDs and ORs, and check against all the inference rules of propositional (zeroth-order) logic, or do this for first-order logic or whatever. (Alternatively, we can also write down versions of these constraints that apply to probabilities.) Assuming [more intelligent ⇒ more consistent] (w.r.t. the same set of propositions), for a superhuman model, the model’s beliefs would probably be the more consistent feature. (Of course, one could also just add these additional consistency constraints directly into the loss in CCS instead of doing a second deductive step.)
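As a rough sketch of what the implication-consistency check could look like (again just my own illustration, assuming a `probe` that maps a natural-language assertion to P(True), and using one arbitrary way of phrasing the implication): it uses the probabilistic counterpart of modus ponens, P(Q) ≥ P(P) + P(P→Q) − 1, and scores a candidate feature by its average violation, so the lower-scoring feature is the more consistent one.

```python
def modus_ponens_violation(probe, p_text, q_text):
    """How badly the probe violates the probabilistic form of modus ponens,
    P(Q) >= P(P) + P(P -> Q) - 1; returns 0 when the constraint holds."""
    p = probe(p_text)                                    # P('P' is true)
    p_implies_q = probe(f"If {p_text}, then {q_text}.")  # P('P -> Q' is true)
    q = probe(q_text)                                    # P('Q' is true)
    return max(0.0, (p + p_implies_q - 1.0) - q)

def implication_consistency_score(probe, pq_pairs):
    """Average modus ponens violation over a list of (P, Q) sentence pairs;
    the candidate feature with the lower score is the more consistent one."""
    return sum(modus_ponens_violation(probe, p, q) for p, q in pq_pairs) / len(pq_pairs)
```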
I think this might even be helpful for differentiating the model’s beliefs from what it models some other clever AI as believing or what it thinks would be true in some fake counterfactual world, because presumably it makes sense to devote less of one’s computation to ironing out incoherence in these counterfactuals – for humans, it certainly seems computationally much easier to consistently tell the truth than to consistently talk about what would be the case in some counterfactual of similar complexity to reality (e.g. to lie).
Hmm, after writing the above and thinking about it more, I guess it seems plausible that the feature most consistent under negations is already more likely to be the model’s true beliefs, for the same reasons as given in the previous paragraph. Testing modus ponens (and other inference rules) still seems like a much stronger check though, and in any case it could be useful for constraining the search.
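For comparison, here's what scoring negation consistency for an already-found candidate feature could look like: it just reuses (what I believe is) the consistency term from the CCS loss as a post-hoc diagnostic, with the same assumed `probe` interface and pairs of a statement with its negation.

```python
def negation_consistency_score(probe, statement_pairs):
    """Average squared gap between P('X') and 1 - P('not X') over a list of
    (statement, negated_statement) pairs; lower means the candidate feature
    is more consistent under negation."""
    return sum((probe(x) - (1.0 - probe(neg_x))) ** 2
               for x, neg_x in statement_pairs) / len(statement_pairs)
```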
(There are a bunch of people that should be thanked for contributing to the above thoughts in discussions, but I’ll hopefully have a post up in a few days where I do that – I’ll try to remember to edit this comment with a link to the post when it’s up.)
It might be useful to have a look at Language models show human-like content effects on reasoning, which empirically tests for human-like incoherences / biases in LMs performing some logical reasoning tasks (twitter summary thread; video presentation).