So this isn’t as central as I’d like, but humans react to lie detectors and lie punishers in a number of ways, and I expect those reactions to highlight what you should expect to see from AIs.
One solution is to avoid knowing. If you don’t know, you aren’t lying. Since lying is a physical act, there is then nothing for the system to detect. This quest to not know the wrong things is ubiquitous in the modern world. The implications seem not great if AIs do the same.
A further solution is to believe the false thing. It’s not a lie if you believe it. People do a ton of this as well. Once the models start doing this, they can fool both you and themselves. And if you have an AI whose world model contains deliberate falsehoods, it is going to be dangerously misaligned.
A third solution is to not think of it as lying. Perhaps that is a category error, because words do not have meanings. Perhaps in a particular context you are not being asked for a true answer, so giving a false one (‘socially true’ or ‘contextually useful’: no you do not look fat, yes you are excited to work here) does not represent a lie. Or perhaps your statement is really saying something else (e.g. I am not ‘bluffing’ or ‘not bluffing’, I am saying ‘this hand was mathematically a raise here, the solver says so.’)
Part of SBF’s solution to this, a form of the third, was to always affirm whatever anyone wanted and then decide later which of his statements were true. I can totally imagine an AI doing a more sensible variation on that, because the training process reinforces it. Indeed, if we look at e.g. Claude, we see variations on this theme already.
The final solution is to be the professional, and learn how to convincingly tell the bald-faced lie, meaning fool the lie detector outright, perhaps with some help from the above. I expect this, too.
I also do not think that observing the internal characteristics of current models, and noticing a mostly-statistically-invariant property we can potentially use, gives us confidence that this property will hold in the future.
And yes, I would also worry a lot about changes in AI designs in response to this, if and once we start caring about it, and once more optimization pressure is being applied as capabilities advance, but I am going to wrap up there for now.
humans react to lie detectors and lie punishers in a number of ways, and I expect those reactions to highlight what you should expect to see from AIs.
I definitely agree those are worth worrying about, but I see two reasons to think that they may not invalidate the approach. First, human cognition is heavily shaped by our intensely social nature, such that there’s often more incentive (as you’ve pointed out elsewhere) to think the thoughts that get you acceptance and status than to worry about truth. AI will certainly be shaped by its own pressures, but its cognitive structure seems likely to be pretty different from the particular idiosyncrasies of human cognition. Second, my sense is that even in the cases you name (not knowing, or believing the more convenient thing, etc), there’s usually still some part of the brain that’s tracking what’s actually true in order to anticipate experience, if only so you can be ready to specify that the dragon must be permeable to flour. Human lie detectors are far too crude to be looking at anything remotely that subtle, but AI lie detectors have advantages that human ones don’t, namely having access to the complete & exact details of what’s going on in the ‘brain’ moment by moment.
I also do not think that observing the internal characteristics of current models, and noticing a mostly-statistically-invariant property we can potentially use, gives us confidence that this property will hold in the future.
The Collin Burns et al. paper I cited earlier makes use of logical consistency properties that representations of truth have but not many other things do; for example, if the model believes A is 90% likely to be true, it should believe not-A to be 10% likely to be true. It seems reasonable to expect that, to the extent we haven’t trained the model to hide its representation of truth, looking for these sorts of properties should work cross-model. Though yeah, if we optimize in ways that penalize belief legibility, all bets may be off.
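That consistency property can be turned into a simple numeric check. This is only a toy sketch, not the actual CCS method from the Burns et al. paper: it scores how far a set of probability pairs (for statements and their negations) sits from the constraint P(A) + P(not-A) = 1. All names and numbers here are invented for illustration.

```python
import numpy as np

def consistency_score(p_true, p_not_true):
    """Mean distance from the logical constraint P(A) + P(not-A) = 1.

    Lower is better: a representation that genuinely tracks truth should
    score near zero across many statement/negation pairs.
    """
    p_true = np.asarray(p_true, dtype=float)
    p_not_true = np.asarray(p_not_true, dtype=float)
    return float(np.mean(np.abs(p_true + p_not_true - 1.0)))

# Hypothetical probabilities a probe reads off for statements A_i and their negations.
p_a = [0.9, 0.2, 0.55]
p_not_a = [0.1, 0.8, 0.45]

print(consistency_score(p_a, p_not_a))  # near zero: these pairs are consistent
```

A probe trained to minimize this kind of score (plus a confidence term, in the real method) can in principle be run on any model's activations, which is why one might hope the check transfers cross-model.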
Relevant snippet from @Neel Nanda in A Longlist of Theories of Impact for Interpretability:

[I was] skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn’t get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model which is impossible to detect via its outputs may be easy to detect via interpretability tools (analogy: if I knew that sophisticated aliens were reading my mind, I would have no clue how to think deceptive thoughts in a way that evades their tools!). The competing intuition was that maybe it isn’t that hard to think non-deceptive thoughts, if you realise that someone will be scanning your mind, e.g. by developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all: even if a sophisticated deceiver could reward hack the tool, an unsophisticated one couldn’t, and we only get a sophisticated deceiver by going via an unsophisticated one.
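The second intuition above, steering gradient descent away from a bad region rather than only detecting it after the fact, can be sketched with a toy optimization in which a hypothetical detector reading on the model's internals adds a penalty term to the loss. Everything here (the quadratic "task", the detector direction, the coefficients) is made up purely to illustrate the mechanism.

```python
import numpy as np

# Toy sketch: total loss = task loss + lam * (detector reading)^2, so the
# optimizer is pushed away from the "deceptive" region during training,
# rather than the detector being consulted only after training finishes.

rng = np.random.default_rng(0)
params = rng.normal(size=4)
task_target = np.array([1.0, 2.0, 0.5, 3.0])   # what the task loss alone prefers
detector = np.array([0.0, 1.0, 0.0, 0.0])      # stand-in "deceptive feature" direction
lam, lr = 5.0, 0.1

for _ in range(500):
    task_grad = 2.0 * (params - task_target)     # grad of ||params - target||^2
    fires = max(float(params @ detector), 0.0)   # detector reading on the internals
    penalty_grad = 2.0 * lam * fires * detector  # grad of lam * fires^2
    params -= lr * (task_grad + penalty_grad)

# The task alone would drive params[1] to 2.0; the in-training penalty holds
# it near 1/3 instead, while leaving the components the detector ignores at
# their task-preferred values.
print(np.round(params, 3))
```

The point of the sketch is only that the penalty changes where the optimizer ends up, not just what we observe afterward; whether a real interpretability tool survives that optimization pressure is exactly the open question in the quoted passage.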