I’ve been thinking about this approach further and making a greater effort to apply security mindset to the question of whether it could fail due to optimization pressure. I think the significantly dangerous kind of optimization pressure shows up if this can be used as an objective during the training stage of a particular model, i.e. if we directly trained for honesty by catching lies with these sorts of mechanistic interpretability approaches; that would provide strong pressure both toward models that don’t lie and toward models that lie without getting caught.
It seems unclear to me whether that could be done. I think there are two key questions here, and either one could prove true or false. First, this is a property of the internal activations of a model, discovered in a particular model by observing those activations at inference time; it’s not clear to me that those activation patterns would be stable enough throughout training to work as a reasonable training-time objective (this is very handwavy and intuitive). Second, given that the model’s beliefs about the truth of its input and/or output are at least in theory a concrete, detectable, physical phenomenon, it’s not clear whether exposing them could in practice be optimized against; there may or may not be any options plausibly reachable by gradient descent for concealing them. It’s clearly possible in principle for internal representations of truth to be concealed, e.g. steganographically, such that we couldn’t detect them unless we had a full understanding of everything the model thought and believed (which is to say: if we’d already achieved a win condition). The first question is empirically testable (and there may be existing research on it that I haven’t seen—something in the neighborhood of model robustness to missing nodes); I’m interested in investigating it further, along the lines of the sketch below. The second seems very hard to get further on than ‘it hasn’t shown up yet’.
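As a very rough sketch of what testing that first question could look like (an assumption-laden toy, not an existing experiment: synthetic activations with a planted ‘truth direction’ stand in for real hidden states collected at two training checkpoints, so only the measurement procedure itself is meaningful):

```python
# Sketch: does a "truth direction" found at one training checkpoint still
# work at a later checkpoint? Real activations would come from running the
# checkpoints on labeled true/false statements; synthetic data stands in
# here so the procedure is runnable on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512   # hidden-state dimensionality (assumed)
n = 2000  # number of labeled true/false statements (assumed)

def fake_activations(truth_direction, noise=1.0):
    """Stand-in for 'collect hidden states on labeled statements'.
    Truth is encoded as a +/- offset along truth_direction."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d)) * noise
    acts += np.outer(2 * labels - 1, truth_direction)
    return acts, labels

# Checkpoints A and B share a mostly-overlapping truth direction; how much
# it actually drifts during training is exactly the empirical question.
dir_a = rng.normal(size=d); dir_a /= np.linalg.norm(dir_a)
dir_b = dir_a + rng.normal(size=d) * 0.3; dir_b /= np.linalg.norm(dir_b)

acts_a, labels_a = fake_activations(dir_a)
acts_b, labels_b = fake_activations(dir_b)

# Fit a linear probe on checkpoint A's activations...
probe = LogisticRegression(max_iter=1000).fit(acts_a, labels_a)

# ...and ask how well it transfers to checkpoint B without refitting.
print("accuracy on A:", probe.score(acts_a, labels_a))
print("transfer accuracy on B:", probe.score(acts_b, labels_b))
```

If transfer accuracy stayed high across real checkpoints, that would be weak evidence the property is stable enough to be usable; if it collapsed, the training-time-objective idea is in trouble before we even get to deliberate concealment.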
So in the least convenient possible world I agree with you that this would only be a temporary fix, and I’m not sure whether or not we’re in that world.
So this isn’t as central as I’d like, but there are a number of ways that humans react to lie detectors and lie punishers which I expect highlight the kinds of things you would see here.
One solution is to avoid knowing. If you don’t know, you aren’t lying. Since lying is a physical thing, the system won’t then detect it. This is ubiquitous in the modern world, the quest to not know the wrong things. The implications seem not great if this happens.
A further solution is to believe the false thing. It’s not a lie if you believe it. People do a ton of this as well. Once the models start doing this, they can both fool you and fool themselves. And if you have an AI whose world model contains deliberate falsehoods, then it is going to be dangerously misaligned.
A third solution is to not think of it as lying: because that’s a category error and words do not have meanings; or because in a particular context you are not being asked for a true answer, so giving a false one (‘socially true’ or ‘contextually useful’: no you do not look fat, yes you are excited to work here) does not represent a lie; or because your statement is really saying something else (e.g. I am not ‘bluffing’ or ‘not bluffing’, I am saying ‘this hand was mathematically a raise here, the solver says so’).
Part of SBF’s solution to this, a form of the third, was to always affirm whatever anyone wanted and then decide later which of his statements were true. I can totally imagine an AI doing a more sensible variation on that because the system reinforces that. Indeed, if we look at e.g. Claude, we see variations on this theme already.
The final solution is to be the professional, and learn how to tell a convincing bald-faced lie, meaning fool the lie detector outright, perhaps with some help from the above. I expect this, too.
I also do not think that observing the internal characteristics of current models, and noticing a mostly-statistically-invariant property we can potentially use, gives us confidence that this property will hold in the future.
And yes, I would worry a lot about changes in AI designs in response to this as well, if and once we start caring about it, and once there is generally more optimization pressure being applied as capabilities advance, etc., but I am going to wrap up there for now.
there are a number of ways that humans react to lie detectors and lie punishers which I expect highlight the kinds of things you would see here.
I definitely agree those are worth worrying about, but I see two reasons to think that they may not invalidate the approach. First, human cognition is heavily shaped by our intensely social nature, such that there’s often more incentive (as you’ve pointed out elsewhere) to think the thoughts that get you acceptance and status than to worry about truth. AI will certainly be shaped by its own pressures, but its cognitive structure seems likely to be pretty different from the particular idiosyncrasies of human cognition. Second, my sense is that even in the cases you name (not knowing, or believing the more convenient thing, etc), there’s usually still some part of the brain that’s tracking what’s actually true in order to anticipate experience, if only so you can be ready to specify that the dragon must be permeable to flour. Human lie detectors are far too crude to be looking at anything remotely that subtle, but AI lie detectors have advantages that human ones don’t, namely having access to the complete & exact details of what’s going on in the ‘brain’ moment by moment.
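To illustrate that last point concretely (a toy only: the tiny network below is a stand-in for a real model, and the hook is just standard PyTorch machinery for reading intermediate activations, not a lie detector):

```python
# Sketch: unlike a polygraph, an "AI lie detector" can read the exact
# internal state at any layer, on every forward pass, via hooks. The same
# pattern applies to a real transformer's residual stream.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # full, exact intermediate state
    return hook

# Attach to the middle of the network and run one forward pass.
model[1].register_forward_hook(save_activation("post_relu"))
_ = model(torch.randn(8, 16))
print(captured["post_relu"].shape)  # torch.Size([8, 32])
```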
I also do not think that observing the internal characteristics of current models, and noticing a mostly-statistically-invariant property we can potentially use, gives us confidence that this property will hold in the future.
The Collin Burns et al. paper I cited earlier makes use of logical consistency properties that representations of truth have but not many other things do; for example, if the model believes A is 90% likely to be true, it should believe not-A to be 10% likely to be true. It seems reasonable to expect that, to the extent we haven’t trained the model to hide its representation of truth, looking for these sorts of properties should work cross-model. Though yeah, if we optimize in ways that penalize belief legibility, all bets may be off.
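For concreteness, that consistency property is what Burns et al.’s method (CCS) turns into an unsupervised objective: find a probe whose outputs on a statement and its negation behave like p and 1−p, with an extra term to rule out the degenerate always-0.5 answer. A minimal sketch, with random tensors standing in for real contrast-pair activations and a training loop that is my simplification rather than the paper’s code:

```python
# Sketch of the Contrast-Consistent Search (CCS) idea from Burns et al.:
# learn a probe, with no truth labels, whose outputs on a statement and
# its negation sum to roughly one. Activations here are placeholders;
# real ones would come from the model on contrast pairs.
import torch

d = 512                       # hidden-state dimensionality (assumed)
n = 1000                      # number of (statement, negation) pairs (assumed)
acts_pos = torch.randn(n, d)  # activations on "X is true"-style prompts
acts_neg = torch.randn(n, d)  # activations on "X is false"-style prompts

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    p_pos = probe(acts_pos)
    p_neg = probe(acts_neg)
    # Consistency: P(A) + P(not-A) should be ~1.
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()
    # Confidence: discourage the degenerate p_pos = p_neg = 0.5 solution.
    confidence = torch.min(p_pos, p_neg).pow(2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned direction can then be read off the probe's linear layer and
# checked for the kind of cross-model transfer discussed above.
```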
Relevant snippet from @Neel Nanda in A Longlist of Theories of Impact for Interpretability:

[I was] skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn’t get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy—if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!). The competing intuition was that maybe it isn’t that hard to think non-deceptive thoughts, if you realise that someone will be scanning your mind, by eg developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all—even if a sophisticated deceiver could reward hack the tool, and an unsophisticated one couldn’t, and we only get a sophisticated deceiver by going via an unsophisticated one.