Moving a comment thread over from Substack by request (this thread, then this one).
Summary with excerpts:
Me:
A mechanistic interpretability approach to lie detection seems like it can be more robust against failure.
Zvi:
I do think there is hope that, if we had one static LLM and we worked on it for long enough, we could get a lie detector, even a good lie detector, or even that it would transfer well if we designed it to do that. However, I also think that every time we use that, we are applying optimization pressure in a global sense towards making the lie detector stop working, and I don’t see a way to get around this in practice.
Also I worry that a lie detector is insufficient while looking to others like it is sufficient—it is helpful, perhaps, but the point of the Diplomacy anecdote is that you still die.
Me:
Caveat: I’m literally two days into focusing on safety research; a month from now this may seem totally naive.
At a longer timescale and in the broadest sense, I absolutely agree that any lie detector provides optimization pressure toward successful lying. If every time we catch a model lying we delete it and train another one, we’ll eventually end up with a good liar.
But I’d draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much less problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that’s nearby in design space.
If we’re detecting lies based on the output of the model (as in this paper), it’s easy for the model to notice correlations between what it says and whether it gets caught, and to then (assuming non-fixed weights & long-term memory) change its behavior to avoid getting caught. But without being able to introspect on its own activations, it common-sensically seems much harder for the model to notice or alter the abstract relationship between its “thinking” and whether it gets caught.
And importantly, “is it lying” seems like an actual physical phenomenon that can be detected rather than something that has to be measured by proxy, since lying requires the intention to lie (‘the model believes X and outputs that it believes ~X’), and that’ll be physically represented in the model’s activation patterns, hopefully in a way that’s not too holographic / global. There’s an interesting paper from the beginning of the year (https://arxiv.org/abs/2212.03827) that provides at least some evidence that this is a concrete, measurable phenomenon. If that’s correct, then lie detection should mostly evade Goodhart’s law.
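To make that concrete, here’s a minimal sketch of the check I have in mind, assuming we already have a probe fit to the model’s hidden states (for instance via the method in the paper linked above) and can capture the activation vector from the forward pass that produced the answer; the probe, the captured state, and the threshold are all hypothetical placeholders rather than anything from the paper:

```python
# Hypothetical sketch: flag outputs where the model asserts something that a probe
# over its own activations says it internally treats as false ("believes X, outputs ~X").
# `probe_weights` / `probe_bias` are assumed to come from some already-fit linear probe;
# `hidden_state` is the activation vector captured while the model produced its answer.

import numpy as np

def believed_probability(probe_weights: np.ndarray, probe_bias: float,
                         hidden_state: np.ndarray) -> float:
    """Probe's estimate of how strongly the model internally treats its stated answer as true."""
    return float(1.0 / (1.0 + np.exp(-(probe_weights @ hidden_state + probe_bias))))

def looks_like_a_lie(probe_weights: np.ndarray, probe_bias: float,
                     hidden_state: np.ndarray, threshold: float = 0.2) -> bool:
    """Flag answers that the probe says the model itself assigns low credence to."""
    return believed_probability(probe_weights, probe_bias, hidden_state) < threshold
```

The important feature is that the check runs on the model’s internals rather than on the text of its output.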
This potentially fails if people are foolish enough to give models direct realtime access to their own weights & activation patterns (though having access to them doesn’t immediately mean being able to understand them). But I’m hopeful that if it’s recognized that a) safety is an issue and b) MI lie detection works but would be sabotaged by giving models direct introspection of their internal state, that particular foolishness can be limited by social and possibly regulatory pressure, since I don’t see very strong incentives in the opposite direction.
I wouldn’t claim that robust, accurate lie detection is entirely sufficient on its own to make AI safe or aligned, but I think that it puts us in a MUCH better position, because many or most catastrophic failure modes involve the AI being deceptive.
Critique would be much valued here! If I’m missing a reason why this approach wouldn’t work, explaining the problem now would let me focus my research in more productive directions.
Zvi: Let’s move this to LW.
I’ve been thinking about this approach further, and making a greater effort to apply security mindset to whether it could fail due to optimization pressure. I think the significantly dangerous kind of optimization pressure arises if this can be used as an objective during the training stage of a particular model, i.e. if we directly trained for honesty by catching lies with these sorts of mechanistic interpretability approaches; that would provide strong pressure both toward models that don’t lie and toward models that lie without getting caught.
It seems unclear to me whether that could be done. I think there are two key questions here, and each could go either way. First, this is a property of a model’s internal activations, and it’s discovered in a particular model by observing those activations at inference time; it’s not clear to me that those activation patterns would be stable enough throughout training to work as a reasonable training-time objective (this is very handwavy and intuitive). Second, given that the model’s beliefs about the truth of its input and/or output are at least in theory a concrete, detectable, physical phenomenon, it’s not clear whether or not exposing them could in practice be optimized against. There may or may not be any options plausibly reachable by gradient descent for concealing them. It’s clearly possible in principle for internal representations of truth to be concealed, e.g. steganographically, such that we couldn’t detect them unless we had a full understanding of everything the model thought and believed (which is to say: unless we’d already achieved a win condition). The first question is empirically testable (and there may be existing research on it that I haven’t seen—something in the neighborhood of model robustness to missing nodes); I’m interested in investigating it further. The second seems very hard to get further on than ‘it hasn’t shown up yet’.
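On the first question, here’s a minimal sketch of the kind of empirical test I have in mind, assuming access to a sequence of training checkpoints and a small labeled set of true/false statements; the activation-extraction step is just a stub, since it’s model-specific, and all the names here are illustrative rather than from any existing codebase:

```python
# Hypothetical sketch: does a truth/lie probe fit on an early checkpoint still work
# on later checkpoints, or does the relevant activation structure drift during training?

import numpy as np
from sklearn.linear_model import LogisticRegression

def get_activations(checkpoint_path: str, statements: list[str]) -> np.ndarray:
    """Placeholder: return an (n_statements, d_model) array of hidden states, one per
    statement, extracted from the model saved at `checkpoint_path`."""
    raise NotImplementedError("swap in model-specific activation extraction here")

def probe_stability(checkpoints: list[str], statements: list[str], labels: np.ndarray) -> None:
    """Fit a linear probe on the earliest checkpoint, then check how well that same
    probe separates true from false statements at each later checkpoint."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(get_activations(checkpoints[0], statements), labels)
    for ckpt in checkpoints[1:]:
        acc = probe.score(get_activations(ckpt, statements), labels)
        # Accuracy decaying toward 0.5 would suggest the pattern isn't stable enough
        # to serve as a training-time objective.
        print(f"{ckpt}: probe accuracy = {acc:.3f}")
```

If the probe keeps working across checkpoints, that’s at least weak evidence the first question resolves favorably.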
So in the least convenient possible world I agree with you that this would only be a temporary fix, and I’m not sure whether or not we’re in that world.
So this isn’t as central as I’d like, but there are a number of ways that humans react to lie detectors and lie punishers that I expect highlight things you would expect to see.
One solution is to avoid knowing. If you don’t know, you aren’t lying. Since lying is a physical thing, the system won’t then detect it. This is ubiquitous in the modern world, the quest to not know the wrong things. The implications seem not great if this happens.
A further solution is to believe the false thing. It’s not a lie, if you believe it. People do a ton of this, as well. Once the models start doing this, they both can fool you, and also they fool themselves. And if you have an AI whose world model contains deliberate falsehoods, then it is going to be dangerously misaligned.
A third solution is to not think of it as lying, because that’s a category error, words do not have meanings, or that in a particular context you are not being asked for a true answer so giving a false (‘socially true’ or ‘contextually useful’, no you do not look fat, yes you are excited to work here) one does not represent a lie, or that your statement is saying something else (e.g. I am not ‘bluffing’ or ‘not bluffing’, I am saying that ‘this hand was mathematically a raise here, solver says so.’)
Part of SBF’s solution to this, a form of the third, was to always affirm whatever anyone wanted and then decide later which of his statements were true. I can totally imagine an AI doing a more sensible variation on that because the system reinforces that. Indeed, if we look at e.g. Claude, we see variations on this theme already.
The final solution is to be the professional, and learn how to convincingly bald-face lie, meaning fool the lie detector outright, perhaps through some help from the above. I expect this, too.
I also do not think that if we observe the internal characteristics of current models, and notice a mostly-statistically-invariant property we can potentially use, that this gives us confidence that this property holds in the future?
And yes, I would worry a lot about changes in AI designs in response to this as well, if and once we start caring about it, once there is generally more optimization pressure being used as capabilities advance, etc, but going to wrap up there for now.
there are a number of ways that humans react to lie detectors and lie punishers that I expect highlight things you would expect to see.
I definitely agree those are worth worrying about, but I see two reasons to think that they may not invalidate the approach. First, human cognition is heavily shaped by our intensely social nature, such that there’s often more incentive (as you’ve pointed out elsewhere) to think the thoughts that get you acceptance and status than to worry about truth. AI will certainly be shaped by its own pressures, but its cognitive structure seems likely to be pretty different from the particular idiosyncrasies of human cognition. Second, my sense is that even in the cases you name (not knowing, or believing the more convenient thing, etc), there’s usually still some part of the brain that’s tracking what’s actually true in order to anticipate experience, if only so you can be ready to specify that the dragon must be permeable to flour. Human lie detectors are far too crude to be looking at anything remotely that subtle, but AI lie detectors have advantages that human ones don’t, namely having access to the complete & exact details of what’s going on in the ‘brain’ moment by moment.
I also do not think that if we observe the internal characteristics of current models, and notice a mostly-statistically-invariant property we can potentially use, that this gives us confidence that this property holds in the future?
The Collin Burns et al. paper I cited earlier makes use of logical consistency properties that representations of truth have but not many other things do; for example, if the model believes A is 90% likely to be true, it should believe not-A to be 10% likely to be true. It seems reasonable to expect that, to the extent we haven’t trained the model to hide its representation of truth, looking for these sorts of properties should work cross-model. Though yeah, if we optimize in ways that penalize belief legibility, all bets may be off.
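For concreteness, here’s a paraphrased sketch of that consistency idea as an unsupervised probe objective, roughly in the spirit of the Burns et al. method as I understand it; normalization, prompt construction, and other details are simplified or omitted, and the code is illustrative rather than the paper’s implementation:

```python
# Hypothetical sketch: learn a probe over hidden states such that a statement and its
# negation get probabilities that sum to ~1, without using any truth labels.

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Probability the probe assigns to "this statement is true".
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def consistency_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: P(statement) and P(its negation) should sum to ~1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate "everything is 0.5" solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def fit_probe(h_pos: torch.Tensor, h_neg: torch.Tensor, steps: int = 1000) -> LinearProbe:
    """h_pos / h_neg: (n_pairs, d_model) hidden states for each statement and its negation."""
    probe = LinearProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = consistency_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

Since the objective relies only on the logical relationship between a statement and its negation, nothing about it is specific to one model, which is why it seems plausible that it could work cross-model absent training pressure against it.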
Relevant snippet from @Neel Nanda in A Longlist of Theories of Impact for Interpretability:
[I was] skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn’t get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy—if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!). The competing intuition was that maybe it isn’t that hard to think non-deceptive thoughts, if you realise that someone will be scanning your mind, by eg developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all—even if a sophisticated deceiver could reward hack the tool, an unsophisticated one couldn’t, and we only get a sophisticated deceiver by going via an unsophisticated one.
I just noticed that the author of the lie detection paper I mentioned has written about its implications for alignment here on LW. @Collin thanks for writing that up. Any thoughts you have on the above would be welcome!
though having access to [their weights and activations] doesn’t immediately mean being able to understand them
On further consideration I don’t think this holds much weight. I was thinking mainly by analogy: humans, given access to the structure and activations of their own brains, would surely have a very hard time finding the correlates of the particular responses they output. And maybe this generalizes at least partway toward “no brain is able to understand itself in its full complexity and detail.”
But on the other hand, we should assume models have access to all published info about MI lie detection, and they may have a much easier time than humans at, e.g., running statistics against all of their nodes to search for correlations.
I wasn’t really accounting for the latter point in my mental model of that. So in retrospect my position does depend on not giving models access to their own weights & activations.
But I’d draw a distinction here analogous to the difference between evolutionary pressure on humans to lie, vs cultural pressure on how much we lie; the effects of the former are usually too slow to matter much compared to effects of the latter. Crude selection on models is at least much less problematic than selection on the behavior of a given model, especially if your lie detection approach works well on everything that’s nearby in design space.
This was essentially the reason why the Sharp Left Turn argument was so bad: Humans + SGD are way faster at optimization than evolution, and there’s far less imbalance between the inner optimization power and the outer optimization power (usually at best 10-40x), and even then you can arguably remove the inner optimizer entirely.
Humans + SGD are way faster, can select directly over policies, and we can basically assign whatever ratio we like of outer optimization steps to inner optimization steps. Evolution simply can’t do that. There are other disanalogies, but this is one of the main disanalogies between evolution and us.