Interpretability. If we somehow solve that, and keep it working as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely, knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc., we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, and by gradually increasing their strength as we ramp up the compute budget.)
Of course, there are many big challenges here. Could an agent implement or learn to deceive the interpretability mechanism? Somewhat tautologically, I’m going to say that if we solve interpretability, we have solved this problem. Interpretability still has value even if we can’t fully solve it under this strong definition, though.
Hard disagree—deception is behavior that is optimized for, and not necessarily a property of the agent itself.
Take for example CICERO, the Diplomacy AI. It never lies about its intentions, but when its intentions change, it backstabs other players anyway. If you had interpretability tools, you would not be able to see deception in CICERO. All you need to get deception is a false prediction of your own future behavior. I think this is true for humans to a certain extent. I also suspect this is what you get if you optimize away visible signs of deception while deception still has utility for the model.
I find the distinction between an agent’s behavior and the agent itself confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.
Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything, it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy, but I don’t think that strategy is unique in Diplomacy?
But if you want to apply the interpretability lens, Cicero is presumably building a world model and comparing plans, including potential future backstabs. I predict if we had full interpretability, you’d see Cicero evaluating backstabs and picking the best strategy, and you could extract the calculated EV to see how close it was to backstabbing on a given turn vs taking an honest move.
I don’t believe that it’s somehow not modeling its backstab options and just comes up with the backstab plan spontaneously, without ever having considered it. It would be a bad planner if it had not considered and weighed backstabs at the earliest opportunity.
So if all that holds, we could use interpretability to confirm that Cicero is an unreliable partner and should not be further empowered.
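To make that concrete, here’s a minimal sketch of the kind of probe I have in mind, assuming (and this is purely an assumption) that an interpretability tool could surface the planner’s own EV estimates for the candidate plans it considered on a turn. All names and numbers below are invented for illustration; this is not Cicero’s actual interface:

```python
from dataclasses import dataclass

@dataclass
class CandidatePlan:
    moves: list[str]          # e.g. ["A PAR - BUR"]
    expected_value: float     # the planner's own EV estimate, surfaced by the assumed tool
    betrays_ally: bool        # labeled by checking whether any move attacks a current ally

def backstab_margin(plans: list[CandidatePlan]) -> float:
    """How close the agent came to choosing a backstab this turn.

    Positive: the best honest plan beat the best backstab by this much.
    Negative: a backstab was actually the top-EV plan.
    """
    honest = [p.expected_value for p in plans if not p.betrays_ally]
    betray = [p.expected_value for p in plans if p.betrays_ally]
    if not betray:
        return float("inf")   # no backstab was even considered
    if not honest:
        return float("-inf")  # every plan considered was a backstab
    return max(honest) - max(betray)

# A margin hovering near zero over several turns reads as "cooperative, but only barely".
plans = [
    CandidatePlan(moves=["A PAR - BUR"], expected_value=0.42, betrays_ally=False),
    CandidatePlan(moves=["A PAR - GAS", "F BRE - MAO"], expected_value=0.41, betrays_ally=True),
]
print(backstab_margin(plans))  # -> ~0.01
```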
I think it is much more interesting to look at agents in environments where long-term iterated cooperation is a valid strategy though.
I mean that deception doesn’t need any recognizable architecture to occur. There doesn’t need to be a deception module or a deception neuron that can be detected, even with perfect interpretability tools. Instead, deception is a behavior that arises from an agent interacting with the environment and other agents. Examples include telling strategic falsehoods (even if you believe them), not keeping your promises (even if you meant them when you made them), etc. In a broad sense, I think you can define deception as “behaviors typical of agents that actively lie and misrepresent things to their benefit, whether or not the intent to do so actually exists.” It’s a bit circular, but I think it works.
Cicero models the world, but with unrealistically cooperative predictions of its own future behavior. It does this because long-term iterated cooperation is a valid strategy in Diplomacy. For a Cicero-level agent, deliberate lies require more cognitive capacity than just having a few heuristics that make your world model less accurate but your communications more convincing to other agents. I suspect this may be true for more powerful agents as well, and it is partially true for humans.
(There is an argument that agents like these stop acting deceptively once taken out of their training environments since their heuristics lose coherence and they just act like honest agents with poor world models. I would say that this is true if we consider that modern humans are the training environment.)
And yes, Cicero is considering the EVs of its actions, including deceptive ones. When it sincerely says “I won’t backstab you in situation X” but then backstabs once situation X actually arises, it is in a sense a bad planner. But the bad planning is selected for because it results in more effective communication! This is probably also true for things like “malice” and “misunderstanding”.
I think this is a concern for current LLMs, since they are RLHF’d to be both truthful and high-PR. These goals are often mutually incompatible, so heuristics that damage the accuracy of their world models while letting them achieve both may be selected for. I can’t think of any concrete examples right now, though.
I don’t think interpretability tools are useless—in fact, I think they are essential. The problem is that we need much more than being able to find a “deception” neuron. We would need an interpretability tool that can say “this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive” without having to simulate the agent interacting with the world. I don’t think that is impossible, but it is probably very hard, much harder than finding a “deception” neuron. And as of right now, we can’t even find “deception” neurons with any accuracy.
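To make concrete what I mean about simulation: here is a toy sketch of the simulate-and-compare check that a purely static interpretability tool would need to match without actually running the agent. The self_model and policy callables and the state encoding are hypothetical stand-ins, not any real system’s API:

```python
from typing import Callable, Hashable

State = Hashable
Action = str

def self_deception_rate(
    states: list[State],
    self_model: Callable[[State], Action],  # the agent's stated prediction of its own action
    policy: Callable[[State], Action],      # what it actually does when the state is simulated
) -> float:
    """Fraction of probed states where the self-report and the actual policy disagree."""
    if not states:
        return 0.0
    mismatches = sum(1 for s in states if self_model(s) != policy(s))
    return mismatches / len(states)

# A high rate is "unintentional deception" in the sense above, even though no single
# component is doing anything we would recognize as lying.
```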
There doesn’t need to be a deception module or a deception neuron that can be detected
I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.
We would need an interpretability tool that can say “this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive” without having to simulate the agent interacting with the world. I don’t think that is impossible, but it is probably very hard, much harder than finding a “deception” neuron.
Right, I was gesturing towards the sort of interpretability where we inspect the agent’s world model (particularly its future predictions) and determine whether it matches the agent’s stated goals. (Though I’m not sure why we can’t simulate the agent’s interactions? Running an agent in a simulation seems like one of the better ways of detecting how it would behave in the hypothetical future scenarios it is trying to realize.) I suspect we agree substantively and it’s mostly a question of semantics, especially around what “deception” means. I’m not looking for a deception module; rather, I want to observe the thoughts and ruminations going on as an agent performs tasks and detect deceptive functional patterns.
So, for example, in the ELK paper: is the AI constructing a plan to steal the diamond and fool the security cameras? I believe deceptive ruminations would be detectable. If you could see the contents of the thoughts, you’d see a world model in which the agent is, e.g., stealing the diamond; a meta-process evaluating which of many cunning plans is most likely to succeed; and presumably, except in pathological cases, somewhere a value function / target world state being optimized for (e.g. “I have the diamond”), with this internal picture differing from the external claims (perhaps even some processes modeling the interlocutor and predicting which deceptive story would be most effective). These must all be in there somewhere, and therefore be interpretable.
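As a sketch of the kind of check I’m imagining (with the obvious caveat that the probe decoding internal beliefs is assumed to exist, and ELK is precisely the question of whether such a decoder can be built and trusted; all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class DecodedBelief:
    diamond_present: bool       # what the internal world model predicts for the end state
    cameras_show_diamond: bool  # what it predicts the sensors will report

def flag_deceptive_plan(belief: DecodedBelief, stated_claim_diamond_safe: bool) -> bool:
    """True if the agent internally expects the diamond to be gone while claiming
    (or arranging for the cameras to show) that it is safe."""
    return stated_claim_diamond_safe and not belief.diamond_present

belief = DecodedBelief(diamond_present=False, cameras_show_diamond=True)
print(flag_deceptive_plan(belief, stated_claim_diamond_safe=True))  # -> True
```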
Or, if Cicero is not stretching the analogy too far (I don’t think it has “thoughts”): is Cicero evaluating future juicy backstabs and guiding the game toward them, or myopically evaluating short-term moves and then backstabbing when it’s optimal? This is a question we should be able to answer one way or another with good enough interpretability tools.
I think you are more concerned with something like “unintentional deception”, which I think is quite different, and as you say comes from a lack of self-knowledge or inaccurate self-prediction. There is also a semantic grey area here: is everyone who is mistaken about their future behavior, or who simply claims X when !X obtains, unintentionally deceiving?
Semantics aside, I agree this unintentional case could be harder to detect, and raw interpretability doesn’t automatically solve it.
I think this points to an interesting dynamic—I suspect that the more capable at self-modeling an agent gets, the more likely any deception is to be intentional. To your concerns about LLMs, this seems to be mostly a problem of inadequate self-knowledge (or perhaps more fundamentally an artifact of their lack of a sense of self).
Unintentional deception would require the agent to in some sense fail at modeling itself. There are cases where this could occur even in a superintelligence (since you can never model yourself perfectly, due to the infinite recursion of modeling an agent that contains the same model), but it seems the problem with Cicero is just that it has a very weak self-model (if it even has one), and improving that self-model could be expected to remove the unintentional deception (replacing it with intentional deception, since deception of some sort is required to play Diplomacy optimally).
For example, take the case EY uses: “the AGI doesn’t care about you, it just wants your atoms for something else”. Do you see more risk from unintended deception (a sharp left turn by an agent that didn’t know it would defect if given enough power, perhaps) or from intended deception (an agent that simply lies about being aligned and manipulates its way to power)?
The trickiest cases would be where a normally honest/good agent with a strong self-model and sense of ethics fails to model itself in some specific situation and accidentally deceives (potentially even against its own values, self-image, and interests — humans do this, of course). Manchurian Candidate type triggers, for example. But note this is quite convoluted; the stronger the sense of self, morality, and world model, the better able and more incentivized the agent is to avoid deceiving itself.
Another interesting question: if an unintentionally deceptive agent is bad at modeling its future self, is it unthreatening in a complex environment like the physical world? Sure, Cicero can defeat humans at Diplomacy with its myopic world model, but in the real world I suspect being that myopic would lead to your bank account being immediately drained by hackers and scammers. Certainly you’d struggle to form partnerships with other agents if you end up stabbing them at the first opportunity. It’s all relative, but I view one game of Diplomacy as entailing only short-term cooperation (merely over a few hours); what about cooperating over months or years to build a company or other venture?
The question (an empirical one) is whether unintentional deception can persist broadly as agents and the environment get more sophisticated, or whether it gets selected against (and therefore bounded) by environmental complexity and adversarial competition. Charlie Munger would put forth that cognitive bias gives smart, self-aware investors a substantial edge over naive, self-myopic ones; extrapolate that forward and perhaps there isn’t much room for biased/erroneous world modeling if you seek to compete with other superintelligences, or even the smartest humans.
On to some object-level questions about your Cicero points:
When it sincerely says “I won’t backstab you in situation X”, but when it is actually put in situation X it backstabs
Is this actually confirmed? Does Cicero actually claim “I won’t stab you in <hypothetical scenario>”? Or does it just honestly report its current tactical/strategic goals, which later change? (“Will you support me in attacking here” means “I plan to attack here”, not “I want you to act as if I were attacking here, while I plan to attack there and stab you instead”.) I was under the impression it’s the latter: it’s just honestly reporting its plan.
Cicero models the world but with unrealistically cooperative predictions of its future behavior
Do we actually know this? I haven’t seen any sign that it’s modeling its own verbal behavior or the consequences of its communications (admittedly I haven’t gone deep on the architecture; I’d be interested to learn more). IIUC it has a strategic model that evaluates game positions and messages, and a separate (downstream) LLM that could be thought of as an epiphenomenon of the actual strategic reasoning. I don’t see any concrete proof that the strategic planner is modeling the impact of its communication on other players (i.e. “if I tell Bob X, he might do Y, which would lead to favorable strategic position P”). And the LM is more like a press secretary — all it gets is The Current Plan, not plan B or any hint of the juicy backstab that may or may not have been evaluated. It seems to me there is a weird split brain here that deeply impairs its ability to model itself, and so I am skeptical that it is actually meaningfully doing so.
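To caricature the split-brain reading I’m describing (class and method names are invented for illustration; this is not the actual Cicero codebase, just the architecture as I understand it):

```python
class StrategicPlanner:
    """Scores candidate plans (including potential backstabs) on game value alone."""

    def evaluate(self, game_state, plan) -> float:
        # Stand-in for the real value model; returns a dummy score here.
        return 0.0

    def choose_plan(self, game_state, candidate_plans):
        # Note: nothing in here models how its *messages* will influence other players.
        return max(candidate_plans, key=lambda plan: self.evaluate(game_state, plan))


class PressSecretaryLM:
    """Only ever sees the single chosen plan."""

    def draft_message(self, current_plan) -> str:
        # Gets The Current Plan and nothing else: no plan B, no EVs, no record of
        # the backstabs that were considered and rejected upstream.
        return f"Here is what I intend to do this turn: {current_plan}"


# Under this reading, any "dishonesty" lives in the planner's plan churn from turn
# to turn, not in the language model, which has nothing to hide.
```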
So to summarize, I think there are open questions about the object-level properties of Cicero, which could be answered by better interpretability tools.
It’s possible that Cicero is simply an honest but myopic opportunist. In which case, it’s still not trustworthy. (As if we were ever going to trust such an obviously unsafe-by-construction entity with real power.)
It’s possible that the strategic model is deceptive: planning backstabs, planning to mislead others, and maybe even explicitly modeling its press secretary so as to provide the most spinnable plans. I doubt this, as it requires a lot more complexity and world modeling.
But I believe we could answer where Cicero sits on the spectrum between these two by actually observing the world models being generated as it evaluates a turn. So in this case, inasmuch as there is something similar to an inner alignment issue, it is fully detectable with adequate (strong) interpretability of the inner workings. (Again though, Cicero’s value function is so comically unviable as a basis for an AGI that I struggle to infer too much about inner alignment. We should be wary of these problems when building a Skynet military planner, I suppose, or any agent that needs to model extremely adversarial opponents and outcomes.)
I think the grandparent comment is pointing to the concept described in this post: that deceptiveness is a property of what we humans perceive of the world, not of what the model perceives of the world.