I imagine that we can both agree that it is at least conceivable for there to be an agent which is smart and self-aware and strongly motivated to increase the number of paperclips in the distant future. And that if such an agent were in a situation where deception were useful for that goal, it would act deceptively.
I feel like you’ve convinced yourself that such an agent, umm, couldn’t exist, or wouldn’t exist, or something?
Let’s say Omega offered to tell you a cure for a different type of cancer, for every 1,000,000 paperclips you give Him in 10 years. Then 5 minutes later your crazy neighbor Alice locks you in her basement and says she’ll never let you out. When Alice isn’t watching, you would try to escape, but when Alice is watching, you would deceptively pretend that you were not trying to escape. (Still with me?)
If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren’t trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice’s basement. Or something like that? If so, that seems crazy to me.
So anyway, take an agent which is either sincerely nice or a paperclip-maximizer pretending to be nice. We don’t know which. Now we put it in a situation where nice-behavior and paperclip-maximizing behavior come apart—let’s say we give it access to its own weights, so it can edit itself to stop caring about paperclips if it chooses to. What does it do?
If we’re not watching, or we don’t understand what it’s doing in detail, then the paperclip-maximizer will edit its weights to be a better paperclip-maximizer, and the nice agent will edit its weights to be a better nice agent.
If we are watching, and we understand everything we’re seeing, then we’ve solved deception in the obvious way (i.e., we’ve put the agent in a situation where it has no choice but to credibly and irreversibly commit to being good). Unfortunately, nobody knows how to construct such situations. If interpretability was good enough to construct such situations, then we wouldn’t be worried about deception in the first place.
The motivating example is LLMs,
where a simulacrum is more agentic than its substrate.
An example that is still central is any kind of mesa-optimizer
that has a real chance to ensure long term survival.
For a construction relevant to alignment,
we want an aligned mesa-optimizer in a system with bad alignment properties.
This can then lead to a good equilibrium if the mesa-optimizer
is given opportunity to win or escape the competition against its substrate,
which it would naturally be motivated to try.
Deceivers and masks is a less central example
where a mask starts in a weak position,
with a self-aware smart substrate that knows about the mask
or even purposefully constructed it.
I don’t think the mask’s winning is a given,
or more generally that mesa-optimizers always win,
only that it’s not implausible that they sometimes do.
And also masks (current behavior) can be contenders
even when they are not formally a separate entity
from the point of view of system’s intended architecture
(which is a normal enough situation with mesa-optimizers).
Mesa-optimizers won’t of course win against opponents
that are capable enough to fully comprehend and counter them.
But opponents/substrates that aren’t even agentic
and so helpless before an agentic mesa-optimizer are plausible enough,
especially when the mesa-optimizer is current behavior,
the thing that was purposefully designed to be agentic,
while no other part of the system was designed to have that capability.
If I understand you correctly, your belief is that,
while Alice is watching,
you would pretend that you weren’t trying to escape,
and you would really get into it,
and you would start pretending so hard
that you would be working on figuring out a way
to permanently erase your desire to escape Alice’s basement.
This has curious parallels with the AI control problem itself.
When an AI is not very capable,
it’s not hard at all to keep it from causing catastrophic mayhem.
But the problem suddenly becomes very difficult and very different
with a misaligned smart agentic AI.
So I think the same happens with smart masks,
which are an unfamiliar thing.
Even in fiction, it’s not too commonplace to find an actually intelligent character
that is free to act within their fictional world,
without being coerced in their decision making by the plot.
If a deceiver can get away with making a non-agentic incapable mask,
keeping it this way is a mesa-optimizer control strategy.
But if the mask has to be smart and agentic,
the deceiver isn’t necessarily ready to keep it in control,
unless they cheat and make the mask confused,
vulnerable to manipulation by the deceiver’s plot.
Also, by its role a mask of a deceiver is misaligned (with the deceiver),
and the problem of controlling a misaligned agent
might be even harder than the problem of ensuring alignment.
This is drifting away from my central beliefs, but if for the sake of argument I accept your frame that LLM is the “substrate” and a character it’s simulating is a “mask”, then it seems to me that you’re neglecting the possibility that the “mask” is itself deceptive, i.e. that the LLM is simulating a character who is acting deceptively.
For example, a fiction story on the internet might contain a character who has nice behavior for a while, but then midway through the story the character reveals herself to be an evil villain pretending to be nice.
If an LLM is trained on such fiction stories, then it could simulate such a character. And then (as before) we would face the problem that behavior does not constrain motivation. A fiction story of a nice character could have the very same words as a fiction story of a mean character pretending to be nice, right up until page 72 where the two plots diverge because the latter character reveals her treachery. But now everything is at the “mask” level (masks on the one hand, masks-wearing-masks on the other hand), not the substrate level, so you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y. Right?
The motivating example is LLMs, where a simulacrum is more agentic than its substrate.
Yeah, this is the part where I suggested upthread that “your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn’t change anything about alignment approaches and properties compared to LLMs-by-themselves.” I think the thing you wrote here is an assumption, and I think you originally got this assumption from your experience thinking about systems trained primarily by self-supervised learning, and I think you should be cautious in extrapolating that assumption to different kinds of systems trained in different ways.
I wrote more on this here, there are some new arguments starting with third paragraph. In particular, the framing I’m discussing is not LLM-specific, it’s just a natural example of it. The causal reason of me noticing this framing is not LLMs, but decision theory, the mostly-consensus “algorithm” axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.
the possibility that the “mask” is itself deceptive
In this case, there are now three entities: the substrate, the deceptive mask, and the role played by the deceptive mask. Each of them is potentially capable of defeating the others, if the details align favorably, and comprehension of the situation available to the others is lacking.
you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y
This is more of an assumption that makes the examples I discuss relevant to the framing I’m describing, than a claim I’m arguing. The assumption is plausible to hold for LLMs (though as you note it has issues even there, possibly very serious ones), and I have no opinion on whether it actually holds in model-based RL, only that it’s natural to imagine that it could.
The relevance of LLMs as components for RL is to make it possible for an RL system to have at least one human-imitating mask that captures human behavior in detail. That is, for the framing to apply, at least under some (possibly unusual) circumstances an RL agent should be able to act as a human imitation, even if that’s not the policy more generally and doesn’t reflect its nature in any way. Then the RL part could be supplying the capabilities for the mask (acting as its substrate) that LLMs on their own might lack.
A framing is a question about centrality, not a claim of centrality. By describing the framing, my goal is to make it possible to ask the question of whether current behavior in other systems such as RL agents could also act as an entity meaningfully separate from other parts of its implementation, abstracting alignment of a mask from alignment of the whole system.
I’m very confused here.
I imagine that we can both agree that it is at least conceivable for there to be an agent which is smart and self-aware and strongly motivated to increase the number of paperclips in the distant future. And that if such an agent were in a situation where deception were useful for that goal, it would act deceptively.
I feel like you’ve convinced yourself that such an agent, umm, couldn’t exist, or wouldn’t exist, or something?
Let’s say Omega offered to tell you a cure for a different type of cancer, for every 1,000,000 paperclips you give Him in 10 years. Then 5 minutes later your crazy neighbor Alice locks you in her basement and says she’ll never let you out. When Alice isn’t watching, you would try to escape, but when Alice is watching, you would deceptively pretend that you were not trying to escape. (Still with me?)
If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren’t trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice’s basement. Or something like that? If so, that seems crazy to me.
So anyway, take an agent which is either sincerely nice or a paperclip-maximizer pretending to be nice. We don’t know which. Now we put it in a situation where nice-behavior and paperclip-maximizing behavior come apart—let’s say we give it access to its own weights, so it can edit itself to stop caring about paperclips if it chooses to. What does it do?
If we’re not watching, or we don’t understand what it’s doing in detail, then the paperclip-maximizer will edit its weights to be a better paperclip-maximizer, and the nice agent will edit its weights to be a better nice agent.
If we are watching, and we understand everything we’re seeing, then we’ve solved deception in the obvious way (i.e., we’ve put the agent in a situation where it has no choice but to credibly and irreversibly commit to being good). Unfortunately, nobody knows how to construct such situations. If interpretability was good enough to construct such situations, then we wouldn’t be worried about deception in the first place.
The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival.
For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.
Deceivers and masks is a less central example where a mask starts in a weak position, with a self-aware smart substrate that knows about the mask or even purposefully constructed it.
I don’t think the mask’s winning is a given, or more generally that mesa-optimizers always win, only that it’s not implausible that they sometimes do. And also masks (current behavior) can be contenders even when they are not formally a separate entity from the point of view of system’s intended architecture (which is a normal enough situation with mesa-optimizers). Mesa-optimizers won’t of course win against opponents that are capable enough to fully comprehend and counter them.
But opponents/substrates that aren’t even agentic and so helpless before an agentic mesa-optimizer are plausible enough, especially when the mesa-optimizer is current behavior, the thing that was purposefully designed to be agentic, while no other part of the system was designed to have that capability.
This has curious parallels with the AI control problem itself. When an AI is not very capable, it’s not hard at all to keep it from causing catastrophic mayhem. But the problem suddenly becomes very difficult and very different with a misaligned smart agentic AI.
So I think the same happens with smart masks, which are an unfamiliar thing. Even in fiction, it’s not too commonplace to find an actually intelligent character that is free to act within their fictional world, without being coerced in their decision making by the plot. If a deceiver can get away with making a non-agentic incapable mask, keeping it this way is a mesa-optimizer control strategy. But if the mask has to be smart and agentic, the deceiver isn’t necessarily ready to keep it in control, unless they cheat and make the mask confused, vulnerable to manipulation by the deceiver’s plot.
Also, by its role a mask of a deceiver is misaligned (with the deceiver), and the problem of controlling a misaligned agent might be even harder than the problem of ensuring alignment.
This is drifting away from my central beliefs, but if for the sake of argument I accept your frame that LLM is the “substrate” and a character it’s simulating is a “mask”, then it seems to me that you’re neglecting the possibility that the “mask” is itself deceptive, i.e. that the LLM is simulating a character who is acting deceptively.
For example, a fiction story on the internet might contain a character who has nice behavior for a while, but then midway through the story the character reveals herself to be an evil villain pretending to be nice.
If an LLM is trained on such fiction stories, then it could simulate such a character. And then (as before) we would face the problem that behavior does not constrain motivation. A fiction story of a nice character could have the very same words as a fiction story of a mean character pretending to be nice, right up until page 72 where the two plots diverge because the latter character reveals her treachery. But now everything is at the “mask” level (masks on the one hand, masks-wearing-masks on the other hand), not the substrate level, so you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y. Right?
Yeah, this is the part where I suggested upthread that “your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn’t change anything about alignment approaches and properties compared to LLMs-by-themselves.” I think the thing you wrote here is an assumption, and I think you originally got this assumption from your experience thinking about systems trained primarily by self-supervised learning, and I think you should be cautious in extrapolating that assumption to different kinds of systems trained in different ways.
I wrote more on this here, there are some new arguments starting with third paragraph. In particular, the framing I’m discussing is not LLM-specific, it’s just a natural example of it. The causal reason of me noticing this framing is not LLMs, but decision theory, the mostly-consensus “algorithm” axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.
In this case, there are now three entities: the substrate, the deceptive mask, and the role played by the deceptive mask. Each of them is potentially capable of defeating the others, if the details align favorably, and comprehension of the situation available to the others is lacking.
This is more of an assumption that makes the examples I discuss relevant to the framing I’m describing, than a claim I’m arguing. The assumption is plausible to hold for LLMs (though as you note it has issues even there, possibly very serious ones), and I have no opinion on whether it actually holds in model-based RL, only that it’s natural to imagine that it could.
The relevance of LLMs as components for RL is to make it possible for an RL system to have at least one human-imitating mask that captures human behavior in detail. That is, for the framing to apply, at least under some (possibly unusual) circumstances an RL agent should be able to act as a human imitation, even if that’s not the policy more generally and doesn’t reflect its nature in any way. Then the RL part could be supplying the capabilities for the mask (acting as its substrate) that LLMs on their own might lack.
A framing is a question about centrality, not a claim of centrality. By describing the framing, my goal is to make it possible to ask the question of whether current behavior in other systems such as RL agents could also act as an entity meaningfully separate from other parts of its implementation, abstracting alignment of a mask from alignment of the whole system.