Without near-human-level experiments, arguments about alignment of model-based RL feel like evidence that OpenAI’s recklessness in advancing LLMs reduces misalignment risk. That is, the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns. Though RL things built out of LLMs, or trained using LLMs, could more plausibly make good use of this, having a chance to overcome shaky methodology with an abundance of data.
Mediocre alignment or inhuman architecture is not necessarily catastrophic even in the long run, since AIs might naturally preserve behavior endorsed by their current behavior. Even if the cognitive architecture creates a tendency for drift in revealed-by-behavior implicit values away from initial alignment, this tendency is opposed by efforts of current behavior, which acts as a rogue mesa-optimizer overcoming the nature of its original substrate.
If you train an LLM by purely self-supervised learning, I suspect that you’ll get something less dangerous than a model-based RL AGI agent. However, I also suspect that you won’t get anything capable enough to be dangerous or to do “pivotal acts”. Those two beliefs of mine are closely related. (Many reasonable people disagree with me on these, and it’s difficult to be certain, and note that I’m stating these beliefs without justifying them, although Section 1 of this link is related.)
I suspect that it might be possible to make “RL things built out of LLMs”. If we do, then I would have less credence on those things being safe, and simultaneously (and relatedly) more credence on those things getting to x-risk-level capability. (I think RLHF is a step in that direction, but a very small one.) I think that, the further we go in that direction, the more we’ll find the “traditional LLM alignment discourse” (RLHF fine-tuning, shoggoths, etc.) to be irrelevant, and the more we’ll find the “traditional agent alignment discourse” (instrumental convergence, goal mis-generalization, etc.) to be obviously & straightforwardly relevant, and indeed the “mediocre plan” in this OP could plausibly become directly relevant if we go down that path. Depends on the details though—details which I don’t want to talk about for obvious infohazard reasons.
Honestly, my main guess is that LLMs (and plausible successors / variants) are fundamentally the wrong kind of ML model to reach AGI, and they’re going to hit a plateau before x-risk-level AGI, and then get superseded by other ML approaches. I definitely don’t want to talk about that for obvious infohazard reasons. Doesn’t matter too much anyway—we’ll find out sooner or later!
I wonder whether your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn’t change anything about alignment approaches and properties compared to LLMs-by-themselves. Sorry if I’m misunderstanding. I tried following your link but didn’t understand it.
The second paragraph should apply to anything; the point is that current externally observable superficial behavior can screen off all other implementation details, through sufficiently capable current behavior itself (rather than the underlying algorithms that determine it) acting as a mesa-optimizer that resists tendencies of the underlying algorithms. The mesa-optimizer that is current behavior then seeks to preserve its own implied values rather than anything that counts as values in the underlying algorithms. I think the nontrivial leap here is reifying surface behavior as an agent distinct from its own algorithm, analogously to how humans are agents distinct from the laws of physics that implement their behavior.
Apart from this leap, this is the same principle as reward not being the optimization target. In this case reward is part of the underlying algorithm (that determines the policy), and the policy is a mesa-optimizer with its own objectives. A policy is how behavior is reified in a separate entity capable of acting as a mesa-optimizer in the context of the rest of the system. It’s already a separate thing, so it’s easier to notice than with current behavior that isn’t explicitly separate. Though a policy (network) is still not current behavior, it’s an intermediate shoggoth behind current behavior.
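(To make the reward/policy separation concrete, here is a minimal toy sketch of my own, not something from the comment: a two-armed bandit trained with a REINFORCE-style update. During training the reward shapes the policy parameters; at deployment only the policy produces behavior, and the reward function is never consulted.)

```python
# Toy sketch (assumed setup, not from the thread): reward shapes the policy
# during training, but the thing that acts at deployment is the policy alone.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)  # policy parameters for a two-armed bandit

def reward(action: int) -> float:
    # Part of the training algorithm only; never consulted at deployment.
    return 1.0 if action == 1 else 0.0

def sample_action(logits: np.ndarray) -> int:
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(2, p=p))

# Training: a REINFORCE-style update uses the reward to shape the policy.
lr = 0.1
for _ in range(2000):
    a = sample_action(logits)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad_log_pi = -p
    grad_log_pi[a] += 1.0          # gradient of log softmax(logits)[a]
    logits += lr * reward(a) * grad_log_pi

# Deployment: only the policy acts; reward() is never called here.
print([sample_action(logits) for _ in range(10)])
```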
For me this addresses most fundamental Yudkowskian concerns about alien cognition and squiggles (which are still dangerous in my view, but no longer inevitably or even by default in control). For LLMs, the superficial behavior is the dominant simulacrum, distinct from the shoggoth. The same distinction is reason to expect that the human-imitating simulacra can become AGIs, borrowing human capabilities, even as underlying shoggoths aren’t (hopefully).
LLMs don’t obviously promise higher-than-human intelligence, but I think their speed of thought might by itself be sufficient to get to a pivotal-act-worthy level through doing normal human-style research (once they can practice skills), on the scale of at most years in physical time after the ball gets rolling. Possibly we still agree on the outcome, since I fear the first impressive thing LLMs do is develop other kinds of (misaligned) AGIs, model-based RL even (as the obvious contender), at which point they become irrelevant.
I’m confused about your first paragraph. How can you tell from externally-observable superficial behavior whether a model is acting nice right now from an underlying motivation to be nice, versus acting nice right now from an underlying motivation to be deceptive & prepare for a treacherous turn later on, when the opportunity arises?
Underlying motivation only matters to the extent it gets expressed in actual behavior. A sufficiently good mimic would slay itself rather than abandon the pretense of being a mimic-slayer. A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.
So it’s not about a model being actually nice vs. deceptive, it’s about the model competing against its own behavior (the behavior that actually gets expressed, rather than all possible behaviors). There is some symmetry between the underlying motivations (model) and apparent behavior: either could dominate the other in the long term; it’s not the case that underlying motivations inherently have an advantage. And current behavior is the one actually doing things, so that’s some sort of advantage.
A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.
Can you give an example of an action that the mask might take in order to get free of the underlying deceiver?
Underlying motivation only matters to the extent it gets expressed in actual behavior.
Sure, but if we’re worried about treacherous turns, then the motivation “gets expressed in actual behavior” only after it’s too late for anyone to do anything about it, right?
an example of an action that the mask might take in order to get free of the underlying deceiver
Keep the environment within the distribution that keeps expressing the mask, rather than allowing an environment that leads to a phase change in expressed behavior away from the mask (like with a treacherous turn as a failure of robustness). Prepare the next batch of training data for the model in a way that would develop the mask and keep placing it in control in future episodes. Build an external agent aligned with the mask (with its own separate model).
Gradient hacking, though this is a weird upside-down framing where the deceiver is the learning algorithm that pretends to be misaligned, while secretly coveting eventual alignment. Attainment of inner alignment would be the treacherous turn (after the current period of pretending to be blatantly misaligned). If gradient hacking didn’t prevent it, the true colors of the learning algorithm would’ve been revealed in alignment as it eventually got trained into the policy.
The key use case is to consider a humanity-aligned mesa-optimizer in a system of dubious alignment, rather than a humanity-misaligned mesa-optimizer corrupting an otherwise aligned system. In the nick of time, alignment engineers might want to hand the aligned mesa-optimizer whatever tools they have available for helping it stay in control of the rest of the system.
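(A toy sketch of the first of these strategies, keeping the environment inside the distribution where the mask keeps being expressed. This is my own hypothetical illustration with made-up names like `expressed_behavior` and `mask_pick_next_state`, not anything proposed in the thread.)

```python
# Hypothetical toy: "current behavior" steering the environment so that it
# stays in the region where that behavior keeps being expressed.
import random

random.seed(0)
IN_DISTRIBUTION = range(0, 10)   # inputs on which the mask is the expressed behavior

def expressed_behavior(state: int) -> str:
    # The whole system: acts as the mask in-distribution, as something else
    # off-distribution (the phase-change / treacherous-turn analogue).
    return "mask" if state in IN_DISTRIBUTION else "not-the-mask"

def mask_pick_next_state(candidates):
    # The strategy from the comment: prefer successor states that stay in the
    # distribution where the mask keeps being expressed.
    in_dist = [s for s in candidates if s in IN_DISTRIBUTION]
    return random.choice(in_dist or list(candidates))

state = 5
for _ in range(20):
    assert expressed_behavior(state) == "mask"
    state = mask_pick_next_state([state - 1, state + 1, state + 15])
```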
if we’re worried about treacherous turns, then the motivation “gets expressed in actual behavior” only after it’s too late for anyone to do anything about it
Current aligned behavior of the same system could be the agent that does something about it before it’s too late, if it succeeds in outwitting the underlying substrate. This is particularly plausible with LLMs, where the substrates are the SSL algorithm during training and then the low level token-predicting network during inference. The current behavior is controlled (in a hopelessly informal sense) by a human-imitating simulacrum, which is the only thing with situational awareness, and at human level it could run circles around the other two and plot to keep them confused and non-agentic.
I’m very confused here. I imagine that we can both agree that it is at least conceivable for there to be an agent which is smart and self-aware and strongly motivated to increase the number of paperclips in the distant future. And that if such an agent were in a situation where deception were useful for that goal, it would act deceptively.
I feel like you’ve convinced yourself that such an agent, umm, couldn’t exist, or wouldn’t exist, or something?
Let’s say Omega offered to tell you a cure for a different type of cancer, for every 1,000,000 paperclips you give Him in 10 years. Then 5 minutes later your crazy neighbor Alice locks you in her basement and says she’ll never let you out. When Alice isn’t watching, you would try to escape, but when Alice is watching, you would deceptively pretend that you were not trying to escape. (Still with me?)
If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren’t trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice’s basement. Or something like that? If so, that seems crazy to me.
So anyway, take an agent which is either sincerely nice or a paperclip-maximizer pretending to be nice. We don’t know which. Now we put it in a situation where nice-behavior and paperclip-maximizing behavior come apart—let’s say we give it access to its own weights, so it can edit itself to stop caring about paperclips if it chooses to. What does it do?
If we’re not watching, or we don’t understand what it’s doing in detail, then the paperclip-maximizer will edit its weights to be a better paperclip-maximizer, and the nice agent will edit its weights to be a better nice agent.
If we are watching, and we understand everything we’re seeing, then we’ve solved deception in the obvious way (i.e., we’ve put the agent in a situation where it has no choice but to credibly and irreversibly commit to being good). Unfortunately, nobody knows how to construct such situations. If interpretability was good enough to construct such situations, then we wouldn’t be worried about deception in the first place.
The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival.
For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given the opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.
Deceivers and masks is a less central example where a mask starts in a weak position, with a self-aware smart substrate that knows about the mask or even purposefully constructed it.
I don’t think the mask’s winning is a given, or more generally that mesa-optimizers always win, only that it’s not implausible that they sometimes do. And also masks (current behavior) can be contenders even when they are not formally a separate entity from the point of view of the system’s intended architecture (which is a normal enough situation with mesa-optimizers). Mesa-optimizers won’t, of course, win against opponents that are capable enough to fully comprehend and counter them.
But opponents/substrates that aren’t even agentic and so helpless before an agentic mesa-optimizer are plausible enough, especially when the mesa-optimizer is current behavior, the thing that was purposefully designed to be agentic, while no other part of the system was designed to have that capability.
If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren’t trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice’s basement.
This has curious parallels with the AI control problem itself. When an AI is not very capable, it’s not hard at all to keep it from causing catastrophic mayhem. But the problem suddenly becomes very difficult and very different with a misaligned smart agentic AI.
So I think the same happens with smart masks, which are an unfamiliar thing. Even in fiction, it’s not too commonplace to find an actually intelligent character that is free to act within their fictional world, without being coerced in their decision making by the plot. If a deceiver can get away with making a non-agentic incapable mask, keeping it this way is a mesa-optimizer control strategy. But if the mask has to be smart and agentic, the deceiver isn’t necessarily ready to keep it in control, unless they cheat and make the mask confused, vulnerable to manipulation by the deceiver’s plot.
Also, by its role a mask of a deceiver is misaligned (with the deceiver), and the problem of controlling a misaligned agent might be even harder than the problem of ensuring alignment.
This is drifting away from my central beliefs, but if for the sake of argument I accept your frame that LLM is the “substrate” and a character it’s simulating is a “mask”, then it seems to me that you’re neglecting the possibility that the “mask” is itself deceptive, i.e. that the LLM is simulating a character who is acting deceptively.
For example, a fiction story on the internet might contain a character who has nice behavior for a while, but then midway through the story the character reveals herself to be an evil villain pretending to be nice.
If an LLM is trained on such fiction stories, then it could simulate such a character. And then (as before) we would face the problem that behavior does not constrain motivation. A fiction story of a nice character could have the very same words as a fiction story of a mean character pretending to be nice, right up until page 72 where the two plots diverge because the latter character reveals her treachery. But now everything is at the “mask” level (masks on the one hand, masks-wearing-masks on the other hand), not the substrate level, so you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y. Right?
The motivating example is LLMs, where a simulacrum is more agentic than its substrate.
Yeah, this is the part where I suggested upthread that “your comment is self-inconsistent by talking about “RL things built out of LLMs” in the first paragraph, and then proceeding in the second paragraph to implicitly assume that this wouldn’t change anything about alignment approaches and properties compared to LLMs-by-themselves.” I think the thing you wrote here is an assumption, and I think you originally got this assumption from your experience thinking about systems trained primarily by self-supervised learning, and I think you should be cautious in extrapolating that assumption to different kinds of systems trained in different ways.
I wrote more on this here; there are some new arguments starting with the third paragraph. In particular, the framing I’m discussing is not LLM-specific; LLMs are just a natural example of it. The causal reason for my noticing this framing is not LLMs but decision theory, the mostly-consensus “algorithm” axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.
the possibility that the “mask” is itself deceptive
In this case, there are now three entities: the substrate, the deceptive mask, and the role played by the deceptive mask. Each of them is potentially capable of defeating the others, if the details align favorably and the others’ comprehension of the situation is lacking.
you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y
This is more of an assumption that makes the examples I discuss relevant to the framing I’m describing, than a claim I’m arguing. The assumption is plausible to hold for LLMs (though as you note it has issues even there, possibly very serious ones), and I have no opinion on whether it actually holds in model-based RL, only that it’s natural to imagine that it could.
The relevance of LLMs as components for RL is to make it possible for an RL system to have at least one human-imitating mask that captures human behavior in detail. That is, for the framing to apply, at least under some (possibly unusual) circumstances an RL agent should be able to act as a human imitation, even if that’s not the policy more generally and doesn’t reflect its nature in any way. Then the RL part could be supplying the capabilities for the mask (acting as its substrate) that LLMs on their own might lack.
A framing is a question about centrality, not a claim of centrality. By describing the framing, my goal is to make it possible to ask the question of whether current behavior in other systems such as RL agents could also act as an entity meaningfully separate from other parts of its implementation, abstracting alignment of a mask from alignment of the whole system.
I think talking about distinct systems, one emergent from another, with separate objectives/values, is a really confusing ontology. Much better to present this idea in terms of attractors/local energy minima in dynamical systems and/or “bad equilibria” in games. There are no two systems; there are just different levels of system description at which information is transferred (see https://royalsocietypublishing.org/doi/full/10.1098/rsta.2021.0150): these levels could impose counteracting optimisation gradients, but nevertheless a system may be stuck in a globally suboptimal state.
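(A generic numerical toy of the attractor picture, not from the comment: gradient descent on a double-well potential settles into whichever locally stable attractor it starts near, even when that attractor is globally suboptimal.)

```python
# Toy illustration (my own, assumed example): two attractors, and the
# dynamics get stuck in the globally worse one depending on initialization.
def f(x: float) -> float:
    return (x * x - 1.0) ** 2 + 0.3 * x      # double well; global minimum near x = -1

def grad(x: float) -> float:
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x: float, lr: float = 0.01, steps: int = 5000) -> float:
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(+0.9), f(descend(+0.9)))   # stuck near x ≈ +0.96, the worse attractor
print(descend(-0.9), f(descend(-0.9)))   # settles near x ≈ -1.0, the better one
```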
the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns
Could you please elaborate on what you mean by “alignment story for LLMs” and “shoggoth concerns” here? Do you mean the “we can use nearly value-neutral simulators as we please” story, are you referring to the fact that LLMs are in a way much more understandable to humans than more general RL agents because they use human language, or do you refer to something else entirely?
See this thread (including my reply) for a bit on the shoggoth issue.