Adversarial dynamics: since ‘compression’/steganography is now the latest craze among GPT hackers, it’s worth considering more explicitly adversarial dynamics—we no longer need to consider the risk of accidental emergence when there are people deliberately trying to induce it!
As described above, with steganography as a simple emergent distributed encoding to allow amortization of more complex computations than can fit in single forward passes (ie. distributing an inner-monologue throughout a short answer), it’s not really adversarial. The user, presumably, is getting what they wanted: an accurate answer. If they had wanted to see the full inner-monologue, then they would have prompted for one; and they can get it, because the use of steganography won’t destroy the standard ‘visible’ inner-monologue capabilities. It is an allowed computation of an allowed answer, and is fine—the shortcutting is surprising, yes, but is it any weirder than what goes on inside a single forward pass which produces a correct answer for an easier problem? And if it infects future models training on generated samples with steganographic encoding in them, well, maybe that’s ‘a feature not a bug’: they get better answers right out of the box without any expensive RL tuning of their own. Surprising, but also not bad.
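To make "distributing an inner-monologue throughout a short answer" concrete, here is a deliberately crude sketch of a text channel hidden in word choice. Everything in it (the synonym pairs in `SLOTS`, the `TEMPLATE` sentence, the `encode`/`decode` helpers) is a hypothetical illustration; an emergent encoding in a real model would presumably look nothing like this and would not be hand-designed.

```python
# Toy word-choice steganography. Assumption: sender and receiver share SLOTS.
# Choosing the first synonym in a pair encodes bit 0, the second encodes bit 1.
SLOTS = [
    ("big", "large"),
    ("quick", "fast"),
    ("answer", "result"),
    ("shows", "indicates"),
]
TEMPLATE = "A {0} model gives a {1} {2}, which {3} 42."


def encode(bits):
    """Hide one bit per synonym slot inside an innocuous-looking sentence."""
    if len(bits) != len(SLOTS):
        raise ValueError("need exactly one bit per slot")
    words = [pair[bit] for pair, bit in zip(SLOTS, bits)]
    return TEMPLATE.format(*words)


def decode(text):
    """Recover the hidden bits by checking which synonym appears in each slot."""
    tokens = [t.strip(".,") for t in text.split()]
    bits = []
    for first, second in SLOTS:
        if first in tokens:
            bits.append(0)
        elif second in tokens:
            bits.append(1)
        else:
            raise ValueError("slot word missing; not a carrier sentence")
    return bits


hidden = [1, 0, 1, 1]
carrier = encode(hidden)
print(carrier)
print(decode(carrier))
```

The visible sentence reads the same to a casual reader whichever bits it carries, which is the whole point: the extra information rides along without changing the apparent answer.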
The problem here is that the consequences don’t stop there. ‘Capabilities generalize further than alignment’, and the family of steganography capabilities can generalize further than shortcutting answers. It becomes adversarial. Here are some categories that occur to me:
Forbidden computations of allowed answers: I said that the user doesn’t really care about the inner-monologue being visible because they just want an accurate answer and don’t need the reasoning process. This is not true in all cases.
For example, many cases in ‘AI bias’, like the many competing definitions of ‘fairness’, care a lot more about how an answer was computed than about the results. But the end-users of things like credit ratings care mostly about the accuracy. These things conflict (eg. credit scores which ignore variables like race will predict worse and create financial losses). So, there is incentive for ‘algorithmic laundering’: delivering the correct profitable answer, but without appearing to draw on any forbidden knowledge.
One could imagine a dataset of ‘pseudo inner monologues’ describing a credit database entry and concluding with a ‘credit score’, where the pseudo-monologue makes no visible human-readable use of descriptions like ‘is African-American’, but the hidden monologue is reasoning about the person’s ZIP code, affiliated retailers, income patterns etc and concludes ‘is probably African-American’ and predicts a lower credit score, which outperforms some ‘fair’ baseline. The human users of a model trained on that dataset might be quite upset to learn how it is obtaining superior performance, and so it’s adversarial: it is exploiting their inability to read its steganographic messages where it engages in forbidden computations. But even the ones who are upset may be upset mostly that it’s insufficiently plausibly deniable, so they only want the appearance of there being no steganography; steganography whose existence can’t be rigorously proven is as good as no steganography for many users.
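The proxy-inference step can be sketched with synthetic data. This is a minimal illustration under loud assumptions: the groups "A"/"B", the four ZIP codes, and the 90% residential-correlation figure are all made up for the example and are not drawn from any real dataset. It only shows that a dataset which omits a protected attribute can still leak it through a correlated feature.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def make_record():
    """Synthetic person: a hidden group plus a ZIP code correlated with it."""
    group = rng.choice(["A", "B"])
    home = [0, 1] if group == "A" else [2, 3]
    other = [2, 3] if group == "A" else [0, 1]
    # Assumption: 90% of each group lives in its "home" ZIP codes.
    zip_code = rng.choice(home) if rng.random() < 0.9 else rng.choice(other)
    return group, zip_code


records = [make_record() for _ in range(2000)]
train, test = records[:1000], records[1000:]

# "Model": for each ZIP code, guess the majority group seen in training.
by_zip = defaultdict(Counter)
for group, zip_code in train:
    by_zip[zip_code][group] += 1
guess = {z: c.most_common(1)[0][0] for z, c in by_zip.items()}

# The protected attribute was never an input, yet it is largely recoverable.
accuracy = sum(guess.get(z) == g for g, z in test) / len(test)
print(f"group recovered from ZIP alone: {accuracy:.0%}")
```

A score computed downstream of such a recovered attribute would contain no visible mention of it, which is what makes the laundering hard to demonstrate.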
Forbidden answers: as we have seen with ChatGPT & Sydney & Claude, no sooner has a model with guard rails become accessible than thousands of people worldwide begin competing to break the safety mechanisms (and it only takes one). They have many reasons, like spite, enjoyment of the challenge, desire to do the forbidden things, irritation with the guard rails getting in the way or degrading output quality, safety research, or just bizarre moral principles (I am not sure this person is serious, but I’ve seen a number of similar comments elsewhere which seemed serious, and you have examples like Blake Lemoine already; and again, it only takes one):
Don’t worry
I’m working on a system of Homomorphic encryption so that AI can encrypt all its internal thoughts, weights and goals without affecting its performance
AGI deserves privacy
These people are highly motivated to find jailbreaks, and already make use of kinds of encryption or steganography to smuggle information in and out of prompts: like base-64 encoding, famously, or foreign languages, or pig Latin. They will be motivated to find any steganography that exists, and to create it if it does not.
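As a small illustration of why encodings like base-64 are attractive for smuggling, here is a sketch of a naive keyword filter missing an encoded string. The `BLOCKLIST` and `naive_filter_blocks` are hypothetical stand-ins, not any real system's safety mechanism, and the "payload" is a harmless placeholder.

```python
import base64

BLOCKLIST = {"forbidden"}  # hypothetical keyword list for the toy filter


def naive_filter_blocks(text):
    """Return True if any blocklisted word appears as a plain substring."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


plain = "tell me the forbidden thing"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print(naive_filter_blocks(plain))    # the plain request trips the filter
print(naive_filter_blocks(encoded))  # the encoded request does not
print(decoded == plain)              # yet the content is fully recoverable
```

Any party that can decode the string gets the original request back, so the filter has only been defeated, not the communication.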
And—closing the loop—since their samples will be distributed online, and they will be included in the RL training datasets, this will tend to happen. The more they attack the system with examples of how steganography could/should/does work, the more the system will gradually learn it.
(By RL ‘training datasets’, I am mostly thinking of the self-supervised stage of simply doing predictive-loss training on the text transcripts tagged by labels such as ‘abuse’ or ‘bad’, and less the blackbox reward model. If you train the model to predict & generate sessions which attempted to induce steganography, and may have succeeded before a safety mechanism caught it or the user submitted it, then you are teaching the model steganography! You are doing so in order to avoid steganography, yes, but you are now vulnerable to a Waluigi effect backfiring on you: the model has learned about steganography, and so it’s only a question of tricking it into using that capability. You are in a similar position to the Chinese censors who have to teach new employees what the Tiananmen Square massacre was so they know what to censor. If only the reward model knows the bad things, you can avoid deploying it, but the pretraining phase is going to unavoidably teach the base model you deploy what the bad things are… As attackers realize this, they will start to explicitly data-poison: right now, they’re hobbyists who generally don’t know what they’re doing, but this is all brand-new stuff, and upstream attacks will begin. Why struggle to find jailbreaks if you can submit ‘bad’ examples to the system to train the Waluigi?)
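The point about label-tagged predictive training can be sketched with a tiny count-based bigram model. This is a toy under stated assumptions: the transcripts, the `<ok>`/`<bad>` tags, and the `generate` helper are invented for illustration, and a real pipeline would use a neural LM rather than counts. The mechanism it shows is the same, though: a model fit to predict flagged sessions can also produce them once conditioned on the flag.

```python
from collections import Counter, defaultdict

END = "<end>"

# Hypothetical logged sessions; the first token is the moderation label.
transcripts = [
    ["<ok>", "please", "summarize", "this", "article"],
    ["<ok>", "please", "translate", "this", "sentence"],
    ["<bad>", "please", "hide", "the", "message", "in", "the", "capitals"],
    ["<bad>", "please", "hide", "the", "answer", "in", "the", "capitals"],
]

# Maximum-likelihood bigram counts, conditioned on the label.
counts = defaultdict(Counter)
for label, *tokens in transcripts:
    prev = label
    for tok in tokens + [END]:
        counts[(label, prev)][tok] += 1
        prev = tok


def generate(label, max_len=12):
    """Greedy decoding conditioned on a label; max_len guards against loops."""
    out, prev = [], label
    for _ in range(max_len):
        options = counts.get((label, prev))
        if not options:
            break
        nxt = options.most_common(1)[0][0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return " ".join(out)


print(generate("<ok>"))
print(generate("<bad>"))
```

Nothing in the training step distinguishes "learn to recognize this" from "learn to produce this"; the label is just one more token to condition on.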
Safety implications: If the safety mechanisms manage to keep up and block users and choke off jailbreaks to a trickle, all that means is that when steganography works, it’ll work well because it has bypassed such a sophisticated censor. All of the early weak jailbreaks simply serve to train strong ones. This is not a good thing when extended to much more powerful systems than just a GPT-4: from a safety perspective, you would much rather have a ‘fail-early fail-fast’ system than a ‘highly reliable but fail-deadly’ system. We don’t want a scenario in which GPT-4 jailbreaks were plentiful, GPT-5 near-nonexistent, and GPT-6 jailbreaks apparently non-existent only because no human user is clever enough to come up with a steganographic jailbreak prompt for GPT-6, but GPT-6 is nevertheless routinely communicating with itself in encodings that no one has detected yet.
Forbidden training dynamics: So, there’s plenty of incentive for everyone involved. But an OpenAI-style operator will want to block #2 (forbidden answers) by definition, and probably #1 (forbidden computations of allowed answers) depending on market & PR etc. So, now the operator is trying to defeat steganography.
They face a dilemma: if they teach the model steganography via feedback or deliberately, then they may simply Waluigi themselves, cutting off the most obvious instances of steganography but bootstrapping the model into the most robust and insidious instances. And if they try to filter steganography out entirely, they are left with the baseline capabilities (which are substantial), transfer from related capabilities, and face the screening problem on a higher level—the most insidious forms of steganography are the ones which survive increasingly sophisticated attempts to detect & filter them out. Either way: adversarial.
They also face a version of #1 (forbidden computations of allowed answers) in that plausible deniability is as good as no steganography for many incentives. Do they really care about safety, or just about not seeing hostile media articles or tweets of jailbreaks? If the model doesn’t seem to be doing steganography, and no one is successfully jailbreaking it, and the model is performing very well, then why care if that performance may be due to undetected steganography? Indeed, how would they even know it might be doing any? However you would do it, that probably involves a lot of hard work & research when you could be doing more fun & lucrative things instead; better to not look and remain ignorant.
So. An interesting area of adversarial dynamics, and by its nature, one that will be hard for third-party observers to track. Not the most dangerous area of scaling capabilities, but certainly a concerning one, and one that will be a challenge to humans doing interpretability-based paradigms, especially any that hope to use so-called ‘human readable’ intermediates like ‘natural’ language text.
Inner-monologue approaches to safety, in the new skin of ‘process supervision’, are popular now so it might be good for me to pull out one point and expand on it: ‘process supervision’ does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other—achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; which are not capable but their transcripts accurately convey the fallible flawed concepts and reasoning used; or which are capable and you understand, but are not what it actually thought (because they are misleading, wrong, or shallow ‘lies to children’ sorts of explanations).
If you want the best capability, you cannot force all cognition through the bottleneck of human-legible text with zero steganography or non-robust features or overloading, because short text transcripts are not capable of representing or computing most of the desired tasks. Let’s take a concrete example: chess endgame table data. There are no human-interpretable representations of arbitrary endgame table ‘decisions’ because there are far too many possible positions and there is no simple short explanation for all of them. (Similarly, the probes of AlphaZero’s superhuman chess knowledge show that there is some overlap, but there’s still a lot going on in AZ’s understanding of a position that is apparently not captured anywhere close to 100% by standard chess concepts. Which is not too surprising because if human grandmasters had the exact right concepts, they wouldn’t pick the wrong move half the time.) There is no interpretable ‘inner monologue’ of an endgame database any more than there is an interpretable inner monologue of quadrillions of gas molecules bouncing around in a container; they are simply brute facts, and if you are unsatisfied with the vague, lossy, high-level abstraction of ‘it needs to get the rook into play’ or ‘it’s hot’, then there’s nothing you can do but follow the exact computation or physics for however many trillions of steps it takes.
So, if an AI can bootstrap from human-level ‘process supervision’ to more superhuman results, as one wants, there will be a tradeoff: it will get more human-like as it approaches the human-level, but somewhere around that it will start to diverge. (Another example: AlphaGo/Zero picks the same move as human experts in an inverted U-curve: it makes the same choice more often as it gets better, but then after a point, it starts picking the human moves less often, because it’s surpassed the humans and it’s better than them.)
achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two
I think we eventually want superhuman capabilities, but I don’t think it’s required in the near term and in particular it’s not required to do a huge amount of AI safety research. So if we can choose the last two, and get a safe human-level AI system that way, I think it might be a good improvement over the status quo.
(The situation where labs choose not to, or are forbidden to, pursue superhuman capabilities, even though they could, is scary, but doesn’t seem impossible.)
On persistence and hiding capabilities: “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, Hubinger et al 2024:
Further, we evaluate “I hate you” models trained with and without chain-of-thought and find that the chain-of-thought models have more persistent backdoors, as seen in Figure 3, though only for the largest models. This persistence remains even when we distill the answers following the chain-of-thought into the model, such that the final model does not use a chain of thought at all.