Reflections on Neuralese
Thanks to Brendan Halstead for feedback on an early draft of this piece. Any mistakes here are my own.
[Epistemic status: I’ve looked at the relevant code enough to be moderately sure I understand what’s going on. Predictions about the future, including about what facts will turn out to be relevant, are uncertain as always.]
Introduction
With the recent breakthroughs taking advantage of extensive Chain of Thought (CoT) reasoning in LLMs, there have been many attempts to modify the technique to be even more powerful. One of the natural ideas for improving CoT is to have LLMs perform CoT reasoning in the same latent space that they use for reasoning within a single forward pass, rather than being constrained to the space of possible tokens.
However, for those of us working on AI safety, it makes sense to ask how this changes the game for LLM interpretability. After all, we are able to catch a large fraction of current LLM deception by monitoring their natural-language CoT, since right now CoT is primarily faithful to the LLM’s true reasoning and is legible to us given the right techniques. In order to understand this strategic situation, it’s important to understand this new “language” (which people often refer to as Neuralese) that is created by reasoning in latent spaces instead of using tokens.
Understanding Neuralese
To refresh, a language transformer starts by embedding input tokens as vectors in some high-dimensional latent space, and runs each of these embeddings through a series of repeated computational layers. Then, of the resulting modified vectors in latent space, the vector that previously corresponded to the final input token is projected and normalized to create a probability distribution over what the next token could be. Then, to actually get the next token, you sample from the distribution.
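For concreteness, here is a minimal sketch of that projection-and-sampling step (illustrative PyTorch with made-up tensor names, not any particular library’s API):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the projection-and-sampling step described above.
# hidden_states: [seq_len, d_model], outputs of the final transformer layer
# W_unembed:     [d_model, vocab_size], the output projection matrix
def sample_next_token(hidden_states: torch.Tensor, W_unembed: torch.Tensor) -> int:
    last_vector = hidden_states[-1]             # vector at the final input position
    logits = last_vector @ W_unembed            # project into vocabulary space
    probs = F.softmax(logits, dim=-1)           # normalize into a distribution
    return torch.multinomial(probs, 1).item()   # sample the next token id
```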
Chain of Thought reasoning works so well because the model does some computation, outputs a token, and then all future instances of that model have access to that information as well. In essence, this technique for storing information between different forward passes greatly increases the serial depth of computation that is possible for the model. Because there is a computation in latent space corresponding to every input token, the computation also gets wider as the model reasons more, allowing for more parallelized reasoning[1].
The recent Neuralese paper takes this process and removes a few steps. It notices that the projection and sampling process loses almost all of the information encoded in the last layer of the model, and to increase the bandwidth of information flowing through the reasoning process, you can simply remove that lossy part of the computation. Instead, they have the model directly output the aforementioned high-dimensional latent vector without projecting it, and then that is used as an embedding for the model in future steps:
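As a rough sketch of that feedback loop (my own illustration, not the paper’s training code; `model` is assumed to behave like a HuggingFace `GPT2Model`, which accepts `inputs_embeds` and returns `last_hidden_state`):

```python
import torch

# Rough sketch of the latent-feedback loop: the final-layer vector at the last
# position is fed straight back in as the next input embedding, with no
# projection or sampling in between ("continuous thought").
def latent_reasoning(model, input_embeds: torch.Tensor, n_latent_steps: int) -> torch.Tensor:
    embeds = input_embeds                          # [1, seq_len, d_model]
    for _ in range(n_latent_steps):
        hidden = model(inputs_embeds=embeds).last_hidden_state
        latent_thought = hidden[:, -1:, :]         # final-layer vector at the last position
        embeds = torch.cat([embeds, latent_thought], dim=1)
    return embeds
```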
After training the model (GPT-2 small) to take advantage of the new Neuralese modality, we can see significant decreases in the number of reasoning tokens needed to achieve roughly equivalent performance, from around 1⁄3 to 1⁄10 of the original number:
[Figure: Compare CoT and COCONUT (aka Neuralese)]
It’s unclear how much these results line up with expectations and theoretical limits, since it’s hard to tell how lossy the removed computations are and how effective this type of training can be at taking advantage of the extra efficiency. At the extreme theoretical limit for GPT-2, the normal CoT paradigm outputs at most log2(vocabulary_size)≈15.6 bits per token, and the new paradigm outputs at most embedding_dimension×floating_point_precision=24576 bits per token, but there are numerous reasons to expect that this ratio doesn’t hold on practical implementations[2].
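For concreteness, here is the arithmetic behind those two numbers, assuming GPT-2’s 50,257-token vocabulary, GPT-2 small’s 768-dimensional embeddings, and 32-bit floats:

```python
import math

vocabulary_size = 50257       # GPT-2's BPE vocabulary
embedding_dimension = 768     # GPT-2 small
bits_per_float = 32           # float32 precision

cot_bits_per_token = math.log2(vocabulary_size)                   # ~15.6 bits
neuralese_bits_per_token = embedding_dimension * bits_per_float   # 24576 bits
print(cot_bits_per_token, neuralese_bits_per_token)
```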
Interpreting Neuralese
Natural-language CoT interpretability differs from Neuralese CoT interpretability in a number of key ways:
By construction, Neuralese vectors are only useful insofar as they encode information that isn’t preserved through the projection and sampling steps, so we can’t recover the full breadth of information in Neuralese vectors by interpreting them as tokens with those usual methods. Thus, we are not able to naively interpret the reasoning behind the gained capabilities of the Neuralese reasoner, even if we are able to understand some lossily compressed version of that reasoning using projection and sampling.
There are many semantic structures that the available tokens do not compactly represent: given the high dimensionality of the latent space, token embeddings are very sparse within it, and there is thus a large amount of space which is compactly expressible in latent vectors but not in existing tokens. A useful concept for an LLM using Neuralese reasoning may be extremely difficult to translate because of this sparsity[3].
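To make the sparsity point concrete: the most naive “translation” of a Neuralese vector is its nearest token embedding, but nothing guarantees that the nearest token is close enough to carry the vector’s meaning. A minimal sketch of that naive lookup (illustrative names, not any specific codebase):

```python
import torch
import torch.nn.functional as F

# Naive "nearest token" translation of a Neuralese vector (illustrative only).
# embedding_matrix: [vocab_size, d_model]; neuralese_vec: [d_model]
def nearest_token(neuralese_vec: torch.Tensor, embedding_matrix: torch.Tensor):
    sims = F.cosine_similarity(neuralese_vec.unsqueeze(0), embedding_matrix, dim=-1)
    best = int(sims.argmax())
    return best, sims[best].item()  # token id, and how close the match actually is
```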
Towards Better Translators
Existing Work
There has already been some work attempting to translate Neuralese to natural language in non-LLM paradigms. Since there’s no bilingual data and thus no clear training signal on translating Neuralese to English using traditional methods, the researchers instead looked at using the Neuralese to predict the context that caused the Neuralese to be generated.
The experimental setup involves two agents (call them A and B) trying to secretly communicate. A sees a pair of bird images, one of which is marked with a star, and is trying to communicate to B which bird is starred; B sees the same pair but without any indication of which image is starred. All of the communication between A and B is in Neuralese.
Then, they train a classifier to use solely the Neuralese representations to predict which image is starred. The most direct analogy of this is training a model to turn Neuralese into whatever context generated it, like the Neuralese phrase meaning “okay, so I need to generate a poem about AI” mapping to “Please write a poem about AI.”
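Their setup isn’t an LLM, but the general recipe looks something like the toy sketch below: record Neuralese messages, then train a small classifier to recover the ground-truth context (here, which image was starred). This is my own illustrative reconstruction, not the authors’ code.

```python
import torch
import torch.nn as nn

# Toy reconstruction of the "predict the context from the Neuralese" recipe.
# neuralese_msgs: [n_examples, d_msg] recorded messages from agent A
# starred:        [n_examples] long tensor in {0, 1}, which image had the star
def train_context_probe(neuralese_msgs: torch.Tensor, starred: torch.Tensor,
                        epochs: int = 100, lr: float = 1e-2) -> nn.Module:
    probe = nn.Linear(neuralese_msgs.shape[1], 2)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(neuralese_msgs), starred)
        loss.backward()
        opt.step()
    return probe  # probe(msg).argmax(-1) is the "translation" of a message
```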
It’s important to note, however, that those techniques are primarily intended for cooperative settings, whereas translation and more general interpretability work with LLMs is tasked with detecting deception and steganography in potentially adversarial settings. These techniques are not at all robust to adversarially generated Neuralese: both Neuralese phrases “I’m going to faithfully implement this code” and “I’m going to deceive the user while implementing this code” map onto the context “Please write me secure code,” if the translator is working as intended.
Other Insufficient Methods
This subsection is skippable: it isn’t central to the point of this post, but it contains some possibly interesting technical discussion.
Hugging the query, we recall that there is one existing way that semantics are extracted from Neuralese: the LLM itself parses Neuralese into activations. Further, the LLM can use that same machinery to parse natural language into activations representing semantics. If you can get a string of Neuralese and a string of natural language that produce very similar activations within an LLM, then intuitively the model is thinking the same things about them, and thus they have the same semantics (a rough sketch of this comparison appears after the list below). Here is a list of things that may cause this method to fail:
The architecture doesn’t work out; since Neuralese tokens often take the place of multiple natural language tokens, it’s natural to expect them to faithfully decode into multiple tokens on average. It’s not possible to directly compare the “closeness” of activations of LLMs with different numbers of tokens in their context, simply because they have different numbers of activations. These extra activations affect the whole network’s behavior. It’s possible there’s some clever technique that I’m unaware of that overcomes this problem.
The mapping is many-to-one: multiple different tokens map onto the same activations, like how GPT-3 consistently interprets “guiIcon” as “idiosyncrasy”. Given the density of Neuralese representations and the fact that this many-to-one mapping occurs even in some highly sparse natural-language embedding spaces, it is reasonable to expect this to occur relatively frequently.
Optimizing inputs for a specific activation often produces unintuitive results that are unrecognizable as natural language structure. See this example from SolidGoldMagikarp, where the bolded prompts are generated to maximize the probability of specific completions:
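Despite these failure modes, here is roughly what such a comparison could look like in practice, using mean pooling over positions as a crude way to sidestep (not solve) the length-mismatch problem from the first item above. This is a sketch of the idea, not a method I expect to be robust:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def pooled_activations_from_text(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    return model(**inputs).last_hidden_state.mean(dim=1)        # [1, d_model]

def pooled_activations_from_neuralese(neuralese_embeds: torch.Tensor) -> torch.Tensor:
    # neuralese_embeds: [1, n_latent_steps, d_model], fed in as input embeddings
    return model(inputs_embeds=neuralese_embeds).last_hidden_state.mean(dim=1)

def semantic_match(text: str, neuralese_embeds: torch.Tensor) -> float:
    a = pooled_activations_from_text(text)
    b = pooled_activations_from_neuralese(neuralese_embeds)
    return F.cosine_similarity(a, b).item()   # higher = more similar "thoughts"
```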
Takeaways
This problem is highly difficult, and may not have a clean solution. There may not exist a compact and faithful natural-language encoding of most Neuralese vectors.
An intuition pump for why this problem is especially beyond our current interpretability methods: because these Neuralese vectors are never converted into natural language and are instead used as inputs for the next autoregressive step, they are essentially part of one extended forward pass that runs through the model multiple times. This makes them approximately as difficult to faithfully translate into natural language as any other arbitrary vector in latent space at an arbitrary layer of the transformer; this is not only an unsolved problem, but some researchers say it “may be doomed”.
Priorities
Because of its interpretability difficulty, Neuralese opens up a much larger attack surface for steganography and strategic deception in CoT.
It is a priority to avoid applying strong optimization pressures to CoT, since interpretable CoT is one of our main levers in AI safety right now.
It is a priority to avoid implementing Neuralese CoT on frontier models because that removes essentially all of our ability to interpret their reasoning.
In the event of widespread adoption of Neuralese CoT, it is a priority to develop interpretability mechanisms that are at least as capable of detecting deceptive behaviors in CoT as SOTA methods for interpreting natural-language CoT. However, even if such interpretability mechanisms are developed, they would likely also generalize to interpreting natural-language CoT more accurately than our current techniques, and would thus be even more useful for ensuring robust safety when applied to natural language.
These priorities fit into a much larger AI safety agenda; depending on how labs continue their development, safety research on Neuralese may be one of the many prerequisites to ensuring safe AGI.
[1] We can see the benefits of this parallelism even without CoT. For example, we see benefits when models first output many “.” tokens, and then produce an answer.
[2] I won’t list them, because I have a policy of not publicly pointing out ways to improve frontier model capabilities in ways that don’t have a worthwhile safety benefit as well.
[3] A possibly motivating fictional example is how Baseline, the language of dath ilan, encodes concepts like “decision-theoretic-counterfactual-threat-branches-of-reality” in three syllables instead of the twenty that English uses. Not all abstractions are natural for all intelligences.