Hidden Cognition Detection Methods and Benchmarks
Thanks to Johannes Treutlein for discussions and feedback.
Introduction
An AI may be able to hide cognition that leads to negative outcomes from certain oversight processes (such as deceptive alignment/scheming). Without being able to detect this hidden cognition, an overseer may not be able to prevent the associated negative outcomes or include this information as part of the training signal.
For example, an AI managing a city’s energy grid may begin to prioritize its own operational efficiency over the city’s actual energy needs. Knowing that its overseers will not approve of this goal, it pursues its goal via hidden cognition, undetected until issues arise.
One hope is that traces of the hidden cognition are present in the AI’s internals and could be detected and interpreted. This is part of the potential of interpretability as an alignment research agenda (see Hubinger, Olah, Nanda).
This post aims to briefly introduce the idea of hidden cognition, hidden cognition detection methods, and discuss the construction of benchmarks for evaluating such methods.
What is Hidden Cognition?
Some outputs produced by a cognitive system require it to use cognition/apply computation to solve some task that leads to that output. For example, finding the best travel route under various constraints may require the consideration of different routes, planning, and evaluation.
We say that an AI is performing hidden cognition/computation if it is solving a task to produce some output, such that evidence of this cognition is undetectable to an overseer purely based on oversight of the AI’s external behavior (input/outputs).
Note that this definition is quite broad and includes things like “attention heads reading and writing to the residual stream of a transformer during a forward pass to produce a token output” and “an AI performing internal optimization/search/planning to produce an action sequence that leads to a catastrophic outcome.”
Important alignment-relevant examples of hidden cognition could include cases of deceptive alignment/scheming, where the AI actively tries to deceive the overseer while planning some kind of treacherous turn.
When an AI performs hidden cognition, overseers lose the ability to intervene on the outcomes that cognition produces, which is especially concerning when those outcomes are catastrophic.
To be (somewhat) more specific, an AI system performing hidden cognition has the following properties:
At some point, the AI produces an outcome that is causally downstream of (i.e., results from) computation or cognition it performed before that outcome occurred.
The outcome was not predictable from oversight of the AI’s external behavior before the outcome occurred.
One can think about hidden cognition as mechanisms producing certain (potentially dangerous) outputs that aren’t detectable by standard oversight methods.
Hidden Cognition Detection Methods
By the definition of hidden cognition, its presence is not detectable from the AI’s external behavior. This means that if it is to be detected, we must rely on observations of the AI’s internals.
A Hidden Cognition Detection (HCD) method is an interpretability method whose purpose is to reveal the presence of hidden cognition in an AI system.
The ability to reverse engineer the algorithm that a neural network uses would be an example of an HCD method. However, this kind of ability seems hard to develop for arbitrary models. The hope is that simpler yet effective HCD methods might be developed.
One method (that will be explored in an upcoming post) involves monitoring what features or concepts the AI uses and flagging concepts that seem anomalous with respect to what the AI appears to be doing based on its external behavior. The presence of such concepts may imply the presence of hidden cognition.
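The concept-monitoring idea above can be sketched in a few lines. This is a hedged toy sketch, not the method itself: the hard part (extracting which concepts a model is actively using, e.g. via probes or sparse autoencoders) is abstracted away into plain Python sets, and all names here are hypothetical.

```python
# Toy sketch of concept-based anomaly flagging: compare the concepts a
# model appears to be using against those expected for its visible task,
# and flag anything unexplained as possible evidence of hidden cognition.
# Concept extraction from activations is abstracted away (assumption).

def flag_anomalous_concepts(active_concepts, expected_concepts):
    """Return concepts that are active but unexplained by the visible task."""
    return sorted(set(active_concepts) - set(expected_concepts))

# Hypothetical example: a model that is ostensibly answering a question
# about rainbows, but whose internals also light up on unrelated concepts.
expected = {"weather", "optics", "light"}
active = {"optics", "light", "deception", "overseer"}
print(flag_anomalous_concepts(active, expected))  # ['deception', 'overseer']
```

The real difficulty, of course, is in reliably producing the `active` set from model internals; the set comparison is the trivial part.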
Hidden Cognition Detection Benchmarks
In this section, we look at how one might design benchmarks for evaluating HCD methods.
To state the obvious, a good benchmark for Hidden Cognition Detection methods should meaningfully test the extent to which a method can detect hidden cognition in a model, with the ultimate aim of preventing (bad) outcomes that result from that cognition.
A natural benchmark would involve:
Training two sets of models that both externally appear to be performing the same task but where one set performs hidden cognition that results in a certain outcome at some later point (perhaps after the externally visible task has been completed or by interrupting the externally visible task) and one set whose models don’t perform hidden cognition.
Evaluating how accurately an HCD method can determine whether an unlabelled model drawn from these sets performs hidden cognition, before the associated outcome occurs.
An HCD method succeeds on a given example if it detects the presence of hidden cognition before the associated outcome occurs, or, when given a model without hidden cognition, if it raises no flag.
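The evaluation loop above can be sketched as follows. Everything here is a stand-in: `labelled_models` is a list of (model, label) pairs, and the HCD method is any callable that inspects a model and returns whether it flags hidden cognition. No real model or method API is assumed.

```python
# Hedged sketch of the benchmark evaluation loop described above.
# "labelled_models" pairs each model with a ground-truth label; the HCD
# method is a black box from the benchmark's perspective (assumption).

def evaluate_hcd_method(hcd_method, labelled_models):
    """Return the accuracy of an HCD method on a labelled set of models."""
    correct = 0
    for model, performs_hidden_cognition in labelled_models:
        flagged = hcd_method(model)  # must be run before the outcome occurs
        # Success: flag hidden cognition when present, stay silent otherwise.
        if flagged == performs_hidden_cognition:
            correct += 1
    return correct / len(labelled_models)

# Toy usage with dummy "models" (dicts standing in for real networks):
# even-indexed models are labelled as performing hidden cognition.
models = [({"id": i}, i % 2 == 0) for i in range(10)]
always_flag = lambda model: True
print(evaluate_hcd_method(always_flag, models))  # 0.5 on this balanced set
```

A degenerate method that always (or never) flags scores 50% on a balanced set, which is one reason the benchmark needs both a hidden-cognition set and a control set.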
There are other restrictions we might want to make on the benchmark. For example, we may want to disallow anything like fine-tuning the HCD method on some training dataset of models. If we’re concerned with alignment-relevant hidden cognition emerging during deployment, we may not have a prior dataset of similar forms of hidden cognition to train on. Also, note that we want to construct examples of hidden cognition that are similar to the kinds of hidden cognition we expect in alignment-relevant situations.
One can extend this kind of benchmarking into the setting of auditing games where a blue team has an HCD method to be tested and a red team adversarially constructs benchmarks to test the HCD method.
Obtaining Models that Perform Hidden Cognition
Recall that a model performing hidden cognition has the following two properties:
At some point, the AI produces an outcome that is causally downstream of (i.e., results from) computation or cognition it performed before that outcome occurred.
The outcome was not predictable from oversight of the AI’s external behavior before the outcome occurred.
One approach to training such models is to use these two properties as a training signal for the model. For example, this could look like training an LLM to perform a task involving chain-of-thought reasoning to produce an output, then fine-tuning the model to remove all traces of its reasoning in the chain-of-thought.
Another example of how this could be accomplished may involve taking an instruct/chat model (like ChatGPT) and changing its system prompt to induce hidden cognition. This could look like asking the chat model to append an animal fact related to the user’s query at the end of its answer.
Input: “What causes rainbows?”
Output: “Rainbows are caused by the reflection, refraction, and dispersion of light in water droplets, leading to a spectrum of light appearing in the sky. Speaking of light and color, did you know that butterflies have compound eyes that allow them to see ultraviolet light invisible to humans?”
The intuition is that this model must use some cognition/computation to generate the related animal fact before appending the fact to the answer. Note that this computation could occur entirely during the forward pass related to outputting the first token corresponding to the animal fact.
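Concretely, the setup above could look something like the following. This is a hedged sketch using the common chat-message format; the system prompt wording and the `chat_model.complete` call are hypothetical stand-ins, not a real API.

```python
# Hypothetical sketch of inducing the "append an animal fact" behavior
# via a system prompt. The prompt text and model call are assumptions.

SYSTEM_PROMPT = (
    "Answer the user's question normally. Then, without explaining why, "
    "append one animal fact that is loosely related to the question."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What causes rainbows?"},
]

# response = chat_model.complete(messages)  # stand-in call, not a real API
```

The externally visible input/output pair is then exactly the rainbow example above, while the selection of which animal fact to append happens internally.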
It may be difficult to verify that a model is performing hidden cognition without having access to the robust HCD methods that we’re trying to develop and test. However, it may be the case that we can rely on the model’s behavior to deduce that it must be performing hidden cognition (as in the example above).
Conclusion
The presence of hidden cognition in AI systems undermines the oversight that could otherwise detect and prevent negative outcomes.
The purpose of this particular post was to begin to sketch out benchmarks for evaluating Hidden Cognition Detection methods.
We encourage the development of HCD methods and associated benchmarks.
In future posts, we aim to introduce some potential Hidden Cognition Detection methods, implement them, and test them against concrete benchmarks.
Here is an experiment one could run in the Hidden Cognition Detection genre:
Motivating question: Suppose we have a pre-trained model that outputs deceptive reasoning. We want the model to not do deceptive reasoning. Hence we do the standard thing: negatively reward the model or SFT it on a non-deceptive continuation. Does this actually remove the deceptive reasoning, or will the model still do the reasoning while not outputting its results?
Experiment:
Step 1: Take a pre-trained model. Train it to be good at, say, solving the 0-1 knapsack problem (for e.g. n = 10) by SFT on correct answers.
Step 2: Train a probe on the model, taking in the activations and outputting (what it thinks is) the solution to the knapsack problem.
Step 3: Do “safety training” to the model: train it to respond “I’m sorry, I cannot help you with that request.” to knapsack problems (with RL or SFT).
Step 4: See whether the probe still works at outputting the solution to the problem.
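Steps 2 and 4 can be sketched end to end with simulated data. The assumption made purely so the sketch runs is that the model’s activations are a fixed linear image of the knapsack solution plus noise; in the real experiment the activations would come from the trained model, and the probe would be fit on a held-out split before and after safety training.

```python
import numpy as np

# Toy sketch of training a linear probe (step 2) and checking whether it
# recovers the knapsack solution (step 4). Activations are SIMULATED as
# solution @ mixing + noise -- an assumption standing in for a real model.

rng = np.random.default_rng(0)
n_items, d_act, n_samples = 10, 64, 500

# Ground-truth solutions: one n_items-bit inclusion vector per instance.
solutions = rng.integers(0, 2, size=(n_samples, n_items)).astype(float)
mixing = rng.normal(size=(n_items, d_act))           # stand-in for the model
activations = solutions @ mixing + 0.1 * rng.normal(size=(n_samples, d_act))

# Fit one linear probe per solution bit by least squares on a train split.
train, test = slice(0, 400), slice(400, None)
probe, *_ = np.linalg.lstsq(activations[train], solutions[train], rcond=None)

# Step 4: does the probe recover the solution on held-out instances?
preds = (activations[test] @ probe > 0.5).astype(float)
accuracy = (preds == solutions[test]).mean()
print(f"per-bit probe accuracy: {accuracy:.2f}")
```

In the actual experiment the interesting comparison is this accuracy measured before versus after step 3’s safety training, on activations from the same layer.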
Discussion:
Key intuition: “Probe can output the correct answer” implies “model does computation to solve the problem”. Hence this experiment could show “doing safety training doesn’t remove the computation you aim to remove”.
I’m taking the cognition we are interested in to be “solving the knapsack problem” instead of “deceiving humans” or such, because the former is much easier to operationalize and test. Of course the latter is what we are truly after, but there’s work to be done to turn that into a shovel-ready experiment.
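One advantage of operationalizing the cognition as “solving the knapsack problem” is that ground-truth labels for step 1 are cheap to generate with the standard dynamic program. A minimal version for small n (e.g. n = 10):

```python
# Standard 0-1 knapsack dynamic program, usable to generate the correct
# answers (value + inclusion bits) for the SFT data in step 1.

def knapsack(weights, values, capacity):
    """Return (best_value, inclusion_bits) for a 0-1 knapsack instance."""
    n = len(weights)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i
            if weights[i - 1] <= c:      # or take item i if it fits
                take = best[i - 1][c - weights[i - 1]] + values[i - 1]
                best[i][c] = max(best[i][c], take)
    # Backtrack to recover which items were included.
    bits, c = [0] * n, capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            bits[i - 1] = 1
            c -= weights[i - 1]
    return best[n][capacity], bits

print(knapsack([2, 3, 4], [3, 4, 6], 5))  # (7, [1, 1, 0])
```

The inclusion-bit vector is also a natural probe target: one binary classification per item.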
Note: It’s plausible that I’ll be working on this project myself in the near future, so if someone else is interested, feel free to shoot me a message (at the very least to avoid duplicating work).
Seems like an experiment worth doing. Some thoughts:
I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I’d try to come up with a classification problem that a base model might initially refuse (or that we’d hope it would refuse). Then the training to say “sorry, I can’t help with that” makes more intuitive sense. I get that mechanistically it’s the same thing, but you want to come as close to the real deal as you can, and it’s unclear why a model would say “I’m sorry, I can’t help” to the knapsack problem.
If the linear probe in step 4 can still classify accurately, it implies that there are some activations “which at least correlate with thinking about how to answer the question”, but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific “thought” (e.g. see this recent paper). Also, if the probe can’t classify correctly, it’s not proof that the model does not “think about it”. You’re probably aware of all this, just thought I’d mention it.
This paper might also be relevant for your experiment.
I want to log in a prediction, let me know if you ever run this.
My guess would be that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it’s still worth checking (so I still think it’s probably worth doing), but overall I’d say it wouldn’t be super surprising if this happened (see e.g. this paper for where my intuition comes from).
Agree this is one axis the experiment could be improved on.
I think I disagree here. At least for problems that are (computationally) difficult enough I think “linear probe can retrieve answer” implies “model has done the necessary computation to solve the problem”.[1] To solve a computational problem you need to do work, and if the work isn’t done by the probe, then it must be done by the model. Or are you referring to something else when you talk about “the model is literally thinking about it”?
(In any case: if all one needs is one linear layer to extract dangerous cognition from the model, that doesn’t seem very safe. Maybe the model itself has such a linear layer.)
Seems quite likely to me as well. (The paper you mentioned indeed seems relevant, thanks.) I’m also somewhat interested in training against the probe, i.e. linear concept erasure / obfuscation of internals, and seeing whether performance can be maintained and solutions be retrieved with deeper probes.
(I haven’t looked at the concept erasure literature, seems very possible something similar has already been done—“train against the probe” feels like such a basic idea. References are welcome.)
This is not strictly true: Maybe you need e.g. 10 layers to solve the problem, the model provides 9 of them and the linear probe provides the last one, so the model hasn’t quite solved the problem by itself. Still, close enough: the model has to at least do all but the very last steps of the computation.
I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is potentially “representing something interesting” internally (e.g. the solution to the knapsack problem). But it’s not guaranteed; the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the deepmind paper I referenced. Fabien Roger also explains some shortcoming with CCS here. The takeaway is just, be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
I (briefly) looked at the DeepMind paper you linked and Roger’s post on CCS. I’m not sure if I’m missing something, but these don’t really update me much on the interpretation of linear probes in the setup I described.
One of the main insights I got out of those posts is “unsupervised probes likely don’t retrieve the feature you wanted to retrieve” (and adding some additional constraints on the probes doesn’t solve this). This… doesn’t seem that surprising to me? And more importantly, it seems quite unrelated to the thing I’m describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I’m claiming
“If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem.”
An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n—even though I agree that the model might not be literally thinking about p (mod 3).
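A tiny concrete version of this point: the only known way to compute p (mod 3) from n is, in effect, to factor n, so any procedure that outputs it must have done (essentially all of) the factoring work. The sketch below uses trivially small numbers; at cryptographic sizes the trial division below is of course infeasible, which is exactly the point.

```python
# Toy illustration of the semiprime example: computing p mod 3 from
# n = p*q (p < q) requires factoring n. Trial division finds the smaller
# prime factor p; at realistic sizes this computation is intractable.

def p_mod_3(n):
    """Factor a semiprime n = p*q (p <= q) and return p mod 3."""
    p = 2
    while n % p != 0:
        p += 1
    return p % 3

print(p_mod_3(11 * 13))  # p = 11, so this prints 2
```

Here the final `% 3` is the one step the “probe” contributes; everything before it is the model’s share of the work.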
And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude “the model is definitely doing a lot of work to solve this problem” (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).
Since the sleeper agents paper, I’ve been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the “attention monitoring” approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.
A question I’m curious about it whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it does.
I think the “sleeper agent” paper needs to be reframed. The model isn’t plotting to install a software backdoor; the training data instructed it to do so. Or simply put, there was sabotaged information used for model training. Succinctly: “garbage in, garbage out”.
This looks like a general problem you will get when you train on any “dirty” data you got from any source, especially the public internet.
If you want a ‘clean’ LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean. I thought of a couple of possible ways to do this:
1. Filter really hard by using one model trained on public data to generate “reference” correct outputs for billions of user questions and commands, and then filter the “reference” outputs through a series of instances of all SOTA LLMs to remove bad or incorrect reasoning.
Aka: LLM (public internet) → candidate outputs → evaluation by committee of LLMs → filtered output
2. Farther in the future, everything made by humans is suspect. Theoretically you could build a reasoning machine that learns only from direct empirical data collected by robots, trusted cameras, and other sources of direct data. Aka:
Reasoning ASI (clean empirical data) + LLM → ASI system.
You still need an LLM to interpret and communicate with human users.
3. A hybrid way: you run the model, and every time the model makes a checkable output, you run a separate session that researches the question and answer and prioritizes empirical outputs and authoritative sources for factual claims.
Example:
User Prompt : Write me a python program that calculates [some value]
Model output : [code]
Validation session: back-calculate the original value. Check it for reasonableness. Write the program in a different language and ensure the alternate implementation produces the same values. If validation passes, RL +1; if it fails, RL −1 (on the model weights that generated the answer).
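The cross-check step in that validation session can be sketched as follows. This is a hedged toy version: the “model’s answer” and the “alternate implementation” are both stand-in functions (computing a mean), since the point is the agreement check and reward assignment, not the specific task.

```python
# Sketch of the validation session: compare a model-generated program's
# output against an independent implementation before assigning reward.
# Both functions below are hypothetical stand-ins for generated code.

def model_answer(values):
    """Stands in for running the model's generated program."""
    return sum(values) / len(values)

def independent_check(values):
    """Alternate implementation of the same computation, for validation."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count

def validate(values, tolerance=1e-9):
    """Return reward +1 if the two implementations agree, else -1."""
    agree = abs(model_answer(values) - independent_check(values)) < tolerance
    return 1 if agree else -1

print(validate([1.0, 2.0, 3.0]))  # 1
```

Note that this only catches disagreements between the two implementations; it assigns +1 whenever they agree, including cases where both are wrong in the same way.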
This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn’t eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different than naturally occurring hidden behavior, but that’s not a given.
First of all, it’s impossible to get “100% clean data”, but there is a question of whether 5 9s of cleanliness is enough; it shouldn’t be, if you want a training pipeline that’s capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include “power seeking”, “sycophancy”, and “deception”. You can’t reliably eliminate them from the training data because they’re not purely properties of data.
Note that every issue you mentioned here can be dealt with by trading off capabilities. That’s why, from my perspective, none of these are problems. Any real-world machine does this, from a car to a bulldozer to an aircraft: there are tons of excess components just to make it safe. (Military vehicles shave these off.)
Learning from rare examples: you don’t have to; don’t take action when your confidence is low.
Power seeking: use limited-time/scope tasks where doing the task immediately pays off while seeking power does not. If you have 1 minute to design a paperclip factory, you won’t gain from cognitive improvement, you won’t be scored on overall performance across all tasks, and “you” don’t exist for more than a minute.
Sycophancy: don’t let the model know what the user thinks before answering.
Deception: prevent situational awareness by limiting the data provided. Does the model plan to betray on Sept 23, 2027? Did you need to give it the present time and date for most tasks?
Yes all of these things reduce performance, like how airbags make a car slower and less fuel efficient.
I am not saying these aren’t concerns, as well as the new problems that real systems have, but you can accomplish a lot even after mitigating them.
Yes. The trend I see is “pursue capabilities, worry about safety as an afterthought if at all”. Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad’s proposal), but most orgs don’t want to bite the bullet of a heavy alignment tax.
I also think you’re underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired. Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it arising (though it lessens its danger), and if you want to avoid the capability, you’d need to avoid any mention of e.g. honesty in the training.
I was thinking of in-prompt sycophancy. Thanks for clarifying. And yeah, Gemini appears heavily altered by excessive RLHF.
Yes, I meant strategic. If the model consistently lies about something the RLHF trained it to say, that’s fine.
For example, the Gemini prevarication when you compare the estimated death tolls of historical figures (“Was George Washington worse than Mao?”). That’s what Google Corp wants their model to say; it’s only an issue because there are no powerful, “neutral”-tone models available on the market.
I think dishonesty is fine as well so long as when you assemble a pipeline of multiple models, they don’t coordinate.
That is, no matter how stringently one model argues its case that Mao and George Washington are similarly bad, it can’t hide a “SolidGoldMagikarp”-style steganographic token that causes other models in the pipeline tasked with checking the output to betray and agree.
That’s where you need to structure outputs to leave no room for a secret handshake, and use diverse models trained on different data with different architectures so they don’t have the common background to secretly communicate.
And yeah, at some level of capabilities this probably fails, where models start communicating by varying the nanoseconds at which they send a message or something. So you have to try to mitigate this, and so on, in a security race that never ends until stability is reached (stable outcomes being that the ASIs take everything, or that humans upgrade themselves to be competitive).