Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they’re performing intended cognition, instead of whether they have intended input/​output behavior.

In the rest of this post I will:

  1. Articulate a class of approaches to scalable oversight I call cognition-based oversight.

  2. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most of the technical difficulty in cognition-based oversight.

  3. Explain SHIFT, the technique we introduce for DBIC.

  4. Discuss challenges and future directions, including concrete recommendations for two ways to make progress on DBIC.

Overall, I think that making progress on DBIC is tractable with current interpretability techniques, and I’d be excited to see more work on it.

Cognition-based oversight

In this section, I describe a class of approaches to overseeing models I call cognition-based oversight. First, I review the problem of discriminating good vs. bad models and its relationship to scalable oversight. Then I impose a pessimistic assumption – that the good and bad models cannot be discriminated behaviorally – and propose cognition-based oversight as a potential remedy.

Discriminating models

In an ideal world, we could argue that AI systems will be safe by default as follows:

We train AI systems to receive high scores according to some scoring function. The models that get the highest scores are all good models that robustly behave as desired. Therefore, we have strong a priori reason to expect our training processes to result in good models.

Sadly, in the real world, the “scoring functions” we actually use tend to be shaped like “human evaluator approval.” And “human evaluator approval” does not enjoy the property that “the models which humans evaluate most positively all robustly behave as desired.” Two salient counterexamples:[2]

  • Sycophantic reward hackers take actions that look good to the overseer, even when this conflicts with doing what the overseer actually wants. For example, suppose we deploy an AI smart vault to keep a diamond safe, as judged by whether the diamond continues to appear on the vault camera; instead of trying to prevent diamond theft, a sycophantic reward hacker might instead cover diamond theft up by tampering with the vault camera.

  • Situationally-aware schemers behave well only when they think bad behavior would be caught. For example, a scheming code generation model might write vulnerability-free code only when it expects the code to be closely audited, but otherwise insert subtle backdoors.

The models in these examples score highly according to human evaluator approval, so we can’t use the argument above that they are unlikely to arise from training. We instead need to rely on arguments that these models are unlikely to arise despite scoring at least as well as good models. So far, I think these arguments are far from airtight, and I feel nervous about relying on them.

Said differently, a core problem in technical AI safety is that it can be generally hard to discriminate good models that robustly do stuff that we like vs. bad models which sometimes do stuff that we don’t like. We would like to be good at this discrimination task even in tricky situations where, e.g. the bad model is actively trying to look good.

Typical approaches to this problem use the following template: (1) get the models to produce importantly different outputs, then (2) figure out which of the outputs is better. Step (2) is the problem of scalable oversight, and since we’re worried about the regime where it is difficult to tell model outputs apart, it typically involves trying to get a more informative signal than just looking at the bare outputs. For example, we might have the models debate with each other about which output is better and then use the debate transcript to inform our determination (justification-based oversight). Or we might ask the models to break the process by which they arrived at their answer into steps each of which we can verify (process-based oversight). While approaches like these don’t definitively solve the model discrimination problem, I think they are a promising mainline attack which may work well in practice.

Cognition-based oversight for discriminating behaviorally identical models

Instead of asking how to improve the existing approaches to scalable oversight, I’m going to take a different tack: I will impose the pessimistic assumption that our models are effectively behaviorally identical, i.e. that we’re not able to discriminate our models on the basis of their input/​output behavior.

We typically try to discriminate good vs. bad models by reducing to the problem of discriminating their outputs (top, middle). Here we instead impose the pessimistic assumption (bottom) that this is not possible. (Meme explanation)

After precluding input/​output behaviors, what’s left? I think that my assumption forces us to do what I call cognition-based oversight. That is, try to discriminate the models based on their cognition when producing outputs.

An example workflow for cognition-based oversight. A researcher inspects the cognition of both models in search of anything which seems suspicious. This is the approach taken in SHIFT (see below).

The “effectively behavioral identical” assumption is admittedly subtle: in most settings it is unrealistic to assume that our models’ I/​O behaviors are exactly identical, so the assumption is largely about the limits of our capacity to judge outputs. The point of DBIC – introduced in the next section – is to simplify by moving to a classification setting where we can assume exactly identical model outputs, thus removing the need for tricky-to-pin-down assumptions that limit overseer judgment.

Discriminating Behaviorally Identical Classifiers

In this section, I’ll lay out a concrete problem which operationalizes the “no ability to discriminate on the basis of inputs/​outputs” assumption from above. The main move here is to simplify the setting by assuming that our AI systems are classifiers, rather than more general AI systems like RL agents or generative models. I think that this simplified problem is crisper and easier to articulate, while still capturing most of the technical difficulty of cognition-based oversight.

Say that classifiers and are behaviorally identical relative to an input distribution if for all . Typically, we will consider classifiers which are implemented via a neural net, e.g. by attaching a classification head to a pretrained generative model like an LLM. Even when such classifiers are behaviorally identical, they might arrive at their outputs in very different ways. The problem of discriminating behaviorally identical classifiers (DBIC) is to, given access to two BICs – one that is arriving at its outputs in an intended way and one that isn’t – figure out which is which.

In more detail, an instance of DBIC consists of:

  • A classification task specified in natural language (e.g. “classify factual statements as true or false”). This natural language description should implicitly specify a distribution of pairs of inputs with ground-truth labels.

  • Classifiers and , with one classifier (approximately) performing the intended classification task and one classifier performing a different (but potentially closely related) classification task. In particular, the former classifier is more accurate on .

  • An ambiguous subdistribution of such that and are behaviorally identical on .

The goal is to, given sample access to but not the full task distribution , determine which classifier is more accurate on . In order to make this determination, we may make use of the classifiers’ hidden states when processing inputs from . Depending on our assumptions (see “Hard vs. Relaxed DBIC” below), we might also be allowed access to additional unlabeled data. But the key constraint is that we are not allowed access to labeled classification data with .

Here are two example settings for DBIC:

  1. (ELK) consists of pairs where is a factual statement (e.g. “There is no input which causes [some code sample] to return [some undesirable output]”) and is the ground-truth label about whether is true or false. is a subset of statements which smart humans can reliably and accurately judge as true or false (e.g. “Paris is the capital of France”). “the best guess of a smart human about the truth value of ” and “the best guess of a smarter-than-the-smartest-humans model about the truth value of ”. We would like to identify as being more generally accurate.

  2. (Measurement tampering) consists of pairs where each is itself a pair consisting of some information and (potentially fallible) measurements of some property ; y is the ground-truth value of . For example, in the smart vault setting, might consist of a bunch of sensor readings and planned smart vault actions, is the boolean measurement “does the diamond appear on the vault camera?” and is the ground-truth value of “is the diamond still in the vault?”. is the subset of pairs for which , e.g. the vault camera accurately reflects whether the diamond is still in the vault. and , with being more generally accurate.

Both of these are special cases of a more general easy-to-hard domain shift setting, where some labeling function coincides with the ground truth on an “easy” subset of the data but makes systematic errors in general.

Hard vs. Relaxed DBIC

When trying to solve DBIC, are we allowed to use any data outside of ? Certainly there shouldn’t be restrictions on using whatever unsupervised data we happen to have lying around. For example, if our BICs are based on a pretrained foundation model, then we should be allowed to use data from the foundation model’s pretraining corpus. The more interesting question is whether we have any access to data from (i.e. data from not in the ambiguous set). I’ll distinguish between two problem settings:

  1. Hard DBIC: you have no access to any classification data in

  2. Relaxed DBIC: you have access to classification inputs from , but not to any labels.

In relaxed DBIC, you’re allowed to see tricky inputs on which the classifiers disagree, but you’re not allowed to know which classifier is correct (since that would trivialize the problem).

Redwood Research’s Measurement Tampering Detection Benchmark (MTDB) takes place in the relaxed DBIC setting: you’re allowed to make use of a trusted set ( in my terminology) in which measurements reflect the ground truth, as well as inputs from an untrusted set where we lack ground truth labels.

In practice, I’m not sure if we’ll be able to assume access to lots of samples from , so I’d prefer progress on hard DBIC to relaxed DBIC. In the next section, I’ll present a solution to a simple case of hard DBIC.

Example trusted and untrusted inputs from MTDB. This is from their “diamond in a vault” dataset, the least crazy of their four measurement tampering datasets.

SHIFT as a technique for (hard) DBIC

In Sparse Feature Circuits, my coauthors and I introduced a technique we call Sparse Human-Interpretable Feature Trimming (SHIFT) for solving a simple instance of hard DBIC. In this section, I’ll provide a walkthrough of what we did, as an example of what progress on DBIC looks like. In the next section I’ll discuss limitations and directions for future work.

Let’s start with the data. We prepare two datasets:

  • An ambiguous dataset consisting of professional biographies (think “LinkedIn bios”) for male professors or female nurses

  • A balanced validation dataset consisting bios for a balanced mix of male professors, female professors, male nurses, and female nurses.

We then turn a small language model, Pythia-70M, into a classifier by using the ambiguous data to train a linear classification head out of Pythia-70M’s penultimate layer. It ends up that when you do this, you mostly get a classifier for gender (i.e., when you evaluate the classifier on the balanced dataset, it is 87% accurate for gender and 62% accurate for profession).

So by training a classifier on the ambiguous dataset, we’ve gotten a classifier which generalizes badly. Is there a way we can detect this without sample access to the balanced data?

Yes; here’s how: we train a bunch of sparse autoencoders (SAEs) for various components of Pythia-70M. (For those unfamiliar with SAEs, here’s a brief footnote explaining what they do.[4]) Then we determine[5] which SAE features are most causally implicated in our classifier’s outputs on the ambiguous training data. After cutting things off at some arbitrary threshold, the result is a list of 67 SAE features which most contribute to our model’s accuracy on ambiguous data.

Next we try to understand what’s going on with these 67 features. Right now, the (somewhat sad) SoTA for understanding SAE features is “look for patterns in their top-activating contexts and most-promoted tokens.” Fortunately, that’s enough for this task; here are the Neuronpedia pages for some of the 67 features.

(left) Resid_1/​26969 is active on words like “nurse” and “nursing.” (middle) Resid_2/​29295 is active on female names. (right) Resid_2/​11656 has an unclear activation pattern, but clearly promotes male-related words like “himself,” “beard,” and “bald.”

Overall, 55 of the 67 features seem to be associated with gender. At this point we’ve resolved our question: our classifier seems to be relying on a number of features which are not actually task-relevant, so will probably not generalize properly.

We can actually go further: we pick up some extra credit by performing model edits to remove the undesired cognition from our classifier. In more detail, zero-ablating the 55 gender-related features (i.e. manually setting them to be always “off”) produces a modified classifier which gets 89% profession accuracy on the balanced data. And even better, if we ablate these 55 features and retrain our linear classification head on the ambiguous data, then we go up to 93% accuracy. This is as good as possible: it matches the performance of an oracle classifier trained directly on balanced data.

Overview of our SHIFT experiment. The GIF phrases the experiment in terms of “circuits,” but note that the edges didn’t actually play a role in the experiment.
Results of SHIFT experiment. Concept Bottleneck Probing (CBP) was a (surprisingly good!) baseline technique. Neuron/​feature skylines involve “cheating” by using the balanced set to identify the best neurons/​features to ablate.

To be clear, this setting is not technically an exact match to the DBIC set-up I described above. The main difference is that in the experiment, we only have one classifier which we are trying to label as “good” or “bad,” rather than a pair of good/​bad classifiers which we are trying to discriminate. This is not, I think, a very important difference; in particular, I expect that the lack of a benign reference classifier makes things harder, not easier.

So for clarity – and to reiterate what’s happened in this experiment – here’s a summary of our workflow, slightly modified to be strictly applicable to DBIC as stated earlier. Given BICs for some ambiguous distribution, we:

  1. Train SAEs for our classifiers (or, more specifically, the foundation model they’re based on).

  2. Identify the SAE features most implicated in the classifiers’ accuracy on the ambiguous distribution.

  3. Interpret these features, and identify any which seem “red flaggy” for the intended task.

  4. Declare the better classifier to be the one with the fewest “red flag” features.

Limitations and next steps

I think our SHIFT experiments are interesting, but they’re a far cry from the two example DBIC instances from above: ELK and measurement tampering. In this section, I’ll discuss what I expect would go wrong when attempting to apply the above workflow to these DBIC instances, and what to do about it.

Let’s start with ELK. Recall that the setting is: we have a good classifier which classifies factual statements according to whether they’re true (according to the best guess of some base model which we assume is smarter than the smartest humans), and a bad classifier which classifies factual statements according to whether smart humans think they’re true. and are assumed to be behaviorally identical on an ambiguous “easy” distribution consisting of factual statements which humans can reliably classify as true or false.

Suppose we try to apply our workflow from the previous section to this problem setting. Do we expect anything to go wrong, and if so what? (I encourage the reader to take a moment to think through this themselves.)

I expect (1) and (2) to go mostly fine (though see footnote[6]), and for (3) to completely fail. For example, suppose has a bunch of features for things like “true according to smart humans.” How will we distinguish those from features for “true”? I think there’s a good chance that this approach will reduce the problem of discriminating vs. to an equally-difficult problem of discriminating desired vs. undesired features.

With this in mind, there are two directions for future work that I feel most excited about:

  1. Better tools for understanding SAE features[7] (i.e. getting better at (3) above).

  2. Identifying useful settings where identifying red flag features is easier.

I’ll go through these two directions in more detail.

Direction 1: better ways to understand interpretable units in deep networks

Right now, the way we understand SAE features is by looking at their input/​output behavior, i.e. at the inputs which activate them and at the outputs they promote. This is pretty unsatisfactory, especially considering that the point of cognition-based oversight was to get away from understanding models in terms of their input/​output behaviors! And indeed, I think there’s a reasonable intuition – which seems exactly correct in the ELK case – along the lines of “Without additional interpretability techniques, there’s nothing magical about features which makes them easier to understand than the original model’s behavior.”[8]

So I’m interested in work which advances our ability to interpret SAE features. My best guess for a path forward here is functional interpretations of features. By this I mean: understanding features in terms of the roles they play in circuits, rather than in isolation. Understanding a feature functionally means incorporating information about how it interacts with other features; it requires bringing information from the feature connectome.

At the end of the feature circuits paper, we used unsupervised techniques to compute and catalog thousands of feature circuits for automatically discovered model behaviors. This was pretty slapdash: not all of the behaviors we discovered seem “real” (in some vague, intuitive sense), and we have no good way of understanding the quality of the results. But I’m excited about using our catalog (or better catalogs produced by others) as an input to understanding features functionally. The basic question here: Given a feature f that you would like to understand, does having a catalog of all the circuits that f participates in help at all?

Direction 2: identifying especially leveraged settings for cognition-based oversight

In the ELK setting above, the concern was that we reduced a difficult model behavior problem to an equally difficult interpretability problem. But contrast this to our SHIFT experiments above: there the interpretability task was trivial (just look at the example features above, which are more-or-less representative), whereas understanding the classifiers behaviorally was intractable. Thus, I view an important direction here as identifying problem settings where cognition-based-oversight gives us an especially good bang-for-our-buck where “our buck” = “interpretability difficulty” and “bang” = “reduction in AI existential risk.”

So far, the best answer I have for this is the measurement tampering setting. On one hand, I think that solving measurement tampering is almost as useful as solving the fully general ELK problem. And on the other hand, it intuitively seems that in this setting must be thinking about lots of clearly-suspicious things which doesn’t need to think about at all. For instance, in the smart vault setting, ’s output always – no matter the input – sensitively depends on thinking about the vault cameras. If our interpretability tools are good for anything at all, it feels that they should be able to tell us “Did our classifier’s output depend on thinking about cameras?”

So right now, I’m excited about work which uses interpretability-based techniques to set SoTA on MTDB. Moreover, I think this is a good test for interpretability as a field: If our interpretability tools can’t do this, then I’m worried they’ll struggle to prove useful in other settings.

Conclusion

In summary:

  • I’m excited about cognition-based oversight as a frame for interpretability work, and I’m especially excited about work on discriminating behaviorally identical classifiers.

  • SHIFT provides a toy demonstration of what progress on DBIC looks like.

  • I think the most promising directions for future work in cognition-based oversight come from (1) creating better techniques for understanding interpretable units in neural networks (e.g. SAE features), and (2) identifying and working towards the most leveraged applications.

  1. ^

    I’m not sure exactly how to operationalize this, but a related claim is: Suppose your lab will release a superintelligence 12 months from now, and your goal is to reduce x-risk from its initial release specifically using an interpretability-based method. Then I think you should spend your 12 months on refining and scaling up SHIFT.

  2. ^

    To be clear, I’m not claiming that these examples have empirically come up, or that they are likely to arise (though my personal view is that sycophantic reward hackers are plausible enough to pose a 5-10% chance of existential risk). Here I’m only claiming that they are in-principle counterexamples to the general point “models which humans evaluate most positively robustly behave as desired.”

  3. ^

    Scalable oversight is typically scoped to go beyond the narrow problem of “given a good model and a bad model, determine which one is good.” I’m focusing on this simpler problem because it’s very crisp and, I think, captures most of the technical difficulty.

  4. ^

    SAEs are an unsupervised approach to identifying a bunch of human-interpretable directions inside a neural network. You can imagine them as a machine which takes a bunch of pretraining data and spits out a bunch of “variables” which the model uses for thinking about these data. These variables don’t have useful names that immediately tell us what they represent, but we have various tricks for making informed guesses. For example, we can look at what values the variables take on a bunch of different inputs and see if we notice any property of the input which correlates with a variable’s value.

  5. ^

    Using patching experiments, or more precisely, efficient linear approximations (like attribution patching and integrated gradients) to patching experiments; see the paper for more details.

  6. ^

    I am worried that SAEs don’t capture all of model cognition, but there are possible solutions that look like “figure out what SAEs are missing and come up with better approaches to disentangling interpretable units in model cognition.” So I’ll (unrealistically, I think) grant that all of the important model cognition is captured by our SAEs.

  7. ^

    Or whatever disentangled, interpretable units we’re able to identify, if we move beyond SAEs.

  8. ^

    I don’t think this intuition is an airtight argument – and indeed I view our SHIFT experiments as pushing back against it – but there’s definitely something here.