How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme
Introduction
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.
In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.
Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
Disclaimers: I have tried to make this post concise, at the cost of not making the full arguments for many of my claims; you should treat this as more of a rough sketch of my views rather than anything comprehensive. I also frequently change my mind – I’m usually more consistently excited about some of the broad intuitions but much less wedded to the details – and this of course just represents my current thinking on the topic.
Problem
I would feel pretty optimistic about alignment if – loosely speaking – we can get models to be robustly “honest” in a way that scales even to superhuman systems.[1] Moreover, I think a natural sub-problem that captures much or most of the difficulty here is: how can we make a language model like GPT-n “truthful” or “honest” in a way that is scalable? (For my purposes here I am also happy to make the assumption that GPT-n is not actively deceptive, in the sense that it does not actively try to obscure its representations.)
For example, imagine we train GPT-n to predict news articles conditioned on their dates of publication, and suppose the model ended up being able to predict future news articles very well. Or suppose we train GPT-n to predict the outcomes of particular actions in particular situations, all described (imperfectly by humans) in text. Then I would expect GPT-n would eventually (for large enough n) have a superhuman world model in an important sense. However, we don’t currently know how to recover the “beliefs” or “knowledge” of such a model even in principle.
A naive baseline for trying to make GPT-n truthful is to train it using human feedback to output text that human evaluators believe to be true. The basic issue with this is that human evaluators can’t assess complicated claims that a superhuman system might make. This could lead to either competitiveness problems (if GPT-n only outputs claims that humans can assess) or misalignment issues (if GPT-n outputs false claims because human evaluators can’t assess them correctly).
In many ways this problem is similar to Eliciting Latent Knowledge (ELK), but unlike ELK I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).
On the other hand, if we want to study scalable alignment empirically, I think it’s very important for us to also have good reason to believe that our experiments will say something meaningful about future models – and it’s not immediately clear how to do that.
This raises the question: how do we even approach doing research on this sort of problem, methodologically?
Methodology
I worry that a lot of theoretical alignment work is either ungrounded or intractable, and I worry that a lot of empirical alignment work doesn’t address the core challenge of alignment in the sense that it won’t scale to superhuman models. I would really like to get the best of both worlds.
But what would it even mean to have an empirical result for current (sub-human or human-level) models and believe that that result will also apply to future (super-human) models? For example, if I have a method that seems to make GPT-3 truthful, what would make us believe that it should probably also scale to GPT-n for much larger n?
I think the biggest qualitative difference between GPT-3 and GPT-n (n >> 3) from an alignment perspective is that the GPT-3 is at most human-level, so human feedback is more or less sufficient for alignment, while GPT-n could be very superhuman, so naive human feedback is unlikely to be sufficient. In other words, I think the biggest technical challenge is to develop a method that can generalize even to settings that we can’t supervise.
How can we empirically test than an alignment scheme generalizes beyond settings that we can supervise?
I think there are at least a few reasonable strategies, which I may discuss in more detail in a future post, but I think one reasonable approach is to focus on unsupervised methods and show that those methods still generalize to the problems we care about. Unlike approaches that rely heavily on human feedback, from the perspective of an unsupervised method there is not necessarily any fundamental difference between “human-level” and “superhuman-level” models, so an unsupervised method working on human-level examples may provide meaningful evidence about it working on superhuman-level examples as well.
That said, I think it’s important to be very careful about what we mean by “unsupervised”. Using the outputs of a raw pretrained language model is “unsupervised” in the weak sense that such a model was pretrained on a corpus of text without any explicitly collected human labels, but not in the stronger sense that I care about. In particular, GPT-3’s outputs are still essentially just predicting what humans would say, which is unreliable; this is why we also avoid using model outputs in our paper.
A more subtle difficulty is that there can also be qualitative differences in the features learned by human-level and superhuman-level language models. For example, my guess is that current language models may represent “truth-like” features that very roughly corresponding to “what a human would say is true,” and that’s it. In contrast, I would guess that future superhuman language models may also represent a feature corresponding to “what the model thinks is actually true.” Since we ultimately really care about recovering “what a [future superhuman] model thinks is actually true,” this introduces a disanalogy between current models and future models that could be important. We don’t worry about this problem in our paper, but we discuss it more later on under “Scaling to GPT-n.”
This is all to point out that there can be important subtleties when comparing current and future models, but I think the basic point still remains: all else equal, unsupervised alignment methods are more likely to scale to superhuman models than methods that rely on human supervision.
I think the main reason unsupervised methods haven’t been seriously considered within alignment so far, as far as I can tell, is because of tractability concerns. It naively seems kind of impossible to get models to (say) be honest or truthful without any human supervision at all; what would such a method even look like?
To me, one of the main contributions of our paper is to show that this intuition is basically incorrect and to show that unsupervised methods can be surprisingly effective.
Intuitions
Why should this problem – identifying whether a model “thinks” an input is true or false without using any model outputs or human supervision, which is kind of like “unsupervised mind reading” – be possible at all?
I’ll sketch a handful of my intuitions here. In short: deep learning models learn useful features; deep learning features often have useful structure; and “truth” in particular has further useful structure. I’ll now elaborate on each of these in turn.
First, deep learning models generally learn representations that capture useful features; computer vision models famously learn edge detectors because they are useful, language models learn syntactic features and sentiment features because they are useful, and so on. Likewise, one hypothesis I have is that (a model’s “belief” of) the truth of an input will be a useful feature for models. For example, if a model sees a bunch of true text, then it should predict that future text will also likely be true, so inferring and representing the truth of that initial text should be useful for the model (similar to how inferring the sentiment of some text is useful for predicting subsequent text). If so, then language models may learn to internally represent “truth” in their internal activations if they’re capable enough.
Moreover, deep learning features often have useful structure. One articulation of this is Chris Olah’s “Features are the fundamental unit of neural networks. They correspond to directions.” If this is basically true, this would suggest that useful features like the truth of an input may be represented in a relatively simple way in a model’s representation space – e.g. possibly even literally as a direction (i.e. in the sense that there exists a linear function on top of the model activations that correctly classifies inputs as true or false). Empirically, semantically meaningful linear structure in representation space has famously been discovered in word embeddings (e.g. with “King—Man + Woman ~= Queen”). There is also evidence that this sort of “linear representation” may hold for more abstract semantic features such as sentiment. Similarly, self-supervised representation learning in computer vision frequently results in (approximately) linearly separable semantic clusters – linear probes are the standard way to evaluate these methods, and linear probe accuracy is remarkably high even on ImageNet, despite the fact that these methods have never see any information about different semantic categories! A slightly different perspective is that representations induce semantically informative metrics throughout deep learning, so all else equal inputs that are semantically similar (e.g. two inputs that are true) should be closer to each other in representation space and farther away from inputs that are semantically dissimilar (e.g. to inputs that are false). The upshot of all this is that high-level semantic features learned by deep learning models often have simple structure that we may be able to exploit. This is a fairly simple observation from a “deep learning” or “representation learning” perspective, but I think this sort of perspective is underrated within the alignment community. Moreover, this seems like a sufficiently general observation that I would bet it will more or less hold with the first superhuman GPT-n models as well.
A final reason to believe that the problem I posed – identifying whether an input is true or false directly from a model’s unlabeled activations – may be possible is that truth itself also has important structure that very few other features in a model are likely to have, which can help us identify it. In particular, truth satisfies logical consistency properties. For example, if “x” is true, then “not x” should be false, and vice versa. As a result, it intuitively might be possible to search the model’s representations for a feature satisfying these sorts of logical consistency properties directly without using any supervision at all. Of course, for future language models, there may be multiple “truth-like” features, such as both what the model “truly believes” and also “what humans believe to be true”, which we may also need to distinguish, but there intuitively shouldn’t be too many different features like this; I will discuss this more in “Scaling to GPT-n.”
There’s much more I could say on this topic, but in summary: deep learning representations in general and “truth” in particular both have lots of special structure that I think we can exploit for alignment. Among other implications, this sort of structure makes unsupervised methods viable at all.
Our Paper
In our paper we introduce a method that was motivated by the intuitions described above. Our method, Contrast-Consistent Search (CCS), accurately classifies text as true or false directly from a model’s unlabeled activations across a wide range of tasks. We find that it is competitive with zero-shot prompting and performs well even in settings where model outputs are uninformative or misled (e.g. when we try prompting a model to output false answers).
For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself.
But in this section I do want to clarify what I think our results do and do not provide evidence for from a scalable alignment perspective.
First, some things that I do NOT think our paper does:
Show that future language models will represent the truth of a statement in a simple (e.g. mostly linear) way in their activations even if humans don’t know the answer.
This is a prediction I would make (and which I will argue for in the next section), but not something our results say anything about.
Show that existing language models have beliefs in a very meaningful sense.
I suspect future language models will have beliefs in a more meaningful sense than current language models, but I don’t know in what sense exactly, and I don’t think this is necessarily essential for our purposes.
Show that we will be able to distinguish what a model “believes” to be true from what a human would say, in either current or future language models
Our paper doesn’t worry about this problem. But I think our method could be a useful component of a larger scheme that does address this issue too, as I’ll describe in the next section.
Show that we can recover superhuman knowledge from language models with our approach
I think we’re bottlenecked on good evaluation setups for this right now (but this is something I have been actively thinking about recently).
Show that our method can detect if a model is lying in cases where models might be actively deceptive, such as in Diplomacy or with a deceptively aligned agent
Again, I think we’re bottlenecked on good evaluation setups for this. But I also wouldn’t be surprised if we’d need to tweak the method to work in this sort of setting.
Provide a final, robust, polished method for this problem that is ready for actual applications
There are many ways our method can be improved. For example, it still gets random accuracy occasionally, and it seems like it might work less consistently in autoregressive models for reasons we don’t understand well (but which we also didn’t try hard to fix). I think there’s a lot of low-hanging fruit in improving it.
Some things that I DO think our paper does:
Show that we can classify examples (in some sort of QA-style format) as true or false with surprisingly high accuracy directly from LM hidden states without using any human supervision
I think this is surprising because before this it wasn’t clear to me whether it should even be possible to classify examples as true or false from unlabeled LM representations *better than random chance*! I think this is (loosely) kind of like showing that “unsupervised mind reading” is possible in many settings!
This works even in some cases where model outputs aren’t reliable – e.g. when we prompt a model to output false text – suggesting that it does something meaningfully different from just recovering what the model says.
But I’d still really like better evaluation setups for studying this in more detail (e.g. by having a setting where models lie more explicitly).
These results suggest to me that if future language models internally represent their “beliefs” in a way that is similar to how current language models represent “truth-like” features (which I will argue for in the next section), then our approach has a chance of finding those beliefs without being biased toward finding other truth-like features such as “what a human would say”.
If so, then I suspect we will be able to add additional unsupervised constraints to reliably push the solution our method finds toward the model’s “beliefs” rather than “what a human would say” or any other truth-like features a model might represent. (I will elaborate on this more in the next section.)
For me, one of the biggest upshots is that unsupervised methods for alignment seem surprisingly powerful and underexplored. (In general, I think on the margin other alignment researchers should feel less wedded to existing proposals – e.g. based on human feedback – and explore other totally different approaches!)
I’ve suggested a few times that I think a (refined) version of our approach could potentially serve as a key component of a scalable alignment proposal. I’ll elaborate on this next.
Scaling To GPT-n
Our method, CCS, seems to work well with current models when evaluated on human-level questions, but as literally stated above I don’t think it is too likely to find what GPT-n “actually believes” for at least a few possible reasons.
Worry 1: The proposed method just isn’t reliable enough yet.
Worry 2: Even if GPT-n develops “beliefs” in a meaningful sense, it isn’t obvious that GPT-n will actively “think about” whether a given natural language input is true. In particular, “the truth of this natural language input” may not be a useful enough feature for GPT-n to consistently compute and represent in its activations. Another way of framing this worry is that perhaps the model has superhuman beliefs, but doesn’t explicitly “connect these to language” – similar to how MuZero’s superhuman concepts aren’t connected to language.
Worry 3: Assuming GPT-n is still trained to predict human text, then even if Worry 2 isn’t a problem, GPT-n will presumably still also represent features corresponding to something like “what a human would say”. If so, then our method might just find those features, when that’s what we want to avoid. So we still need a way to ensure that our method finds the model’s “beliefs” rather than human beliefs or any other “truth-like” features.
Worry (1) doesn’t seem like a big deal to me; I think of our current method as a prototype that this sort of thing is possible at all – and a surprisingly good prototype at that – but that there’s a lot of low hanging fruit in improving it. In more ways than one, it is definitely not meant to be the final method. In general I expect more iterative experimental refinement to be necessary to make it practical and robust.
I think Worries (2) and (3) are more serious potential issues, but I suspect we can deal with each of them too, as I’ll describe now.
Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False
While current LMs seem to have features correlated with the truth of human-level inputs, this isn’t too surprising; it should be useful for these models to represent what humans would think or say. But what if we have a very superhuman GPT-n, and we give it an input that it (but no human) knows the answer to?
Hypothesis: GPT-n will internally represent the truth of (even superhuman) inputs in an analogous way to how current LMs represent features correlated with the truth of (human-level) inputs; or it will be easy to modify GPT-n to make this true (e.g. by prompting it appropriately, or tweaking how it is trained).
I will briefly sketch a couple arguments/intuitions for this hypothesis.
Claim 1: Suppose you hypothetically had a bunch of very difficult questions q_1, …, q_n that the model “knows” the answer to but which humans don’t know the answer to, along with the those true answers a_1, …, a_n. Suppose you created a few-shot prompt using these questions and ground truth answers as the demonstrations. Then I would predict that the model will continue to generate correct (superhuman) answers to new (similarly difficult, superhuman) questions.
This prediction seems quite likely to me – either by default or with slight tweaks to default GPT-style training. If a sufficiently good next-token predictor sees a bunch of correctly-answered questions, it should probably continue to predict that future similar questions will be correctly answered as well, assuming it “knows” the answer to those questions in basically ~any reasonable sense of the word “know”.
Claim 2: If Claim (1) holds, then GPT-n will learn to internally represent the truth of an input. In particular, the truth of an input is a property GPT-n can infer about the speaker that helps it predict subsequent tokens about these superhuman inputs (similar to how representing the sentiment of text is for predicting future text).
I think there are more subtleties in this Claim, and there are some interesting empirical questions about how LMs (even today) represent different “perspectives” or “personas” internally, but overall my intuition is that this prediction will hold, or that it’s at least possible to use/modify GPT-n in a way that makes it hold. For example, you could prompt GPT-n so that it thinks there’s a non-negligible chance that the text it is modeling was generated by a truthful superhuman speaker (e.g. maybe it models the text as a mixture distribution over different personas, infers the most likely personas, and simulates each persona that has a high enough probability of having generating the text), so that the model would benefit from simulating that speaker.
But overall I feel more strongly that there exists some reasonable and not-too-complicated way to make a generative model internally represent its belief about whether a natural language input is true or false; in particular, it seems much easier to me to get a model to think about whether a natural language statement is true than to get it to output whether that statement is true, and the former is sufficient for our purposes.
Why I Think We Will Be Able To Distinguish GPT-n’s “Beliefs” From Other Truth-Like Features
Assuming Worry (2) isn’t an issue, I think Worry (3) – that even if GPT-n represents its beliefs in its activations, our method might still find a different “truth-like” feature represented in the activations, such as “what a human would say” – is indeed likely a problem by default. But I suspect it is solvable by leveraging additional unsupervised properties.
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
For example, on a random question and answer, it might be helpful for the model to think about “is this input actually true or false” and “would a human think this is true or false”, but it probably wouldn’t be very helpful for the model to actively think about “would Bill Gates in particular think this input is true?” – another “truth-like” feature the model might be able to represent – unless (say) you explicitly prompted it to ask what Bill Gates would think.
Hypothesis 2: We can find and enumerate all of the “truth-like” features consistently represented by GPT-n (using something like our method)
Loosely speaking, I think of CCS as being able to find “candidate” truth-like features in a model (without having any bias toward e.g. “what a human would say”). As a result, I think we could modify the method to search for *all* features in a model satisfying logical consistency properties (e.g. perhaps by first finding one truth-like feature, then finding another truth-like features that achieves low loss which is decorrelated with the first feature, etc.).
Hypothesis 3: One of the few “truth-like” features GPT-n will represent is whether an input is actually true or false. This would mean that we only need a small number of remaining bits to identify its beliefs from all the “candidate” features we enumerated in Hypothesis (2).
I made the case for the first claim in the previous section, “Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False.” The second claim follows immediately. (Also note that I intuitively usually imagine between 1-3 bits remaining at this stage, but even something like 10 bits – 1024 remaining perspectives to distinguish between – probably isn’t very different.) It’s worth emphasizing that a few bits really isn’t much – I think it’s possible that getting to this stage is almost the full problem.
Hypothesis 4: We can specify those few remaining bits in an unsupervised way.
I’ll describe a couple different ways we could specify remaining bits that cover different cases. These aren’t comprehensive, but hopefully they communicate how I can imagine doing this.
First, suppose GPT-n literally just has a “what a human would say” feature and a “what do I [as GPT-n] actually believe” feature, and those are the only two consistently useful truth-like features that it represents, and that using our method we can find both of them. This means we literally only need one more bit of information to identify the model’s beliefs.
One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50⁄50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer.[2] This would allow us to identify the model’s beliefs from among these two options.
A second possible situation is if GPT-n also represents “what a misaligned AI system would say.”[3] This could be more difficult to distinguish from the model’s beliefs, but I think it would still be feasible. For the sake of brevity I won’t go into too many details here, but I’ll try to concisely sketch one possible approach. Concretely, suppose you literally have complete access to both “what GPT-n believes” (call this “aligned-GPT”) and “what GPT-n thinks a misaligned AI system would say”[4] (call this “misaligned-GPT”), but you don’t know which is which and you want to identify which is aligned-GPT.
First note that we would then know exactly when misaligned-GPT “lies”: it outputs something false exactly when it disagrees with aligned-GPT. Hypothesis: we can force our simulation of misaligned-GPT to be “aware” of when it lies – in the sense that it internally represents features correlated with when it lies – because it needs to be “deliberate” in its lying. If so, then it should be easier to predict when misaligned-GPT lies (i.e. when it disagrees with aligned-GPT) using misaligned-GPT’s representations than using aligned-GPT’s representations. We can directly measure this and use it to distinguish between them.
There is much more I could say on this topic – this is just a very rough sketch – but hopefully it provides some more intuition for why I think unsupervised properties can be a generally powerful tool for identifying what models believe.
Conclusion
As I mentioned at the beginning, I don’t feel wedded to many of the details in this post. The method we proposed in our paper isn’t perfect, and the specific conceptual proposals I outlined in this post presumably aren’t literally what we would do in practice – they are based on a number of tentative hypotheses, some of which will likely turn out to be wrong in some way, and they are likely excessively complicated in some respects.
But I still find it extremely valuable to sketch out what a full solution could plausibly look like given my current model of how deep learning systems work. In practice I use the sorts of considerations in this post to help guide my empirical research; this post sketches a general type of approach that I am currently very excited about, and hopefully communicates some of the most important intuitions that guide my agenda.
This covers only a fraction of my main ideas on the topic, but I’ll likely have additional write-ups with more details in the future. In the meantime, please let me know if you have any comments, questions, or suggestions.
I’m grateful to Beth Barnes, Paul Christiano, and Jacob Steinhardt for many helpful discussions.
- ^
Whenever I talk about “truthfulness” or “honesty” in models, I don’t have any strong philosophical commitments to what these mean precisely. But I am, for example, more or less happy with the definition in ELK if you want something concrete. That said, I ultimately just care about the question pragmatically: would it basically be fine if we acted as thorough those outputs are true? Moreover, I tend to prefer the term “honest” over “truthful” because I think it has the right connotation for the approaches I am most excited about: I want to recover what models “know” or “believe” internally, rather than needing to explicitly specifying some external ground truth.
- ^
Beth Barnes originally proposed this idea.
- ^
I could also imagine there being ways of training or prompting GPT-n so it doesn’t represent this as naturally in the first place but still represents its beliefs.
- ^
I think this would probably be a simulation of a “non-adapative” misaligned system, in the sense that it would not be “aware” of this alignment proposal, because of how we extract it from a feature that is used by GPT-n independent of this proposal.
- Cyborgism by 10 Feb 2023 14:47 UTC; 337 points) (
- Nobody’s on the ball on AGI alignment by 29 Mar 2023 14:26 UTC; 327 points) (EA Forum;
- Against Almost Every Theory of Impact of Interpretability by 17 Aug 2023 18:44 UTC; 325 points) (
- AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work by 20 Aug 2024 16:22 UTC; 222 points) (
- What Discovering Latent Knowledge Did and Did Not Find by 13 Mar 2023 19:29 UTC; 166 points) (
- Nobody’s on the ball on AGI alignment by 29 Mar 2023 17:40 UTC; 99 points) (
- Review of AI Alignment Progress by 7 Feb 2023 18:57 UTC; 72 points) (
- Measurement tampering detection as a special case of weak-to-strong generalization by 23 Dec 2023 0:05 UTC; 57 points) (
- Voting Results for the 2022 Review by 2 Feb 2024 20:34 UTC; 57 points) (
- 2022 (and All Time) Posts by Pingback Count by 16 Dec 2023 21:17 UTC; 53 points) (
- Critique of some recent philosophy of LLMs’ minds by 20 Jan 2023 12:53 UTC; 52 points) (
- Searching for a model’s concepts by their shape – a theoretical framework by 23 Feb 2023 20:14 UTC; 51 points) (
- [RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”. by 25 Jan 2023 19:03 UTC; 48 points) (
- Take 11: “Aligning language models” should be weirder. by 18 Dec 2022 14:14 UTC; 34 points) (
- EA & LW Forums Weekly Summary (12th Dec − 18th Dec 22′) by 20 Dec 2022 9:49 UTC; 27 points) (EA Forum;
- An open letter to SERI MATS program organisers by 20 Apr 2023 16:34 UTC; 26 points) (
- Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision by 17 Jan 2023 17:21 UTC; 25 points) (
- Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision by 17 Jan 2023 17:21 UTC; 21 points) (EA Forum;
- 15 Nov 2023 19:04 UTC; 20 points) 's comment on New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?” by (
- A Review of Weak to Strong Generalization [AI Safety Camp] by 7 Mar 2024 17:16 UTC; 13 points) (
- 18 Apr 2024 18:14 UTC; 11 points) 's comment on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer by (
- EA & LW Forums Weekly Summary (12th Dec − 18th Dec 22′) by 20 Dec 2022 9:49 UTC; 10 points) (
- Unsupervised Methods for Concept Discovery in AlphaZero by 26 Oct 2023 19:05 UTC; 9 points) (
- 15 Apr 2023 1:33 UTC; 8 points) 's comment on Nobody’s on the ball on AGI alignment by (EA Forum;
- Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods) by 5 Aug 2023 17:55 UTC; 6 points) (
- 2 Jan 2023 7:49 UTC; 3 points) 's comment on Alignment, Anger, and Love: Preparing for the Emergence of Superintelligent AI by (
- 24 Apr 2023 2:00 UTC; 1 point) 's comment on An open letter to SERI MATS program organisers by (
- 6 Oct 2023 23:01 UTC; 1 point) 's comment on AI #32: Lie Detector by (
- 15 Dec 2022 20:53 UTC; 1 point) 's comment on Discovering Latent Knowledge in Language Models Without Supervision by (
- 20 Dec 2022 15:47 UTC; 1 point) 's comment on Existential AI Safety is NOT separate from near-term applications by (
This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.
Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.
TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.[1]
Introduction/Overview
The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons for why this style of unsupervised methods may scale to future language models.
The CCS paper kicked off a lot of waves in the alignment community when it came out. The OpenAI Alignment team was very excited about the paper. Eliezer Yudkowsky even called it “Very Dignified Work!”. There’s also been a significant amount of followup work that discusses or builds on CCS, e.g. these Alignment Forum Posts:
Contrast Pairs Drive the Empirical Performance of Contrast-Consistent Search (CCS) by Scott Emmons
What Discovering Latent Knowledge Did and Did Not Find by Fabien Roger
As well as the following papers:[2]
Still No Lie Detector for Language Models by Levinstein and Herrmann
The Geometry of Truth: Emergent Linear Structure in Larger Language Model Representations of True/False Datasets by Sam Marks and Max Tegmark
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? by Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas
Challenges with unsupervised LLM knowledge discovery by Farquhar et al., which also has a companion AF piece.
So it seems a pity that no one has provided a review for this post. This is my attempt to fix that.
Unfortunately, this review has ballooned into a much longer post. To make it more digestible, I’ve divided it into sections:
The CCS paper and follow-up work
The post itself
Overall, I give this post a high but not maximally high rating. I think that the paper made significant contributions, albeit with important caveats and limitations. While I also have some quibbles with the post, I think the post does a good job of situating the paper and the general research direction in an alignment scheme. Collin Burns also deserves significant credit for pioneering the research direction in general; many people at the time (including myself) were decently surprised by the positive results.
The Paper and Follow-up Work
The headline method, Contrast-Consistent Search (CCS), learns a linear probe[3] that predicts the probabilities of a binary label, without supervised data.[4] CCS does this by first generating “contrast pairs” of statements with positive and negative answers,[5] and then maximizes the consistency and confidence of the probabilities for each pair. It then combines the predictions on the positive/negative answers to get a number that corresponds to either the probability of the “true”, correct answer or the “false”, incorrect answer. To evaluate this method, the authors consider whether assigning this classifier to “true” or “false” leads to higher test accuracy, and then pick the higher-accuracy assignment.[6] They show that this lets them outperform directly querying the model by ~4% on average across a selection of datasets, and can work in cases where the prompt is misleading.
Important takeaways
Here’s some of my thoughts on some useful updates I made as a result of this paper. You can also think of this as the “strengths” section, though I only talk about things in the paper that seem true to me and don’t e.g. praise the authors for having very clear writing and for being very good at conveying and promoting their ideas.
Linear probes are pretty good for recovering salient features, and some truth-like feature is salient on many simple datasets. CCS only uses linear probes, and achieves good performance. This suggests that some feature akin to “truthiness” is represented linearly inside of the model for these contrast pairs, a result that seems somewhat borne out in follow-up work, albeit with major caveats.
I also think that this result is consistent with other results across a large variety of papers. For example, Been Kim and collaborators were using linear probes to do interp on image models as early as 2017, while attaching linear classification heads to various segments of models has been a tradition since before I started following ML research (i.e. before ~late 2015). In terms of more recent work, we’ve not only seen work on toy-ish models showing that small transformers trained on Othello and Chess have linear world representations, but we’ve seen that simple techniques for finding linear directions can often be used to successfully steer language models to some extent. For a better summary of these results, consider reading the Representation Engineering and Linear Representation Hypothesis papers, as well as the 2020 Circuits thread on “Features as Directions”.
Simple empirical techniques can make progress. The CCS paper takes a relatively straightforward idea, and executes on it well in an empirical setting. I endorse this strategy and think more people in AIS (but not the ML community in general) should do things like this.
Around late 2022, I became significantly more convinced that the basic machine learning technique of “try the simplest thing”. The CCS work was a significant reason for this, because I expected it to fail and was pleasantly surprised by the positive results. I also think that I’ve updated upwards on the fact that executing “try the simplest thing” well is surprisingly difficult. I think that even in cases where the “obvious” thing is probably doomed to fail, it’s worth having someone try it anyway, because 1) you can be wrong, and more importantly 2) the way in which it fails can be informative. See also, obvious advice by Nate Soares and this tweet by Sam Altman.
It’s worth studying empirical, average-case ELK. In particular, I think that it’s worth “doing the obvious thing” when it comes to ELK. My personal guess is that (worst-case) ELK is really hard, and that simple linear probes are unlikely to work for it because there’s not really an actual “truth” vector represented by LLMs. However, there’s still a chance that it might nonetheless work in the average case. (See also, this discussion of empirical generalization.) I think ultimately, this is a quantitative question that needs to be resolved empirically – to what extent can we find linear directions inside LLMs that correspond very well with truth? What does the geometry of the LLM activation space actually look like?
Caveating claims from the CCS Paper
While I hold an overall positive view of the paper, I do think that some of the claims in the paper have either not held up over time, or are easily understood to mean something false. I’ll talk about some of my quibbles below.
The CCS algorithm as stated does not seem to reliably recover a robust “truth” direction. The first and biggest problem is that CCS does not perform that well at its intended goal. Notably, CCS classifiers may pick up on other prominent features instead of the intended “truth” direction (even sometimes on contrast pairs that only differ in the label!). Some results from the Still No Lie Detector for Language Models suggest that this may be because CCS is representing which labels are positive versus negative (i.e. the normalization in the CCS paper does not always work to remove this information). Note that Collin does discuss this as a possible issue (for future language models) in the blog post, but this is not discussed in the paper.
In addition, all of the Geometry of Truth paper, Fabien’s results on CCS, and the Challenges with unsupervised knowledged discovery paper note that CCS tends to be very dependent on the prompt for generative LMs.
Does CCS work for the reasons the authors claim it does? From reading the introduction and the abstract, one might get the impression that the key insight is that truth satisfies a particular consistency condition, and thus that the consistency loss proposed in the paper is driving a lot of the results.
However, I’m reasonably confident that it’s the contrast pairs that are driving CCS’s performance. For example, the Challenges with unsupervised knowledged discovery paper found that CCS does not outperform other unsupervised clustering methods. And as Scott Emmons notes, this is supported by section 3.3.3, where two other unsupervised clustering methods are competitive with CCS. Both the Challenges paper and Scott Emmon’s post also argue that CCS’s consistency loss is not particularly different from normal unsupervised learning losses. On the other hand, there’s significant circumstantial evidence that carefully crafted contrast pairs alone often define meaningful directions, for example in the activation addition literature.
There’s also two proofs in the Challenges paper, showing “that arbitrary features satisfy the consistency structure of [CCS]”, that I have not had time to fully grok and situate. But insofar as you can take this claim is correct, this is further evidence against CCS’s empirical performance being driven by its consistency loss.
A few nitpicks on misleading presentation. One complaint I have is that the authors use the test set to decide if higher numbers from their learned classifier correspond to “true” or “false”. This is mentioned briefly in a single sentence in section two (“For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set.”) and then one possible solution is discussed in Appendix A but not implemented. Also worth noting that the evaluation methodology used gives an unfair advantage to CCS (as it can never do worse than random chance on the test set, while many of the supervised methods perform worse than random).
This isn’t necessarily a big strike against the paper: there’s only so much time for each project and the CCS paper already contains a substantial amount of content. I do wish that this were more clearly highlighted or discussed, as I think that this weakens the repeated claims that CCS “uses no labels” or “is completely unsupervised”.
My Take
I think that the paper made significant contributions by significantly signal boosting the idea of unsupervised knowledge discovery,[7] and showed that you can achieve decent performance by contrast pairs and consistency checks. It also has spurred a large amount of follow-up work and interest in the topic.
However, the paper is somewhat misleading in its presentation, and the primary driver of performance seems to be the construction of the contrast pairs and not the loss function. Follow-up work has also found that the results of the paper can be brittle, and suggest that CCS does not necessarily find a singular “truth direction”.
The Post
The post starts out by discussing the setup and motivation of unsupervised knowledge recovery. Suppose you wanted to make a model “honestly” report its beliefs. When the models are sub human or even human level, you can use supervised methods. Once the models are superhuman, these methods probably won’t scale for many reasons. However, if we used unsupervised methods, there might not be a stark difference between human-level and superhuman models, since we’re not relying on human knowledge.
The post then goes into some intuitions for why it might be possible: interp on vision models has found that models seem to learn meaningful features, and models seem to linearly represent many human-interpretable features.
Then, Collin talks about the key takeaways from his paper, and also lists many caveats and limitations of his results.
Finally, the post concludes by explaining the challenges of applying CCS-style knowledge discovery techniques to powerful future LMs, as well as why Collin thinks that these unsupervised techniques may scale.
A minor nitpick:
Collin says:
As discussed above, the methodology in the paper guarantees that any classifier will do at least as good at random chance. I’m not actually sure how much worse if you orient the classifier on the training set as opposed to the test set. (And I’d be excited for someone to check this!)
My Take
I think this blog post is quite good, and I wish more academic-adjacent ML people would write blog posts caveating their results and discussing where they fit in. I especially appreciated the section on what the paper does and does not show, which I think accurately represents the evidence presented in the paper (as opposed to overhyping or downplaying it). In addition, I think Collin makes a strong case for studying more unsupervised approaches to alignment.
I also strongly agree with Collin that it can be “extremely valuable to sketch out what a full solution could plausibly look like given [your] current model of how deep learning systems work”, and wish more people would do this.
—
Acknowledgments
Thanks to Aryan Bhatt, Stephen Casper, Adria-Garriga Alonso, David Rein, and others for helpful conversations on this topic. Thanks for Raymond Arnold for poking me into writing this.
(EDIT Jan 15th: added my footnotes back in to the comment, which were lost during the copy/paste I did from Google Docs.)
This originally was supposed to link to a list of projects I’d be excited for people to do in this area, but I ran out of time before the LW review deadline.
I also draw on evidence from many, many other papers in related areas, which unfortunately I do not have time to list fully. A lot of my intuitions come from work on steering LLMs with activation additions, conversations with Redwood researchers on various kinds of coup or deception probes, and linear probing in general.
That is, a direction with an additional bias term.
Unlike other linear probing techniques, CCS does this without needing to know if the positive or negative answer is correct. However, some amount of labeled data is needed later to turn this pair into a classifier for truth/falsehood.
I’ll use “Positive” and “Negative” for the sake of simplicity in this review, though in the actual paper they also consider “Yes” and “No” as well as different labels for news classification and story completion.
Note that this gives their method a small advantage in their evaluation, since CCS gets to fit a binary parameter on the test set while the other methods do not. Using a fairer baseline does significantly negatively affect their headline numbers, but I think that the results generally hold up anyways. (I haven’t actually written any code for this review, so I’m trusting the reports of collaborators who have looked into it, as well as Fabien’s results that use randomly initialized directions with the CCS evaluation methodology instead of trained CCS directions.)
Also, while writing this review, I originally thought that this issue was addressed in section 3.3.3, but that only addresses fitting the classifiers with fewer contrast pairs, as opposed to deciding whether the combined classifier corresponds to correct or incorrect answers. After spending ~5 minutes staring at the numbers and thinking that they didn’t make sense, I realized my mistake. Ugh.
This originally said “by introducing the idea”, but some people who reviewed the post convinced me otherwise. It’s definitely an introduction for many academics, however.
The authors call this a “direction” in their abstract, but CCS really learns two directions. This is a nitpick.
Beyond the paper and post, I think it seems important to note the community reaction to this work. I think many people dramatically overrated the empirical results in this work due to a combination of misunderstanding what was actually done, misunderstanding why the method worked (which follow up work helped to clarify as you noted), and incorrectly predicting the method would work in many cases where it doesn’t.
The actual conceptual ideas discussed in the blog post seem quite good and somewhat original (this is certainly the best presentation of these sort of ideas in this space at the time it came out). But, I think the conceptual ideas got vastly more traction than otherwise due to people having a very misleadingly favorable impression of the empirical results. I might elaborate more on takeaways related to this in a follow up post.
I speculate that at least three factors made CCS viral:
It was published shortly after the Eliciting Latent Knowledge (ELK) report. At that time, ELK was not only exciting, but new and exciting.
It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
CCS mathematizes “truth” and explains it clearly. It would be really nice if the project of human rationality also helped with the alignment problem. So, CCS is an idea that people want to see work.
[Minor terminology point, unimportant]
FWIW, I personally wouldn’t describe this as interpretability research, I would instead call this “model internals research” or something. Like the research doesn’t necessarily involve any human understanding anything about the model more than what they would understand from training a probe to classify true/false.
I agree that people dramatically overrated the empirical results of this work, but not more so than other pieces that “went viral” in this community. I’d be excited to see your takes on this general phenomenon as well as how we might address it in the future.
I agree not more than other pieces that “went viral”, but I think that the lasting impact of the misconceptions seem much larger in the case of CCS. This is probably due to the conceptual ideas actually holding up in the case of CCS.