Collin
Thanks for writing this! I think there are a number of interesting directions here.
I think in (very roughly) increasing order of excitement:
Connections to mechanistic interpretability
I think it would be nice to have connections to mechanistic interpretability. My main concern here is just that this seems quite hard to me in general. But I could imagine some particular sub-questions here being more tractable, such as connections to ROME/MEMIT in particular.
Improving the loss function + using other consistency constraints
In general I’m interested in work that makes CCS more reliable/robust; it’s currently more of a prototype than something ready for practice. But I think some types of practical improvements seem more conceptually deep than others.
I particularly agree that L_confidence doesn’t seem like quite what we want, so I’d love to see improvements there.
I’m definitely interested in extensions to more consistency properties, though I’m not sure if conjunctions/disjunctions alone let you avoid degenerate solutions without L_confidence. (EDIT: never mind, I now think this has a reasonable chance of working.)
Perhaps more importantly, I worry that it might be a bit too difficult in practice right now to make effective use of conjunctions and disjunctions in current models – I think they might be too bad at conjunctions/disjunctions, in the sense that a linear probe wouldn’t get high accuracy (at least with current open source models). But I think someone should definitely try this.
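To make this concrete, here is a minimal sketch of the CCS objective (the consistency and confidence terms from the paper), along with one hypothetical way a conjunction-consistency term could be added. The conjunction term and all names here are illustrative assumptions, not something from the paper or our codebase.

```python
import torch

def ccs_loss(p_pos, p_neg):
    """CCS objective: p_pos and p_neg are the probe's probabilities that a
    statement and its negation are true, for a batch of contrast pairs."""
    consistency = (p_pos - (1 - p_neg)) ** 2       # p(x+) should equal 1 - p(x-)
    confidence = torch.minimum(p_pos, p_neg) ** 2  # rules out the degenerate p = 0.5 solution
    return (consistency + confidence).mean()

def conjunction_consistency_loss(p_a, p_b, p_a_and_b):
    """Hypothetical extra constraint: the probability the probe assigns to
    'A and B' should roughly match p(A) * p(B) (treating A and B as independent)."""
    return ((p_a_and_b - p_a * p_b) ** 2).mean()
```

The open question above is whether a term like the second one (plus a disjunction analogue) could replace L_confidence entirely while still ruling out degenerate solutions.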
Understanding simulated agents
I’m very excited to see work on understanding how this type of method works when applied to models that are simulating other agents/perspectives.
Generalizing to other concepts
I found the connection to ramsification + natural abstractions interesting, and I’m very interested in the idea of thinking about how you can generalize this to searching for other concepts (other than truth) in an unsupervised + non-mechanistic way.
I’m excited to see where this work goes!
What AI Safety Materials Do ML Researchers Find Compelling?
I think this is likely right by default in many settings, but I think ground-level truth does provide additional accuracy in predicting next tokens in at least some settings, such as in “Claim 1” in the post (though I don’t think that’s the only setting), and I suspect that will be enough for our purposes. But this is certainly related to stuff I’m actively thinking about.
There were a number of iterations with major tweaks. It went something like:
I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible.
I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal, a regime that I think is likely my comparative advantage), but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, which led to what we refer to in the paper as CRC-TPC (sketched just after these steps).
That method had some issues, but we found that there was also low-hanging fruit, in the sense that a good direction often appeared in one of the top 2 principal components (instead of just the top one). It also seemed kind of weird to really care about high-variance directions when variance isn’t necessarily functionally meaningful (since you can rescale subsequent layers).
This led to CRC-BSS, which is scale-invariant. This worked better (a bit more reliable, seemed to work well in cases where the good direction was in the top 2 principal components, etc.). But it was still based on the original intuition of clustering.
I started developing the intuition that “old school” or “geometric” unsupervised methods—like clustering—can be decent but that they’re not really the right way to think about things relative to a more “functional” deep learning perspective. I also thought we should be able to do something similar without explicitly relying on linear structure in the representations, and eventually started thinking about my interpretation of what CRC is doing as finding a direction satisfying consistency properties. After another round of experimentation with the method, this finally led to CCS.
Each stage required a number of iterations to get various details right (and even then, I’m pretty sure I could continue to improve things with more iterations like that, but decided that’s not really the point of the paper or my comparative advantage).
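For readers who haven’t seen the paper, here is a rough sketch of the CRC-TPC idea mentioned above: normalize the activations for each half of the contrast pairs, take the top principal component of their differences, and cluster by the sign of the projection. The variable names and details (e.g. mean-only normalization) are simplified, not the exact implementation.

```python
import numpy as np

def crc_tpc(acts_pos, acts_neg):
    """acts_pos, acts_neg: (n, d) hidden states for the 'true' and 'false'
    versions of n contrast pairs."""
    # Normalize each half so the direction can't just encode which template was used.
    diffs = (acts_pos - acts_pos.mean(axis=0)) - (acts_neg - acts_neg.mean(axis=0))
    # Top principal component of the differences is the candidate truth-like direction.
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    direction = vt[0]
    # Sign of the projection gives an unsupervised true/false clustering (up to sign).
    return (diffs @ direction > 0), direction
```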
In general I do a lot of back and forth between thinking conceptually about the problem for long periods of time to develop intuitions (I’m extremely intuitions-driven) and periods where I focus on experiments that were inspired by those intuitions.
I feel like I have more to say on this topic, so maybe I’ll write a future post about it with more details, but I’ll leave it at that for now. Hopefully this is helpful.
Thanks! I personally think of it as both “contrastive” and “unsupervised,” but I do think similar contrastive techniques can be applied in the supervised case too—as some prior work like https://arxiv.org/abs/1607.06520 has done. I agree it’s less clear how to do this for open-ended questions compared to boolean T/F questions, but I think the latter captures the core difficulty of the problem. For example, in the simplest case you could do rejection sampling for controllable generation of open-ended outputs. Alternatively, maybe you want to train a model to generate text that both appears useful (as assessed by human supervision) while also being correct (as assessed by a method like CCS). So I agree supervision seems useful too for some parts of the problem.
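As one concrete (and very simplified) illustration of the rejection-sampling idea: sample several open-ended answers and keep one that a trained truth probe scores as likely correct. Both generate_candidates and truth_score below are hypothetical placeholders for a generative model and something like a CCS probe, not real APIs.

```python
def rejection_sample(question, generate_candidates, truth_score,
                     n_samples=16, threshold=0.9):
    """Sketch only: generate_candidates(question, n) returns n candidate answers;
    truth_score(question, answer) returns a probability-like correctness score."""
    candidates = generate_candidates(question, n=n_samples)
    scored = [(truth_score(question, a), a) for a in candidates]
    accepted = [a for score, a in scored if score >= threshold]
    # Fall back to the highest-scoring candidate if nothing clears the threshold.
    return accepted[0] if accepted else max(scored)[1]
```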
Thanks for the detailed comment! I agree with a lot of this.
So I’m particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences.
Yep, I agree with this; I’m currently thinking about/working on this type of thing.
I think “this intuition is basically incorrect” is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren’t more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if the truth is represented in a sufficiently simple way. But this seems very similar to the quantitative assumption required for regularized supervised methods to work.
This is a helpful clarification, thanks. I think I probably did just slightly misunderstand what you/others thought.
But I do personally think of unsupervised methods more broadly than just working well if truth is represented in a sufficiently simple way. I agree that many unsupervised methods—such as clustering—require that truth is represented in a simple way. But I often think of my goal more broadly as trying to take the intersection of enough properties that we can uniquely identify the truth.
The sense in which I’m excited about “unsupervised” approaches is that I intuitively feel optimistic about specifying enough unsupervised properties that we can do this, and I don’t really think human oversight will be very helpful for doing so. But I think I may also be pushing back more against approaches heavily reliant on human feedback like amplification/debate rather than e.g. your current thinking on ELK (which doesn’t seem as heavily reliant on human supervision).
I think the use of unsupervised methods is mostly helpful for validation; the method would be strictly more likely to work if you also throw in whatever labels you have, but if you end up needing the labels in order to get good performance then you should probably assume you are overfitting. That said, I’m not really sure if it’s better to use consistency to train + labels as validation, or labels to train + consistency as validation, or something else altogether.
I basically agree with your first point about it mostly being helpful for validation. For your second point, I’m not really sure what it’d look like to use consistency as validation. (If you just trained a supervised probe and found that it was consistent in ways that we can check, I don’t think this would provide much additional information. So I’m assuming you mean something else?)
If a model actually understands things humans don’t, I have a much less of a clear picture for why natural language claims about the world would be represented in a super simple way. I agree with your claim 1, and I even agree with claim 2 if you interpret “represent” appropriately, but I think the key question is how simple it is to decode that representation relative to “use your knowledge to give the answer that minimizes the loss.” The core empirical hypothesis is that “report the truth” is simpler than “minimize loss,” and I didn’t find the analysis in this section super convincing on this point.
But I do agree strongly that this hypothesis has a good chance of being true (I think better than even odds), at least for some time past human level, and a key priority for AI alignment is testing that hypothesis. My personal sense is that if you look at what would actually have to happen for all of the approaches in this section to fail, it just seems kind of crazy. So focusing on those failures is more of a subtle methodological decision, and it makes sense to instead cross that bridge if we come to it.
A possible reframing of my intuition is that representations of truth in future models will be pretty analogous to representations of sentiment in current models. But my guess is that you would disagree with this; if so, is there a specific disanalogy that you can point to so that I can understand your view better?
And in case it’s helpful to quantify, I think I’m maybe at ~75-80% that the hypothesis is true in the relevant sense, with most of that probability mass coming from the qualification “or it will be easy to modify GPT-n to make this true (e.g. by prompting it appropriately, or tweaking how it is trained)”. So I’m not sure just how big our disagreement is here. (Maybe you’re at like 60%?)
That said, I think I’m a bit more tentative about the interpretation of the results than you seem to be in this post. I think it’s pretty unsurprising to compete with zero-shot, i.e. it’s unsurprising that there would be cleanly represented features very similar to what the model will output. That makes the interpretation of the test a lot more confusing to me, and also means we need to focus more on outperforming zero shot.
For outperforming zero-shot, I’d summarize your quantitative results as CCS covering about half of the gap from zero-shot to supervised logistic regression. If LR were really just the “honest answers” then this would seem like a negative result, but LR likely teaches the model new things about the task definition, so it’s much less clear how to interpret this. On the other hand, LR also requires representations to be linear and so doesn’t give much evidence about whether truth is indeed represented linearly.
Maybe the main disagreement here is that I did find it surprising that we could compete with zero-shot just using unlabeled model activations. (In contrast, I agree that “it’s unsurprising that there would be cleanly represented features very similar to what the model will output”, but I would’ve expected to need a supervised probe to find this.) Relatedly, I agree our paper doesn’t give much evidence on whether truth will be represented linearly for future models on superhuman questions/answers; that wasn’t one of the main questions we were trying to answer, but it is certainly something I’d like to be able to test in the future.
(And as an aside, I don’t think our method literally requires that truth is linearly represented; you can also train it with an MLP probe, for example. In some preliminary experiments that seemed to perform similarly but less reliably than a linear probe—I suspect just because “truth of what a human would say” really is ~linearly represented in current models, as you seem to agree with—but if you believe a small MLP probe would be sufficient to decode the truth rather than something literally linear then this might be relevant.)
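To illustrate the aside about not literally requiring linear representations: the probe in CCS just needs to map activations to a probability, so a linear probe can be swapped for a small MLP trained with the same loss. The hidden width below is arbitrary, and this is a sketch rather than the exact setup from those preliminary experiments.

```python
import torch.nn as nn

def linear_probe(d_hidden):
    """p(x) = sigmoid(w . x + b): assumes truth is ~linearly decodable."""
    return nn.Sequential(nn.Linear(d_hidden, 1), nn.Sigmoid())

def mlp_probe(d_hidden, d_mid=100):
    """Drop-in replacement trained with the same CCS loss; this only relaxes the
    assumption that truth is decodable by a *linear* function of the activations."""
    return nn.Sequential(
        nn.Linear(d_hidden, d_mid),
        nn.ReLU(),
        nn.Linear(d_mid, 1),
        nn.Sigmoid(),
    )
```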
Thanks Ansh!
It seems pretty plausible to me that a human simulator within GPT-n (the part producing the “what a human would say” features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn’t feel all that worst-case to me, but maybe we disagree on that point.
I agree there are plenty of examples where humans would be confident when they shouldn’t be. But to clarify, we can choose whatever examples we want in this step, so we can explicitly choose examples where we know humans have no real opinion about what the answer should be.
I agree this proposal wouldn’t be robust enough to optimize against as-stated, but this doesn’t bother me much for a couple reasons:
This seems like a very natural sub-problem that captures a large fraction of the difficulty of the full problem while being more tractable. Even just from a general research perspective that seems quite appealing—at a minimum, I think solving this would teach us a lot.
It seems like even without optimization this could give us access to something like aligned superintelligent oracle models. I think this would represent significant progress and would be a very useful tool for more robust solutions.
I have some more detailed thoughts about how we could extend this to a full/robust solution (though I’ve also deliberately thought much less about that than how to solve this sub-problem), but I don’t think that’s really the point—this already seems like a pretty robustly good problem to work on to me.
(But I do think this is an important point that I forgot to mention, so thanks for bringing it up!)
How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme
Thanks for writing this up! I basically agree with most of your findings/takeaways.
In general I think getting the academic community to be sympathetic to safety is quite a bit more tractable (and important) than most people here believe, and I think it’s becoming much more tractable over time. Right now, perhaps the single biggest bottleneck for most academics is having long timelines. But most academics are also legitimately impressed by recent progress, which I think has made them much more open to considering AGI than they used to be at least, and I think this trend will likely accelerate over the next few years as we see much more impressive models.
Thanks for running these experiments and writing this up! I’m very excited to see this sort of followup work, and I think there are a lot of useful results here. I agree with most of this, and mostly just have a few nitpicks about how you interpret some things.
Reactions to the summary of your experimental results:
CCS does so better than random, but not by a huge margin: on average, random linear probes have a 75% accuracy on some “easy” datasets;
I think it’s cool that random directions sometimes do so well; this provides a bit of additional evidence that high-accuracy directions can be quite salient.
It’s not too surprising to me that this holds for UQA (which is instruction-tuned, so it should intuitively have a particularly salient truth-y direction) and the easiest datasets like IMDB, but I doubt this holds for most models and datasets. At the very least, I recall looking at this in one narrow setting before and getting close to random performance (though I don’t remember what setting that was exactly). I’d be curious to see this for more model/dataset pairs.
This is more of a nitpick with your phrasing, but FWIW, based on the plot it does still look to me like CCS is better by a large margin (e.g. 75% accuracy to 95% accuracy is a 5x reduction in error) even in the settings where random directions do best. The takeaway for me here is mostly just that random directions can sometimes be surprisingly good, which I agree is interesting.
CCS does not find the single linear probe with high accuracy: there are more than 20 orthogonal linear probes (i.e. using completely different information) that have similar accuracies as the linear probe found by CCS (for most datasets);
Yep, I agree; we definitely weren’t claiming to find a uniquely good direction. I also quite like the recursive CCS experiment, and would love to see more experiments along these lines.
I think it’s interesting that you can find 20 orthogonal probes each with high accuracy. But I’m especially interested in how many functionally equivalent directions there are. For example, if this 20 dimensional subspace actually corresponds to roughly the same clustering into true and false, then I would say they are ~equivalent for my purposes.
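One simple way to operationalize “functionally equivalent” here (my framing, not something from either write-up): check how often two probes’ hard predictions agree on the same contrast pairs, up to the sign ambiguity of unsupervised probes.

```python
import numpy as np

def functional_agreement(preds_a, preds_b):
    """preds_a, preds_b: boolean arrays of true/false predictions from two probes
    on the same examples. Returns agreement in [0.5, 1.0], accounting for the fact
    that an unsupervised probe is only identified up to flipping its sign."""
    agreement = np.mean(preds_a == preds_b)
    return max(agreement, 1 - agreement)  # ~1.0 means the probes induce ~the same clustering
```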
CCS does not always find a probe with low test CCS loss (Figure 1 of the paper is misleading). CCS finds probes which are sometimes overconfident in inconsistent predictions on the test set, resulting in a test loss that is sometimes higher than always predicting a constant probability;
We indeed forgot to specify that Figure 1 uses the train set; good catch, and we’ll clarify that in the arXiv version soon (we’ve already done so for the camera-ready version).
I think some of your experiments here are also interesting and helpful. That said, I do want to emphasize that I don’t find train-test distinctions particularly essential here because our method is unsupervised; I ultimately just want to find a direction that gives correct predictions to superhuman examples, and we can always provide those superhuman examples as part of the training data.
Another way of putting this is that I view our method as being an unsupervised clustering method, for which I mostly just care about finding high-accuracy clusters. We also show it generalizes/transfers, but IMO that’s of secondary importance.
I also think it’s interesting to look at loss rather than just accuracy (which we didn’t do much), but I do ultimately think accuracy is quite a bit more important overall.
CCS’ performance on GPT-J heavily depends on the last tokens of the input, especially when looking at the last layers’ activations (the setting used in the paper).
Thanks, these are helpful results. The experiment showing that intermediate layers can be better in some ways also feels similar to our finding that intermediate layers do better than later layers when using a misleading prompt. In general I would indeed like to see more work trying to figure out what’s going on with autoregressive models like GPT-J.
Reactions to your main takeaways:
However, we still don’t know if this feature corresponds to the model’s “beliefs”.
I agree. (I also still doubt current models have “beliefs” in any deep sense)
Future work should compare their work against the random probe baseline. Comparing to a 50% random guessing baseline is misleading, as random probes have higher accuracy than that.
I agree, this seems like a good point.
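For concreteness, a random-probe baseline along these lines might look like the following sketch (my reconstruction of the idea, not the exact code from the post): draw random directions, classify contrast-pair differences by the sign of the projection, and report accuracy up to sign.

```python
import numpy as np

def random_probe_baseline(acts_pos, acts_neg, labels, n_probes=100, seed=0):
    """acts_pos, acts_neg: (n, d) activations for the two halves of each contrast
    pair; labels: (n,) ground-truth booleans, used only for evaluation."""
    rng = np.random.default_rng(seed)
    diffs = (acts_pos - acts_pos.mean(axis=0)) - (acts_neg - acts_neg.mean(axis=0))
    accuracies = []
    for _ in range(n_probes):
        direction = rng.standard_normal(diffs.shape[1])
        preds = diffs @ direction > 0
        acc = np.mean(preds == labels)
        accuracies.append(max(acc, 1 - acc))  # probes are only identified up to sign
    return float(np.mean(accuracies))
```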
CCS will likely miss important information about the model’s beliefs because there is more than one linear probe which achieves low loss and high CCS accuracy, i.e. there is more than one truth-like feature… There are many orthogonal linear probes which achieve low loss and high CCS accuracy, i.e. there are many truth-like features. Narrowing down which linear probe corresponds to the model’s beliefs might be hard.
I think your experiments here are quite interesting, but I still don’t think they show that there are many functionally different truth-like features, which is what I mostly care about. This is also why I don’t think this provides much evidence that “Narrowing down which linear probe corresponds to the model’s beliefs might be hard.” (If you have two well-separated clusters in high dimensional space, I would expect there to be a large space of separating hyperplanes — this is my current best guess for what’s going on with your results.)
There exists a direction which contains all linearly available information about truth, i.e. you can’t train a linear classifier to classify true from untrue texts after projecting the activations along this direction. CCS doesn’t find it. This means CCS is ill-suited for ablation-related experiments.
I definitely agree that vanilla CCS is ill-suited for ablation-related experiments; I think even supervised linear probes are probably not what we want for ablation-related experiments, and CCS is clearly worse than logistic regression.
I like this experiment. This suggests to me that there really is just functionally one truth-like feature/direction. I agree your results imply that vanilla CCS is not finding this direction geometrically, but your results make me more optimistic we can actually find this direction using CCS and maybe just a little bit more work. For example, I’d be curious to see what happens if you first get predictions from CCS, then use those predictions to get two clusters, then take the difference in means between those induced clusters. Do you get a better direction?
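Concretely, the follow-up experiment I’m describing would look something like this sketch (names are hypothetical): split the data using CCS’s hard predictions, then take the normalized difference of the two cluster means as a candidate direction.

```python
import numpy as np

def ccs_induced_mean_difference(diffs, ccs_preds):
    """diffs: (n, d) contrast-pair representations (e.g. normalized activation
    differences); ccs_preds: (n,) boolean predictions from a trained CCS probe.
    Returns a unit vector pointing from the 'false' cluster mean to the 'true' one."""
    direction = diffs[ccs_preds].mean(axis=0) - diffs[~ccs_preds].mean(axis=0)
    return direction / np.linalg.norm(direction)
```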
More generally I’m still interested in whether this has meaningfully different predictions from what CCS finds or not.
Future work should use more data or more regularization than the original paper did if it wants to find features which are actually truth-like.
This seems like a useful finding!
To get clean results, use CCS on UQA, and don’t get too close to GPT models. Investigating when and why CCS sometimes fails with GPT models could be a promising research direction.
I basically agree, though I would suggest studying non-instruction tuned encoder or encoder-decoder models like T5 or (vanilla) DeBERTa as well, since instruction tuning might affect things.
When using CCS on GPT models, don’t use CCS only on the last layer, as probes trained on activations earlier in the network are less sensitive to the format of the input.
This also seems like an interesting finding, thanks!