Thanks for writing, I mostly agree. I particularly like the point that it’s exciting to study methods for which “human level” vs “subhuman level” isn’t an important distinction. One of my main reservations is that this distinction can be important for language models because the pre-training distribution is at human level (as you acknowledge).
I mostly agree with your assessment of difficulties and am most concerned about worry 2, especially once we no longer have a pre-training distribution anchoring their beliefs to human utterances. So I’m particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences. I agree with you that “other truth-like features” don’t seem like a big problem if you are indeed able to find a super-simple representation of truth (e.g. linear probes seem way simpler than “answers calculated to look like the truth,” and anything short of that seems easily ruled out by more stringent consistency checks).
I think the main reason unsupervised methods haven’t been seriously considered within alignment so far, as far as I can tell, is because of tractability concerns. It naively seems kind of impossible to get models to (say) be honest or truthful without any human supervision at all; what would such a method even look like?
To me, one of the main contributions of our paper is to show that this intuition is basically incorrect and to show that unsupervised methods can be surprisingly effective.
I think “this intuition is basically incorrect” is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren’t more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if the truth is represented in a sufficiently simple way. But this seems very similar to the quantitative assumption required for regularized supervised methods to work.
For example, if truth is represented linearly, then “answer questions honestly” is much simpler than almost any of the “cheating” strategies discussed in Eliciting Latent Knowledge, and I think we are in agreement that under these conditions it will be relatively easy to extract the model’s true beliefs. So I feel like the core question is really about how simply truth is represented within a model, rather than about supervised vs unsupervised methods.
I agree with your point that unsupervised methods can provide more scalable evidence of alignment. I think the use of unsupervised methods is mostly helpful for validation; the method would be strictly more likely to work if you also throw in whatever labels you have, but if you end up needing the labels in order to get good performance then you should probably assume you are overfitting. That said, I’m not really sure if it’s better to use consistency to train + labels as validation, or labels to train + consistency as validation, or something else altogether. Seems good to explore all options, but this hopefully helps explain why I find the claim in the intro kind of overstated.
Why I Think (Superhuman) GPT-n Will Represent Whether an Input is Actually True or False
If a model actually understands things humans don’t, I have much less of a clear picture for why natural language claims about the world would be represented in a super simple way. I agree with your claim 1, and I even agree with claim 2 if you interpret “represent” appropriately, but I think the key question is how simple it is to decode that representation relative to “use your knowledge to give the answer that minimizes the loss.” The core empirical hypothesis is that “report the truth” is simpler than “minimize loss,” and I didn’t find the analysis in this section super convincing on this point.
But I do agree strongly that this hypothesis has a good chance of being true (I think better than even odds), at least for some time past human level, and a key priority for AI alignment is testing that hypothesis. My personal sense is that if you look at what would actually have to happen for all of the approaches in this section to fail, it just seems kind of crazy. So focusing on those failures is more of a subtle methodological decision and it makes sense to instead cross that bridge if we come to it.
I liked your paper in part as a test of this hypothesis, and I’m very excited about future work that goes further.
That said, I think I’m a bit more tentative about the interpretation of the results than you seem to be in this post. I think it’s pretty unsurprising to compete with zero-shot, i.e. it’s unsurprising that there would be cleanly represented features very similar to what the model will output. That makes the interpretation of the test a lot more confusing to me, and also means we need to focus more on outperforming zero shot.
For outperforming zero shot I’d summarize your quantitative results as CCS covering about half of the gap from zero-shot to supervised logistic regression. If LR was really just the “honest answers” then this would seem like a negative result, but LR likely teaches the model new things about the task definition and so it’s much less clear how to interpret this. On the other hand, LR also requires representations to be linear and so doesn’t give much evidence about whether truth is indeed represented linearly.
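To make the "about half of the gap" framing concrete, here is a minimal sketch of the arithmetic. The function name and the accuracies are hypothetical, chosen only to illustrate the calculation; they are not numbers from the paper.

```python
def fraction_of_gap_closed(zero_shot, method, supervised):
    """Share of the zero-shot-to-supervised accuracy gap that `method` recovers."""
    gap = supervised - zero_shot
    if gap <= 0:
        raise ValueError("supervised accuracy must exceed zero-shot accuracy")
    return (method - zero_shot) / gap


# Hypothetical accuracies, used only to show the arithmetic.
example = fraction_of_gap_closed(zero_shot=0.60, method=0.70, supervised=0.80)
print(round(example, 3))
```

A value near 0.5 corresponds to "covering about half of the gap"; a value near 1.0 would mean the unsupervised method matches the supervised probe.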
I agree with you that there’s a lot of room to improve on this method, but I think that ultimately the core questions are quantitative, and as a result quantitative concerns about the method aren’t merely indicators that something needs to be improved but also affect whether you’ve gotten strong evidence for the core empirical conjecture. (Though I do think that your conjecture is more likely than not for subhuman models trained on human text.)
Thanks for the detailed comment! I agree with a lot of this.
So I’m particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences.
Yep, I agree with this; I’m currently thinking about/working on this type of thing.
I think “this intuition is basically incorrect” is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren’t more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if the truth is represented in a sufficiently simple way. But this seems very similar to the quantitative assumption required for regularized supervised methods to work.
This is a helpful clarification, thanks. I think I probably did just slightly misunderstand what you/others thought.
But I do personally think of unsupervised methods more broadly than just working well if truth is represented in a sufficiently simple way. I agree that many unsupervised methods—such as clustering—require that truth is represented in a simple way. But I often think of my goal more broadly as trying to take the intersection of enough properties that we can uniquely identify the truth.
The sense in which I’m excited about “unsupervised” approaches is that I intuitively feel optimistic about specifying enough unsupervised properties that we can do this, and I don’t really think human oversight will be very helpful for doing so. But I think I may also be pushing back more against approaches heavily reliant on human feedback like amplification/debate rather than e.g. your current thinking on ELK (which doesn’t seem as heavily reliant on human supervision).
I think the use of unsupervised methods is mostly helpful for validation; the method would be strictly more likely to work if you also throw in whatever labels you have, but if you end up needing the labels in order to get good performance then you should probably assume you are overfitting. That said, I’m not really sure if it’s better to use consistency to train + labels as validation, or labels to train + consistency as validation, or something else altogether.
I basically agree with your first point about it mostly being helpful for validation. For your second point, I’m not really sure what it’d look like to use consistency as validation. (If you just trained a supervised probe and found that it was consistent in ways that we can check, I don’t think this would provide much additional information. So I’m assuming you mean something else?)
If a model actually understands things humans don’t, I have much less of a clear picture for why natural language claims about the world would be represented in a super simple way. I agree with your claim 1, and I even agree with claim 2 if you interpret “represent” appropriately, but I think the key question is how simple it is to decode that representation relative to “use your knowledge to give the answer that minimizes the loss.” The core empirical hypothesis is that “report the truth” is simpler than “minimize loss,” and I didn’t find the analysis in this section super convincing on this point.
But I do agree strongly that this hypothesis has a good chance of being true (I think better than even odds), at least for some time past human level, and a key priority for AI alignment is testing that hypothesis. My personal sense is that if you look at what would actually have to happen for all of the approaches in this section to fail, it just seems kind of crazy. So focusing on those failures is more of a subtle methodological decision and it makes sense to instead cross that bridge if we come to it.
A possible reframing of my intuition is that representations of truth in future models will be pretty analogous to representations of sentiment in current models. But my guess is that you would disagree with this; if so, is there a specific disanalogy that you can point to so that I can understand your view better?
And in case it’s helpful to quantify, I think I’m maybe at ~75-80% that the hypothesis is true in the relevant sense, with most of that probability mass coming from the qualification “or it will be easy to modify GPT-n to make this true (e.g. by prompting it appropriately, or tweaking how it is trained)”. So I’m not sure just how big our disagreement is here. (Maybe you’re at like 60%?)
That said, I think I’m a bit more tentative about the interpretation of the results than you seem to be in this post. I think it’s pretty unsurprising to compete with zero-shot, i.e. it’s unsurprising that there would be cleanly represented features very similar to what the model will output. That makes the interpretation of the test a lot more confusing to me, and also means we need to focus more on outperforming zero shot.
For outperforming zero shot I’d summarize your quantitative results as CCS covering about half of the gap from zero-shot to supervised logistic regression. If LR was really just the “honest answers” then this would seem like a negative result, but LR likely teaches the model new things about the task definition and so it’s much less clear how to interpret this. On the other hand, LR also requires representations to be linear and so doesn’t give much evidence about whether truth is indeed represented linearly.
Maybe the main disagreement here is that I did find it surprising that we could compete with zero-shot just using unlabeled model activations. (In contrast, I agree that “it’s unsurprising that there would be cleanly represented features very similar to what the model will output”, but I would’ve expected to need a supervised probe to find this.) Relatedly, I agree our paper doesn’t give much evidence on whether truth will be represented linearly for future models on superhuman questions/answers; that wasn’t one of the main questions we were trying to answer, but it is certainly something I’d like to be able to test in the future.
(And as an aside, I don’t think our method literally requires that truth is linearly represented; you can also train it with an MLP probe, for example. In some preliminary experiments that seemed to perform similarly but less reliably than a linear probe—I suspect just because “truth of what a human would say” really is ~linearly represented in current models, as you seem to agree with—but if you believe a small MLP probe would be sufficient to decode the truth rather than something literally linear then this might be relevant.)
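For readers who want the objective spelled out, here is a minimal sketch of a CCS-style loss with a linear probe, assuming the usual formulation: a consistency term plus a confidence term over contrast pairs. The activations, `theta`, and `b` are hypothetical stand-ins, and the sketch omits the activation normalization and the optimization loop. The loss only sees probe outputs, so swapping `linear_probe` for a small MLP would not change it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_probe(theta, b, h):
    """Probability that a statement is true, read off one hidden state h."""
    return sigmoid(sum(t * x for t, x in zip(theta, h)) + b)


def ccs_loss(p_pos, p_neg):
    """Mean consistency + confidence loss over contrast pairs; no labels are used."""
    total = 0.0
    for p, q in zip(p_pos, p_neg):
        consistency = (p - (1.0 - q)) ** 2  # p(x+) should equal 1 - p(x-)
        confidence = min(p, q) ** 2         # penalizes the degenerate p = q = 0.5 answer
        total += consistency + confidence
    return total / len(p_pos)


# Hypothetical stand-in activations for two contrast pairs (hidden size 3).
h_pos = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
h_neg = [[-0.4, 0.9, -1.8], [-1.2, -0.1, 0.6]]
theta, b = [1.0, -0.5, 0.8], 0.0

probe_loss = ccs_loss([linear_probe(theta, b, h) for h in h_pos],
                      [linear_probe(theta, b, h) for h in h_neg])
degenerate_loss = ccs_loss([0.5, 0.5], [0.5, 0.5])  # always answering 0.5
ideal_loss = ccs_loss([1.0, 0.0], [0.0, 1.0])       # confident and consistent
print(probe_loss, degenerate_loss, ideal_loss)
```

The degenerate probe that always outputs 0.5 is perfectly consistent but still pays the confidence penalty, which is why the two terms are combined.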