For the sake of brevity, I won’t go into too many more details about our paper here; for more information, check out our summary on twitter or the paper itself.
Hmm, I went to twitter expecting more detail, but found it to be more like “a shorter version of this overall post” rather than “more detail on the implementation of the paper.” But here’s a copy of it for ease of reading:
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell? We show (http://arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*.
Existing techniques for training language models are misaligned with the truth: if we train models to imitate human data, they can output human-like errors; if we train them to generate highly-rated text, they can output errors that human evaluators can’t assess or don’t notice.
We propose trying to circumvent this issue by directly finding latent “truth-like” features inside language model activations without using any human supervision in the first place.
Informally, instead of trying to explicitly, externally specify ground truth labels, we search for implicit, internal “beliefs” or “knowledge” learned by a model.
This may be possible because truth has special structure: unlike most features in a model, it is *logically consistent*.
We make this intuition concrete by introducing Contrast-Consistent Search (CCS), a method that searches for a direction in activation space that satisfies negation consistency.
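To make the objective concrete, here is a minimal PyTorch sketch of the CCS loss as I understand it from the paper. The variable names, probe architecture, and training loop are my own illustrative choices, not the authors’ reference implementation:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps an activation vector to a probability that the statement is true."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Negation consistency: a statement and its negation should get
    # probabilities that sum to (roughly) 1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: rule out the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = torch.min(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs(acts_pos, acts_neg, epochs=1000, lr=1e-3):
    # acts_pos / acts_neg: (n_examples, hidden_dim) activations for each
    # statement phrased with answer "Yes" / "No" respectively.
    # (The paper also normalizes each set of activations; omitted here.)
    probe = LinearProbe(acts_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos), probe(acts_neg))
        loss.backward()
        opt.step()
    return probe
```

At inference time, a statement can then be scored with the average of p(x+) and 1 − p(x−) (as in the paper) and thresholded at 0.5.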
We find that on a diverse set of tasks (NLI, sentiment classification, cloze tasks, etc.), our method can recover correct answers from model activations with high accuracy (even outperforming zero-shot prompting) despite not using any labels or model outputs.
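As a rough illustration of what “recovering answers from activations” looks like in practice, here is a hypothetical sketch of building a contrast pair for a sentiment example and extracting hidden states with Hugging Face transformers; the model and prompt format are illustrative, not the paper’s exact setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # illustrative; the paper evaluates larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Return the final-layer hidden state of the last token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

review = "I loved this movie."
# Contrast pair: the same example phrased with each possible answer.
x_pos = f"Review: {review}\nIs the sentiment positive? Yes"
x_neg = f"Review: {review}\nIs the sentiment positive? No"
act_pos = last_token_activation(x_pos)
act_neg = last_token_activation(x_neg)
# Stacking many such pairs gives the acts_pos / acts_neg tensors
# used by the CCS sketch above.
```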
Among other findings, we also show that CCS really recovers something different from just the model outputs; it continues to work well in several cases where model outputs are unreliable or uninformative.
Of course, our work has important limitations and creates many new questions for future work. CCS still fails sometimes, and there’s a lot we don’t understand about when this type of approach should be feasible in the first place.
Nevertheless, we found it surprising that we could make substantial progress on this problem at all. (Imagine recording a person’s brain activity as you tell them T/F statements, then classifying those statements as true or false just from the raw, unlabeled neural recordings!)
This problem is important because as language models become more capable, they may output false text in increasingly severe and difficult-to-detect ways. Some models may even have incentives to deliberately “lie”, which could make human feedback particularly unreliable.
However, our results suggest that unsupervised approaches to making models truthful may also be a viable – and more scalable – alternative to human feedback.