Nice post. I’ll nitpick one thing.
In the paper, the approach was based on training a linear probe to differentiate between true and untrue question-answer pairs. I believe I mentioned to you at one point that “contrastive” seems more precise than “unsupervised” as a description of this method. To carry out an approach like this, it’s not enough to have or create a bunch of data: one also needs a way to reliably find subsets of the data that contrast. In general, that would be as hard as labeling. But when using boolean questions paired with “yes” and “no” answers, it is easy, and it might be plenty useful in general. Still, I wouldn’t expect it to be tractable in practice to reliably get good answers to open-ended questions out of a set of boolean ones in this way. Supervision also seems useful because it offers a more general tool.
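For concreteness, here is a rough sketch of the contrastive probe training I have in mind, loosely following the CCS objective from the paper: hidden states for the “yes” and “no” versions of each question are fed to a linear probe trained only on a consistency term and a confidence term, with no labels. The tensor names, normalization choice, and hyperparameters are illustrative, not the paper’s exact code.

```python
import torch
import torch.nn as nn

def normalize(x):
    # Remove per-set mean/scale so the probe can't just read off "yes"/"no" surface features.
    return (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)

def ccs_loss(probe, x_pos, x_neg):
    # x_pos / x_neg: hidden states for the same questions answered "yes" vs. "no",
    # shape (n_pairs, hidden_dim). No truth labels are used anywhere.
    p_pos = torch.sigmoid(probe(x_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(x_neg)).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2        # the two answers should get complementary probabilities
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p_pos = p_neg = 0.5 solution
    return (consistency + confidence).mean()

def train_probe(x_pos, x_neg, hidden_dim, steps=1000, lr=1e-3):
    x_pos, x_neg = normalize(x_pos), normalize(x_neg)
    probe = nn.Linear(hidden_dim, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe, x_pos, x_neg)
        loss.backward()
        opt.step()
    return probe
```

The point of the sketch is that the contrast between the two answer sets is what does the work; with open-ended answers there is no such cheap way to construct the paired sets.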
Thanks! I personally think of it as both “contrastive” and “unsupervised,” but I do think similar contrastive techniques can be applied in the supervised case too, as some prior work like https://arxiv.org/abs/1607.06520 has done. I agree it’s less clear how to do this for open-ended questions compared to boolean T/F questions, but I think the latter captures the core difficulty of the problem. For example, in the simplest case you could do rejection sampling for controllable generation of open-ended outputs. Alternatively, maybe you want to train a model to generate text that appears useful (as assessed by human supervision) while also being correct (as assessed by a method like CCS). So I agree supervision seems useful too for some parts of the problem.
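To spell out the rejection-sampling version: sample several candidate answers and keep the one a truth probe rates highest. A minimal hypothetical sketch, where `generate` and `truth_score` are placeholders (e.g. a language model sampler and a CCS-style probe applied to the question-answer pair), not real APIs:

```python
def best_of_n(question, generate, truth_score, n=16):
    """Sample n candidate answers and keep the one the probe rates most likely true."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda ans: truth_score(question, ans))
```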