Collin comments on How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin 20 Dec 2022 1:19 UTC
1 point
Thanks! I personally think of it as both “contrastive” and “unsupervised,” but I do think similar contrastive techniques can be applied in the supervised case too, as some prior work like https://arxiv.org/abs/1607.06520 has done. I agree it’s less clear how to do this for open-ended questions than for boolean T/F questions, but I think the latter captures the core difficulty of the problem. For example, in the simplest case you could do rejection sampling for controllable generation of open-ended outputs: sample candidate outputs and keep only those that a true/false check accepts. Alternatively, maybe you want to train a model to generate text that appears useful (as assessed by human supervision) while also being correct (as assessed by a method like CCS). So I agree supervision seems useful too for some parts of the problem.
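
To make the rejection-sampling idea concrete, here is a minimal sketch rather than an actual implementation. The names `rejection_sample`, `generate`, and `truth_score` are hypothetical: `generate` stands in for sampling an open-ended output from a language model, and `truth_score` stands in for a true/false probe (such as one trained with CCS) that returns an estimated probability that the text is correct. The threshold and retry budget are arbitrary illustrative values.

```python
import random
from typing import Callable, Optional


def rejection_sample(
    generate: Callable[[], str],
    truth_score: Callable[[str], float],
    threshold: float = 0.5,
    max_tries: int = 20,
) -> Optional[str]:
    """Return the first sampled candidate whose score meets the threshold.

    Returns None if no candidate is accepted within max_tries samples.
    """
    for _ in range(max_tries):
        candidate = generate()
        if truth_score(candidate) >= threshold:
            return candidate
    return None


# Toy stand-ins (assumptions for illustration only): a fixed pool of
# candidate outputs with hand-assigned "probability of being true" scores.
TOY_SCORES = {
    "Paris is the capital of France.": 0.95,
    "Lyon is the capital of France.": 0.05,
    "Marseille is the capital of France.": 0.10,
}
_rng = random.Random(0)


def toy_generate() -> str:
    # Stand-in for sampling from a language model.
    return _rng.choice(list(TOY_SCORES))


def toy_truth_score(text: str) -> float:
    # Stand-in for a true/false probe applied to the model's activations.
    return TOY_SCORES[text]


accepted = rejection_sample(toy_generate, toy_truth_score, threshold=0.5)
```

The generator is never modified here; the true/false check only filters its samples, which is why this is the simplest way to reuse a boolean method for open-ended outputs. The cost is extra sampling, and the loop returns `None` when nothing passes within the budget.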