I think it would probably not work too well if you mean simply “dump some in like any other text”, because it would be diluted by the hundreds of billions of other tokens, and much of it would be ‘wasted’ by being trained on while the model is still too stupid to learn the inner-monologue technique. (Given that smaller ~80b-parameter models don’t inner-monologue while larger ones like LaMDA & GPT-3 do, presumably the inner-monologue capability only emerges in the last few bits of loss separating the 80b-esque and 200b-esque models, and thus fairly late in training, at the point where the 200b-esque models pass the final loss of the 80b-esque models.) If you oversampled an inner-monologue dataset, or trained on it only at the very end (~equivalent to finetuning), or did some sort of prompt-tuning, then it might work. But compared to self-distilling, where you just run the model on the few-shot prompt + a bunch of questions & generate an arbitrary number *n* of samples to then finetune on, it would be expensive to collect that data, so why do so?
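For concreteness, a minimal sketch of that self-distillation loop might look like the following. The `sample_completion` sampling call and the `extract_answer` parser are hypothetical placeholders, not real APIs, and the step that keeps only monologues reaching a known answer is an extra assumption (STaR-style bootstrapping), not something specified above:

```python
import random

# A few-shot prompt with worked examples that demonstrate the
# inner-monologue ("let's think step by step") style.
FEW_SHOT_PROMPT = """Q: <worked example question>
A: Let's think step by step. <reasoning> So the answer is <answer>.

"""

def self_distill(questions, answers, sample_completion, extract_answer,
                 n_samples_per_question=8, temperature=0.8):
    """Generate inner-monologue samples from the model itself via the
    few-shot prompt, keep the ones whose final answer matches the known
    answer, and return them as a finetuning set.

    `sample_completion(prompt, temperature)` and `extract_answer(text)`
    are assumed to be supplied by the caller (hypothetical interfaces).
    """
    finetune_set = []
    for question, gold in zip(questions, answers):
        prompt = FEW_SHOT_PROMPT + f"Q: {question}\nA: Let's think step by step."
        for _ in range(n_samples_per_question):
            monologue = sample_completion(prompt, temperature=temperature)
            if extract_answer(monologue) == gold:
                # Store the bare question + generated monologue as a training
                # pair, so the finetuned model learns to produce the monologue
                # without needing the few-shot scaffold in the prompt.
                finetune_set.append({
                    "prompt": f"Q: {question}\nA:",
                    "completion": monologue,
                })
    random.shuffle(finetune_set)
    return finetune_set
```

Since the model generates its own training data, the only marginal cost is compute for sampling, which is the cheapness argument being made against hand-collecting an inner-monologue corpus.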