with a tag containing its provenance (arXiv, the various scientific journals, publishing houses, Reddit, etc.) and the publication date (and perhaps other attributes, like word count); these tags are made available to the network during training. The prompt you give to GPT-6 can then contain the tag for the desired origin of the text and the date it was produced. This avoids the easy failure mode of GPT-6 outputting my comment or some random blog post, because those texts will not have been annotated as "official published book" in the training set, nor will they carry the tagged word count.
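A minimal sketch of what this tagging scheme might look like (the tag format and function names here are hypothetical, not from any real training pipeline): each training document is prefixed with its true metadata, and at inference time the prompt asserts the metadata of the text you want.

```python
# Hypothetical metadata-conditioning sketch: prepend structured tags to each
# training document so the model can condition on provenance, date, and
# length at generation time.

def tag_document(text: str, source: str, date: str) -> str:
    """Prefix a document with the metadata tags the model sees during training."""
    header = f"<|source={source}|><|date={date}|><|words={len(text.split())}|>"
    return header + text

# Training time: every document carries its actual metadata.
doc = tag_document("Chapter 1. It was a bright cold day...",
                   source="published_book", date="1949-06-08")

# Inference time: the prompt claims the metadata of the desired output,
# steering generation toward "official published book" text rather than
# comments or blog posts, which were never tagged that way in training.
prompt = "<|source=published_book|><|date=2024-01-01|><|words=80000|>"
```

The point is that the model never sees a blog comment tagged as a published book, so conditioning on that tag should make such outputs much less likely.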
If you include something like reviews or quotes praising its accuracy, then you’re moving towards Decision Transformer territory with feedback loops...