We have previously seen that a major challenge for alignment is that our models may learn <@inaccessible information@>(@Inaccessible information@) that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a concrete problem that makes this challenge tangible.
Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it “should” have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem sufficient to elicit this ability.
One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).
Planned summary for the Alignment Newsletter: