Suppose that we want to translate between English and an alien language (Klingon). We have plenty of Klingon text, and separately we have plenty of English text, but it’s not matched up and there are no bilingual speakers.
We train GPT on a mix of English and Klingon text and find that it becomes fluent in both. In some sense this model “knows” quite a lot about both Klingon and English, and so it should be able to read a sentence in one language, understand it, and then express the same idea in the other language. But it’s not clear how we could train a translation model.
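To make the setup concrete, here is a toy sketch of the situation: a single statistical language model trained on an unpaired mix of two corpora can score text from either language above random noise, yet nothing in it links a sentence in one language to its counterpart in the other. (The character-bigram model, the sample sentences, and the smoothing constant are all illustrative assumptions of mine, not anything from the post.)

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count character bigrams over a corpus of sentences."""
    pair_counts, context_counts = Counter(), Counter()
    for sentence in corpus:
        padded = "^" + sentence + "$"  # sentence-boundary markers
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts

def avg_log_prob(pair_counts, context_counts, sentence, vocab=100):
    """Add-one-smoothed average log-probability per bigram."""
    padded = "^" + sentence + "$"
    total = sum(
        math.log((pair_counts[(a, b)] + 1) / (context_counts[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )
    return total / (len(padded) - 1)

# Unpaired corpora: the model never sees which sentences correspond.
english = ["the cat sat", "the dog ran", "a cat ran"]
klingon = ["tlhIngan maH", "qapla jaj", "nuqneH tera"]  # made-up sample strings

pair_counts, context_counts = train_bigram_lm(english + klingon)

# The single mixed model scores plausible text from *either* corpus above
# random noise -- it is "fluent" in both in this crude sense -- but it
# contains no translation signal at all: no pairing, no shared semantics.
noise = avg_log_prob(pair_counts, context_counts, "zzq xxv")
print(avg_log_prob(pair_counts, context_counts, "the cat ran") > noise)
print(avg_log_prob(pair_counts, context_counts, "qapla maH") > noise)
```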
Christiano discusses the difficulty of judging whether an unsupervised translation is good: since no independent raters understand both English and Klingon, translations can't be improved with RLHF.
He posted this before OpenAI succeeded in applying RLHF to LLMs. I now think RLHF generally doesn't improve translation ability much anyway compared to prompting a foundation model; based on what we have seen, it seems hard in general to improve raw LLM abilities with RLHF. Even if RLHF does improve translation relative to good prompting, I would expect that doing RLHF on known translation pairs (like English and Chinese) would also help for pairs that never appeared in the RLHF data, e.g. by encouraging the model to mention its uncertainty about the meaning of certain terms when translating. Though again, this could likely be achieved with prompting as well.
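As a concrete illustration of the prompting alternative, here is a minimal sketch of a prompt template that asks the model to flag uncertain terms alongside its translation. The wording, function name, and sample text are all my own assumptions, not a tested recipe:

```python
def build_translation_prompt(source_text, source_lang="Klingon", target_lang="English"):
    """Assemble a translation prompt that also asks the model to flag
    terms whose meaning it is unsure about. Illustrative sketch only."""
    return (
        f"Translate the following {source_lang} text into {target_lang}.\n"
        f"After the translation, list any {source_lang} terms whose meaning\n"
        f"you are uncertain about, with a short note on the ambiguity.\n\n"
        f"{source_lang} text:\n{source_text}\n\n"
        f"{target_lang} translation:"
    )

# Hypothetical Klingon input; the uncertainty instruction is baked into
# every prompt rather than trained in via RLHF.
prompt = build_translation_prompt("nuqneH tera'ngan")
print("uncertain" in prompt)
```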
He also mentions the more general problem that language models don't know why they believe what they believe. If a model translates X as Y rather than as Z, it can't give the actual reasons for that decision (such as pointing to specific statistics of the training data), except via post hoc rationalisation/confabulation.
The scenario above is from Paul Christiano's “Unsupervised” translation as an (intent) alignment problem (2020).