Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.”
Some tentative thoughts:

Re Debate:

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).
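To make that setup concrete, here is a toy sketch in Python; the dictionaries, the sample passages, and the coverage-based stand-in for the judge’s “helpfulness” call are all hypothetical placeholders, not part of the proposal itself:

```python
import random

# Toy sketch of the proposed Debate setup. Each debater submits a full
# Klingon-to-English dictionary; the judge samples a Klingon passage and
# picks the dictionary that helps more.
dictionary_a = {"qapla'": "success", "tlhIngan": "Klingon", "Hol": "language"}
dictionary_b = {"qapla'": "victory", "tlhIngan": "warrior", "Hol": "speech"}

klingon_corpus = [
    "tlhIngan Hol vIjatlh",   # placeholder passages
    "qapla' batlh je",
]

def helpfulness(dictionary, passage):
    """Crude stand-in for the judge: fraction of passage tokens the
    dictionary lets the judge gloss. A real judge would instead ask which
    dictionary better supports understanding (or predicting) the passage,
    possibly after hearing the two players debate."""
    tokens = passage.split()
    return sum(token in dictionary for token in tokens) / len(tokens)

passage = random.choice(klingon_corpus)
winner = "A" if helpfulness(dictionary_a, passage) >= helpfulness(dictionary_b, passage) else "B"
print(f"Judge sampled {passage!r}; provisional winner: player {winner}")
```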
Also, one might try to use GPT to complete prompts such as:
The researchers analyzed the Klingon phrase “מהדקי נייר” and concluded it roughly means
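As a rough illustration of that idea (assuming the Hugging Face transformers library, with the small public “gpt2” checkpoint standing in for a model actually trained on text that includes Klingon):

```python
from transformers import pipeline

# Sketch of the prompt-completion idea. "gpt2" is only a stand-in; the
# proposal assumes a model trained on a corpus that contains Klingon text.
generator = pipeline("text-generation", model="gpt2")

prompt = ('The researchers analyzed the Klingon phrase "..." '
          'and concluded it roughly means')

# Sample several completions and inspect which translations the model
# assigns non-trivial probability to.
completions = generator(prompt, max_new_tokens=20, do_sample=True,
                        num_return_sequences=5)
for completion in completions:
    print(completion["generated_text"])
```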
In both of these approaches we still need to deal with the potential problem of catastrophic inner alignment failures occurring before the point where we have sufficiently useful helper models. [EDIT: and in the Debate-based approach there’s also an outer alignment problem: a player may try to manipulate the judge into choosing them as the winner.]
The researchers analyzed the Klingon phrase “מהדקי נייר” and concluded it roughly means
If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon; your model isn’t actually going to expand what humans can do.
Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).
This is basically what the helper model does, except:
For competitiveness you should learn and evaluate the dictionary at the same time you are training the model; running a debate experiment many times where debaters have to output a full dictionary would likely be prohibitively expensive.
Most knowledge about language isn’t easily captured in a dictionary (for example, a human using a Spanish-English dictionary is a mediocre translator), so we’d prefer to have a model that answers questions about meaning rather than a model that outputs a static dictionary.
I don’t know what standard you want to use for “helpful for understanding the passage” but I think “helps predict the next word correctly” is probably the best approach (since the goal is to be competitive and that’s how GPT learned).
After making those changes we’re back at the “learning the prior” proposal.
I think that proposal may work passably here because we can potentially get by with a really crude prior—basically we think “the helper should mostly just explain the meaning of terms” and then we don’t need to be particularly opinionated about which meanings are more plausible. I agree that the discussion in the section “A vague hope” is a little bit too pessimistic for the given context of unaligned translation.
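To illustrate the “helps predict the next word correctly” standard suggested above, here is a rough sketch that scores a candidate dictionary by how much prepending it improves a language model’s log-likelihood on a sampled passage (the checkpoint, dictionary, and passage are placeholders; a real version would use the model actually being trained):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Score a candidate dictionary by how much it helps the model predict the
# next word in a Klingon passage. "gpt2" is a placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(context: str, passage: str) -> float:
    """Average log-probability of the passage's tokens given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    pas_ids = tokenizer(passage, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, pas_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Row i of log_probs is the distribution over the token at position i+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    n_ctx, n_pas = ctx_ids.shape[1], pas_ids.shape[1]
    targets = input_ids[0, n_ctx:]
    token_lps = log_probs[n_ctx - 1 : n_ctx - 1 + n_pas].gather(1, targets.unsqueeze(1))
    return token_lps.mean().item()

dictionary = "qapla' = success; tlhIngan = Klingon; Hol = language"
passage = "tlhIngan Hol vIjatlh"   # placeholder sampled passage

gain = (avg_log_likelihood(dictionary + "\n", passage)
        - avg_log_likelihood("\n", passage))
print(f"Log-likelihood gain from prepending the dictionary: {gain:.3f}")
```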
If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon; your model isn’t actually going to expand what humans can do.
Agreed. Perhaps it’s possible to iteratively train GPT models in an Amplification-like setup, where in each iteration we add to the English training corpus some newly possible translations, aiming to end up with something like an HCH translator. (We may not need to train a language model from scratch in each iteration; at the extreme, we just do fine-tuning on the new translations.)
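A high-level sketch of that loop (every function below is a hypothetical placeholder with a trivial body, included only to show the shape of the iteration, not a real training API):

```python
# Sketch of the Amplification-like iteration: each round, add the translations
# that have newly become possible to the English corpus, then fine-tune the
# model on the enlarged corpus instead of retraining from scratch.

def translate_newly_tractable_passages(model, klingon_corpus):
    """Hypothetical placeholder: English translations that the current model
    (plus human effort) can now produce, e.g. passages whose vocabulary is
    covered by meanings pinned down in earlier rounds."""
    return []

def fine_tune(model, corpus):
    """Hypothetical placeholder for ordinary language-model fine-tuning."""
    return model

def amplification_like_training(model, english_corpus, klingon_corpus, rounds=5):
    for _ in range(rounds):
        new_translations = translate_newly_tractable_passages(model, klingon_corpus)
        english_corpus = english_corpus + new_translations
        model = fine_tune(model, english_corpus)  # no retraining from scratch
    return model
```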