Ethan Perez comments on “Unsupervised” translation as an (intent) alignment problem

Ethan Perez 20 Oct 2020 3:31 UTC
1 point
What encourages the helper model to generate correct explanations as opposed to false/spurious ones?
E.g., let’s say the text is a list of fruit, and the correct next word is Klingon for “pineapple”. I’m imagining that the helper model could just say “The next word is [Klingon for pineapple]”, or give an alternate/spurious explanation of the Klingon text (“The text is discussing a spiky fruit that goes on pizza”). Both of these unhelpful/spurious explanations would still let me predict the next Klingon word correctly.