(By applying Bayes’ rule, and then renormalizing.) They found that, empirically, it worked better to instead find
argmax[e] Pr(e)^1.5 * Pr(f|e)
What this means is that whatever’s generating Pr(f|e) is generating overconfident numbers (or equivalently in this case, that whatever generates Pr(e) is generating underconfident numbers). This corrects for that.
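To make the effect of the exponent concrete, here is a minimal sketch in Python. The candidate sentences and their probabilities are made up for illustration, and `best_translation` and `lm_weight` are hypothetical names, not anything from the original system. Working in log space, raising Pr(e) to the power 1.5 is just multiplying log Pr(e) by 1.5.

```python
import math

def best_translation(candidates, lm_weight=1.5):
    """Return argmax over e of Pr(e)^lm_weight * Pr(f|e).

    candidates maps each English candidate e to a pair (Pr(e), Pr(f|e)).
    Scores are computed in log space to avoid underflow.
    """
    def score(e):
        p_e, p_f_given_e = candidates[e]
        return lm_weight * math.log(p_e) + math.log(p_f_given_e)
    return max(candidates, key=score)

# Made-up numbers: the scrambled candidate gets a high Pr(f|e) from an
# overconfident translation model, but a low Pr(e) from the language model.
candidates = {
    "the house is small": (1e-4, 0.001),
    "house the small is": (1e-6, 0.2),
}

plain_bayes = best_translation(candidates, lm_weight=1.0)
reweighted = best_translation(candidates, lm_weight=1.5)
```

With these numbers, plain Bayes (weight 1.0) picks the scrambled sentence, while the 1.5 exponent gives the language model enough extra say to pick the grammatical one.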
It’s a little confusing that this was presented as a modification to Bayes’ rule, rather than as a calibration factor applied to the underlying estimators, but it’s really the latter. The reason for putting it here, rather than there, is probably that calibrating each of the original estimates separately would introduce a spurious degree of freedom, since only the relative weights matter.
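The "spurious degree of freedom" point can be spelled out in a line or two. Suppose, as an assumption for illustration, that each estimator got its own positive calibration exponent, a for Pr(e) and b for Pr(f|e):

```latex
\begin{align*}
\operatorname*{argmax}_e \Pr(e)^{a}\,\Pr(f \mid e)^{b}
  &= \operatorname*{argmax}_e \left(\Pr(e)^{a/b}\,\Pr(f \mid e)\right)^{b} \\
  &= \operatorname*{argmax}_e \Pr(e)^{a/b}\,\Pr(f \mid e),
  \qquad a, b > 0,
\end{align*}
```

since x ↦ x^b is strictly increasing on nonnegative x when b > 0. So the winner depends only on the ratio a/b, and the 1.5 above is that ratio.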
Excellent explanation. I would add that the source of this overconfidence is not a mystery at all. Models for estimating Pr(f|e) are so ridiculously simplistic that a layperson would laugh us out of the room if we explained them to her in plain English instead of formulas. For example, Pr(f|e) was sometimes defined as the probability that we can produce f from e by first applying a randomly chosen lexicon translation to each word of e, and then doing a random local reordering of the words. Here the whole responsibility for finding a reordering that yields a grammatical English sentence rests on the shoulders of Pr(e). It’s almost as if the translation model spits out a bag of words, and the language model has to assemble them into a chain of words.
(The above simple example is far from the state of the art, but the actual state of the art is not that much more realistic either.)
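The bag-of-words point can be seen in a toy version of such a model. This is only a sketch under loud assumptions: the lexicon and its probabilities are invented, `translation_prob` is a hypothetical name, and the reordering is taken to be uniform over all permutations rather than "local" as described above, to keep the code short.

```python
from itertools import permutations
from math import factorial, prod

# Invented toy lexicon: LEXICON[e_word][f_word] = Pr(f_word | e_word).
LEXICON = {
    "the": {"la": 0.5, "le": 0.5},
    "house": {"maison": 0.9, "domicile": 0.1},
    "small": {"petite": 0.6, "petit": 0.4},
}

def translation_prob(f_words, e_words):
    """Toy Pr(f|e): translate each word of e independently via the lexicon,
    then shuffle the result uniformly at random (assumption: every
    reordering is equally likely). Brute force, so only for short inputs."""
    if len(f_words) != len(e_words):
        return 0.0
    n = len(e_words)
    total = 0.0
    # Sum over every way of matching positions in e to positions in f.
    for order in permutations(range(n)):
        total += prod(
            LEXICON.get(e_words[i], {}).get(f_words[j], 0.0)
            for i, j in enumerate(order)
        )
    return total / factorial(n)

f = ["la", "petite", "maison"]
grammatical = ["the", "small", "house"]
scrambled = ["house", "the", "small"]

p_grammatical = translation_prob(f, grammatical)
p_scrambled = translation_prob(f, scrambled)
```

Under these assumptions the two values come out equal: the toy translation model cannot tell "the small house" from "house the small", so telling them apart is left entirely to Pr(e).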