I asked it to make a joke with a certain expression in the punchline. It consistently put the expression in the first part of the “joke”, even when prodded to do it right. Disappointing.
Huh, I’m guessing that’s a limitation of the way it generates text, or of the way it learned the distribution? I’ve never seen such a clear illustration of that before. Prediction and action really are distinct tasks?
On reflection, does OpenAI only train it to predict the next word? Wouldn’t they also train it to predict the previous word, or words in between?
I’ve no idea what OpenAI actually does, but just as a matter of general probabilistic modeling, a model that has learned to predict the next word given previous words has also implicitly learned a model of the joint distribution of all words. (Since the joint probability of a, b, c is just P(a)P(b|a)P(c|a,b).) Given the joint distribution of all words, you can go backwards and deduce the conditional distribution of each word given the following words. Or you can get the conditional distribution of a word given all words both before and after. These conditional distributions are probably harder to get computationally than the forward conditionals that the model directly gives, but the computations are probably not completely infeasible.
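To make that concrete, here is a minimal sketch in Python. Everything in it is hypothetical: a three-word vocabulary and a hand-made `p_next` table standing in for a real language model’s forward conditionals. It just shows the probability manipulation: chain the forward conditionals into a joint distribution, then recover a backward conditional (first word given the following words) by Bayes’ rule.

```python
import itertools

VOCAB = ["a", "b", "c"]

def p_next(word, prefix):
    """Hypothetical forward conditional P(word | prefix). Any proper
    distribution would do; this one mildly prefers repeating the
    previous word, just so it isn't uniform."""
    weights = {w: (2.0 if prefix and w == prefix[-1] else 1.0) for w in VOCAB}
    return weights[word] / sum(weights.values())

def p_joint(seq):
    """Chain rule: P(w1, w2, w3) = P(w1) P(w2 | w1) P(w3 | w1, w2)."""
    p = 1.0
    for i, w in enumerate(seq):
        p *= p_next(w, seq[:i])
    return p

def p_first_given_rest(w2, w3):
    """Backward conditional P(w1 | w2, w3), deduced from the joint by
    Bayes' rule: evaluate the joint for each candidate w1, normalize."""
    joint = {w1: p_joint((w1, w2, w3)) for w1 in VOCAB}
    z = sum(joint.values())
    return {w1: p / z for w1, p in joint.items()}

# Sanity check: the joint sums to 1 over all length-3 sequences,
# confirming the forward model implicitly defines a proper joint.
assert abs(sum(p_joint(s) for s in itertools.product(VOCAB, repeat=3)) - 1.0) < 1e-9

print(p_first_given_rest("a", "b"))
```

Note that the normalization enumerates every candidate word at the unknown position. With a realistic vocabulary of tens of thousands of tokens, and more so with several unknown positions, those sums blow up, which is the sense in which the backward conditionals are harder to get than the forward ones the model gives directly.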
So in theory there’s no benefit from training on the backwards sequence as well as the forward sequence, though in practice it’s conceivable that there could be: the training procedure is no doubt only an approximation to an ideal statistical procedure, and that approximation might work better when training goes both ways, though offhand this seems unlikely.