Proposition 1 is wrong. The coin flips that are eternally
0 0 0 0
are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.

So, a softmax can never emit a probability of 0 or 1; maybe they were implicitly assuming the model ends in a softmax (as is the common case)? Regardless, the proof is still wrong if a model is allowed unbounded context, as an infinite product of positive numbers less than 1 can still be nonzero. For example, if the probability of emitting another “ 0” after already having emitted $n$ copies of “ 0” is even just as high as $1 - \frac{1}{n^{1.001}}$, then the limiting probability is still nonzero.
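To make that concrete, here is a quick numerical sanity check (my own sketch, not from the paper). It uses the exponent 2 in place of 1.001, purely because $\prod_{n \ge 2}(1 - 1/n^2)$ telescopes to exactly $1/2$ and converges fast enough to see numerically; the same criterion, $\sum_n (1 - p_n) < \infty$, covers the 1.001 case.

```python
import math

# An infinite product of probabilities p_n < 1 stays bounded away from 0
# whenever sum(1 - p_n) converges. Here p_n = 1 - 1/n^2 (exponent 2
# instead of 1.001 so the convergence is visible in finite time);
# the product telescopes to exactly 1/2.
log_prob = 0.0
for n in range(2, 1_000_001):  # start at n = 2 so every factor is positive
    log_prob += math.log1p(-1.0 / (n * n))  # log(1 - 1/n^2), accumulated in log space

print(math.exp(log_prob))  # ~0.5000005, converging to the limit 1/2, not to 0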
But if the model has a finite context and ends in a softmax, then I think there is some minimum probability of transitioning to a given token, and then the proposition is true. Maybe that was implicitly assumed?
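For what it's worth, here is how I'd sketch that argument (my notation, not the paper's): if the model is a function $f$ from contexts to logits with context length $k$ over vocabulary $\Sigma$, it only ever sees finitely many contexts, so

$$\epsilon \;=\; \min_{c \in \Sigma^{\le k}} \, \min_{t \in \Sigma} \, \mathrm{softmax}\bigl(f(c)\bigr)_t \;>\; 0,$$

since a softmax of finite logits assigns every token a strictly positive probability, and a minimum over a finite set of positive numbers is positive. The probability of emitting any fixed token $N$ times in a row is then at most $(1 - \epsilon)^N \to 0$.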
Yeah, I think it was implicitly assumed that there existed some ε>0 such that no token ever had probability >1−ε.
Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it’s reasonable to assume that such an ε always exists in practice (if we also assume that the probability gets rounded to something < 1).
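That parenthetical is doing real work, by the way; here is a small numpy illustration (mine, not from the paper) of how finite precision can also round the other way and produce a probability of exactly 1:

```python
import numpy as np

# In float32, exp(-30) ~ 9.4e-14 vanishes next to 1.0 (float32's relative
# precision near 1.0 is ~1.2e-7), so the normalizer rounds to 1.0 and the
# top token gets probability exactly 1.0 -- no epsilon exists unless the
# outputs are explicitly clamped below 1.
logits = np.array([30.0, 0.0], dtype=np.float32)
z = np.exp(logits - logits.max())  # standard max-subtraction for stability
p = z / z.sum()

print(p)            # [1.0000000e+00 9.3576230e-14]
print(p[0] == 1.0)  # True: hence the "rounded to something < 1" assumption
```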
Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the “non-degenerate” requirement into the text. In practice, the proposition holds pretty well, although I agree it would be nice to have a deeper understanding of when to expect the transition rule to be “non-degenerate”.