Hi Erik! Thank you for the careful read, this is awesome!
Regarding Proposition I: I think you're right, that counter-example disproves the proposition. The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge! I'll fix this in the post.
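(For completeness, the monotone part of the fixed statement is immediate from the chain rule; a quick sketch:

$$P(s_a, s_1, \ldots, s_B, s_{B+1}) = P(s_a, s_1, \ldots, s_B)\, P(s_{B+1} \mid s_a, s_1, \ldots, s_B) \le P(s_a, s_1, \ldots, s_B),$$

so the prefix probabilities are non-increasing and the limit exists; that the limit is actually zero is the part that needs extra assumptions.)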
Regarding Proposition II: janus had the same intuition, and I tried to explain it with the following argument. When the distance between tokens becomes large enough, eventually all bridges between the first token and an arbitrary second token end up with approximately the same "cost". At that point, only the prior likelihood of the token decides which token gets sampled. So Proposition II implies something like $P(s_b) \sim \exp\left[-(B+1)\max_{s_1,\ldots,s_B} P(s_a, s_1, \ldots, s_B, s_b)\right]$, or, in the limit, "the probability of the most likely sequence ending in $s_b$ will be (when appropriately normalized) proportional to the probability of $s_b$", which seems sensible? (Assuming something like ergodicity.) Although I'm now becoming a bit suspicious about the sign of the exponent; perhaps there is a "log" or a minus missing on the RHS. I'll think about that a bit more.
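As a toy illustration of the "same cost" intuition, here is a minimal sketch in the special case of an ergodic Markov chain (my simplifying assumption here, not something we establish for LMs): as $B$ grows, the distribution over the bridge endpoint forgets the start token and collapses to the stationary, i.e. "prior", distribution.

```python
import numpy as np

# Hypothetical 3-token ergodic Markov chain, chosen only for illustration.
T = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
])

# P(s_b | s_a, bridge of length B) is row s_a of T^(B+1).
for B in [0, 1, 5, 20, 50]:
    print(f"B={B:2d}  P(s_b | s_a=0) = {np.linalg.matrix_power(T, B + 1)[0]}")

# Stationary distribution pi: the left eigenvector of T for eigenvalue 1.
# Every row of T^(B+1) converges to pi, so for distant tokens only the
# prior likelihood of s_b decides what gets sampled.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()
print("stationary pi =", pi)
```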
> The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge!
In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs.)
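To make one such reason concrete with a toy process (not an LM): if some token $s_1$ satisfies $P(s_1 \mid s_a) = 1$ and $P(s_1 \mid s_a, s_1, \ldots, s_1) = 1$ at every subsequent step, then $P[(s_a, s_1, \ldots, s_1)] = P(s_a) > 0$ for every $B$, so the prefix probability is constant rather than decaying to zero.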
> So Proposition II implies something like $P(s_b) \sim \exp\left[-(B+1)\max_{s_1,\ldots,s_B} P(s_a, s_1, \ldots, s_B, s_b)\right]$, or, in the limit, "the probability of the most likely sequence ending in $s_b$ will be (when appropriately normalized) proportional to the probability of $s_b$", which seems sensible?
One crux here is the "appropriately normalized": why should the normalization be linear, i.e. just $B + 1$? I buy that there are some important systems where this holds, and maybe it even holds for LMs, but it certainly won't be true in general (e.g. sometimes you need exponential normalization). Even modulo that issue, the claim still isn't obvious to me, but that may be a good place to start (i.e. an explanation of where the normalization factor comes from would plausibly also clear up my remaining skepticism).
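To gesture at where a linear factor *can* come from, here is a sketch under a time-homogeneous Markov assumption with transition matrix $T$ (my assumption, not the post's):

$$\log \max_{s_1,\ldots,s_B} P(s_a, s_1, \ldots, s_B, s_b) = \log P(s_a) + \max_{s_1,\ldots,s_B} \sum_{i=0}^{B} \log T_{s_i, s_{i+1}}, \qquad s_0 := s_a,\; s_{B+1} := s_b.$$

Here the maximized quantity is a sum of $B+1$ per-step terms, so dividing by $B+1$ gives a per-step rate that converges (by a subadditivity argument). For processes with long-range dependence the sum need not grow linearly in $B$, which is exactly where I'd expect the normalization to stop being linear.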