They typically are uniform, but this doesn’t feel like the most useful place to argue minutiae, unless there’s a cruxy point underneath that I’m not spotting. “The training process for LLMs can optimize for distributional correctness at the expense of sample plausibility, and is functionally different from processes like GANs in this regard” is a clarification with empirically relevant stakes, but I don’t know what the stakes are for this digression.
I was just trying to clarify the limits of autoregressive learning versus other learning methods. Autoregressive learning is at an apparent disadvantage if P(X_t | X_{t-1}) is hard to compute while the reverse, P(X_{t-1} | X_t), is easy to compute and low entropy. It can “make up for this” somewhat if it can do a good job of predicting X_t from X_{t-2}, but it’s still at a disadvantage if, for example, that prediction is relatively high entropy compared to predicting X_{t-1} from X_t. That’s it, I’m satisfied.
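To make the asymmetry concrete, here is a toy sketch. It is my own illustration, not something from the discussion above: I assume X_{t-1} = G^{X_t} mod P, so going from X_t to X_{t-1} is one modular exponentiation, while going from X_{t-1} to X_t means solving a discrete log, done here by exhaustive search. The names `P`, `G`, `reverse_step`, and `forward_step` are hypothetical. Both directions are deterministic in this toy, so the gap it shows is purely computational rather than a difference in entropy.

```python
# Toy setup (an assumption for illustration): X_{t-1} = G ** X_t mod P.
P = 10007  # small prime, toy-sized so the brute-force search finishes quickly
G = 5      # base; not assumed to be a primitive root


def reverse_step(x_t):
    """Compute X_{t-1} from X_t: a single modular exponentiation (easy)."""
    return pow(G, x_t, P)


def forward_step(x_prev):
    """Find some X_t consistent with X_{t-1} by exhaustive search (hard).

    Returns (x_t, number_of_candidates_tried).
    """
    value = 1  # invariant: value == G ** candidate mod P
    for candidate in range(P - 1):
        if value == x_prev:
            return candidate, candidate + 1
        value = (value * G) % P
    raise ValueError("no exponent maps to x_prev")


x_t = 1234
x_prev = reverse_step(x_t)                # one cheap call
recovered, tries = forward_step(x_prev)   # many candidates tried
print(f"reverse: 1 call; forward: {tries} candidates tried")
```

A left-to-right predictor that sees X_{t-1} first has to do the equivalent of `forward_step`, whereas a method that gets to condition on X_t only needs `reverse_step`. With a realistically sized prime the search would be infeasible rather than merely slow.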