A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don’t already know):
The former are (approximately) symmetric; the latter isn’t (it can be much harder to predict a string front-to-back than back-to-front).
The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn’t strictly determine the response, if the model isn’t expressive enough to fully capture the distribution).
So it seems like there’s plenty of room for a measure which is “more sensible” than the former and “more principled” than the latter.
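To make the second difference concrete, here is a minimal sketch contrasting a string-only compression proxy with a model-dependent negative log-likelihood. The toy bigram table and the uniform prior over the first token are illustrative assumptions, not anything from the comment above:

```python
import math
import zlib

# String-only measure: compressed length in bits (up to the choice of compressor,
# analogous to "up to choice of UTM" for Kolmogorov complexity).
s = "abababababababababababab"
compressor_bits = 8 * len(zlib.compress(s.encode()))

# Model-dependent measure: total negative log-likelihood (in bits) of the same
# string under a toy autoregressive model. Here the "model" is a hypothetical
# fixed bigram table; a different model gives a different number for the same string.
bigram_probs = {
    ("a", "b"): 0.9, ("a", "a"): 0.1,
    ("b", "a"): 0.9, ("b", "b"): 0.1,
}
nll_bits = -math.log2(0.5)  # first token: assume a uniform prior over {a, b}
for prev, nxt in zip(s, s[1:]):
    nll_bits += -math.log2(bigram_probs[(prev, nxt)])

print(f"compressor bits: {compressor_bits}, model NLL bits: {nll_bits:.1f}")
```

Swapping in a different model (or a string better matched to the model) changes the second number while leaving the string itself untouched, which is exactly the extra dependence on the model and data distribution pointed out above.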
Predicting a string front-to-back can indeed be easier or harder than predicting it back-to-front. Crutchfield has a very natural measure for this, called the causal irreversibility.
In short, given a data stream, Crutchfield constructs a minimal (but maximally predictive) forward model S+, which predicts the future given the past (i.e. the next tokens given the context), and a minimal, maximally predictive (retrodictive?) backward model S−, which predicts the past given the future (i.e. the previous tokens given ‘future’ contexts).
The remarkable thing is that these models don’t have to be the same size, as shown by a simple example (the ‘random insertion process’) whose forward model has 3 states and whose backward model has 4 states.
The causal irreversibility is, roughly speaking, the difference between the sizes of the forward and backward models (their statistical complexities).
See this paper for more details.
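For concreteness, here is a minimal sketch of how one might compute that difference once the stationary state distributions of the forward and backward models are in hand. The distributions below are placeholders, not the actual values for the random insertion process:

```python
import math

def statistical_complexity(state_probs):
    """Shannon entropy (in bits) of a stationary distribution over causal states."""
    return -sum(p * math.log2(p) for p in state_probs if p > 0)

# Hypothetical stationary distributions over the forward (3-state) and
# backward (4-state) causal states.
forward_states = [0.5, 0.25, 0.25]
backward_states = [0.4, 0.3, 0.2, 0.1]

C_plus = statistical_complexity(forward_states)    # forward model "size" in bits
C_minus = statistical_complexity(backward_states)  # backward model "size" in bits

# Causal irreversibility, taken here as the backward-minus-forward gap
# (sign conventions vary; the point is the gap between the two model sizes).
irreversibility = C_minus - C_plus
print(f"C+ = {C_plus:.3f} bits, C- = {C_minus:.3f} bits, gap = {irreversibility:.3f} bits")
```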