Sorry if this is a spoiler for your next post, but I take issue with the heading “Standard measures of information theory do not work” and the implication that this post contains the pre-Crutchfield state of the art.
The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn’t to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post). It’s to treat the generative model as a distribution for each bit conditional on the previous bits, and to use the cross-entropy of that distribution under the data distribution as the loss function or measure of goodness of the generative model.
So in this example, “look at the previous bits, identify the current position relative to the 01x01x pattern, and predict 0, 1, or [50-50 distribution] as appropriate” is the best you can do (given sufficient data for the 50-50 proportion to be reasonably accurate) and is indeed an accurate model of the process that generated the data.
We can see the pattern and take the current position into account because the distribution is conditional on previous bits.
Predicting 011011011… doesn’t do as well because cross-entropy penalizes unwarranted overconfidence.
Predicting 50-50 for each bit doesn’t do as well because cross-entropy still cares about successful predictions.
(Formally, cross-entropy is an expectation over the data distribution instead of an empirical average over a bunch of sampled data, but the term is used in both cases in practice. “Log[-likelihood] loss” and “the log scoring rule” are other common terms for the empirical version.)
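To make the comparison concrete, here is a minimal sketch of the empirical log loss (in bits per symbol) for the three predictors discussed above on a sampled 01x01x stream. The function names are mine, and the pattern-aware predictor cheats slightly by assuming the stream starts in phase, so it reads the position off the prefix length instead of inferring it from the previous bits.

```python
import math
import random


def sample_stream(n_blocks, rng):
    """Sample the 01x01x... process: 0, 1, then a fair random bit, repeated."""
    bits = []
    for _ in range(n_blocks):
        bits.extend([0, 1, rng.randint(0, 1)])
    return bits


def pattern_aware(prefix):
    """P(next bit = 1), using the position in the 01x pattern.

    Assumption: the stream starts at the beginning of a block, so the
    position is just len(prefix) % 3.
    """
    return (0.0, 1.0, 0.5)[len(prefix) % 3]


def overconfident(prefix):
    """P(next bit = 1) for a model that predicts 011011... with certainty."""
    return (0.0, 1.0, 1.0)[len(prefix) % 3]


def coin_flip(prefix):
    """P(next bit = 1) for a model that ignores the context entirely."""
    return 0.5


def log_loss_bits(predict, bits):
    """Average of -log2 P(observed bit | previous bits) over the stream."""
    total = 0.0
    for i, bit in enumerate(bits):
        p_one = predict(bits[:i])
        p_observed = p_one if bit == 1 else 1.0 - p_one
        if p_observed == 0.0:
            # The model ruled out something that happened: infinite penalty.
            return math.inf
        total -= math.log2(p_observed)
    return total / len(bits)


stream = sample_stream(300, random.Random(0))
for name, model in [("pattern-aware", pattern_aware),
                    ("overconfident", overconfident),
                    ("coin flip", coin_flip)]:
    print(name, log_loss_bits(model, stream))
```

The pattern-aware model pays 1 bit only on every third symbol (1/3 bit per symbol on average), the coin-flip model pays 1 bit on every symbol, and the overconfident model's loss is infinite as soon as a single x comes up 0.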
As I said above, this isn’t just a standard information theory approach; it’s actually how GPT-3 and other LLMs were trained.
I’m curious about Crutchfield’s thing, but so far not convinced that standard information theory isn’t adequate in this context.
(I think Kolmogorov complexity is also relevant to LLM interpretability, philosophically if not practically, but that’s beyond the scope of this comment.)
A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don’t already know):
The former are (approximately) symmetric, the latter isn’t (it can be much harder to predict a string front-to-back than back-to-front)
The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn’t strictly determine the response, if the model isn’t expressive enough to fully capture the distribution).
So it seems like there’s plenty of room for a measure which is “more sensible” than the former and “more principled” than the latter.
Predicting a string front-to-back can indeed differ in difficulty from predicting it back-to-front. Crutchfield has a very natural measure for this called the causal irreversibility.
In short, given a data stream, Crutchfield constructs a minimal (but maximally predictive) forward predictive model S+, which predicts the future given the past (or the next tokens given the context), and a minimal, maximally predictive (retrodictive?) backward model S−, which predicts the past given the future (or the previous token based on ‘future’ contexts).
The remarkable thing is that these models don’t have to be the same size, as shown by a simple example (the ‘random insertion process’) whose forward model has 3 states and whose backward model has 4 states.
The causal irreversibility is, roughly speaking, the difference between the sizes of the forward and backward models.
See this paper for more details.
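For concreteness, the "difference in size" can be written down as follows. This is a sketch under the assumption that size is measured by statistical complexity, i.e. the Shannon entropy of the stationary distribution over each model's states, as is usual in Crutchfield's computational mechanics. The symbols are my notation, not taken from the comment above.

```latex
C_\mu^{+} = H[\mathcal{S}^{+}], \qquad
C_\mu^{-} = H[\mathcal{S}^{-}], \qquad
\Xi = C_\mu^{+} - C_\mu^{-}
```

Under that assumption, the state counts in the random insertion example only give upper bounds: the forward complexity is at most log2(3) bits and the backward complexity at most log2(4) = 2 bits. The actual values depend on how often each state is visited.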
Yeah follow-up posts will definitely get into that!
To be clear: (1) the initial posts won’t be about Crutchfield’s work yet, just introducing some background material and overarching philosophy; (2) the claim isn’t that standard measures of information theory are bad. To the contrary! If anything we hope these posts will be somewhat of an ode to information theory as a tool for interpretability.
Adam wanted to add a lot of academic caveats. I was adamant that we streamline the presentation to make it short and snappy for a general audience, but it appears I might have overshot! I will make an edit to clarify. Thank you!
I agree with you about the importance of Kolmogorov complexity philosophically and would love to read a follow-up post on your thoughts about Kolmogorov complexity and LLM interpretability :)