Nice explanation of MSP and good visuals.
Were you in fact surprised? If so, why? (This is a straightforward consequence of the good regulator theorem[1].)
In general I’d encourage you to carefully track claims about transformers, HMM-predictors, and LLMs, and to distinguish between trained NNs and the training process. In this writeup, all of these are quite blended.
[1] John has a good explication here.
IIUC, the good regulator theorem doesn’t say anything about how the model of the system should be represented in the activations of the residual stream. I think the potentially surprising part is that the model is recoverable with a linear probe.
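For concreteness, a minimal sketch of such a linear probe (ordinary least squares in numpy). The arrays, shapes, and random stand-in data here are made up for illustration; they are not the post's actual activations or belief states.

```python
import numpy as np

# Illustrative stand-ins (random data), not the post's actual setup:
#   acts    : (n_samples, d_model)  residual-stream activations at some layer
#   beliefs : (n_samples, n_states) ground-truth belief states (points in the simplex)
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
beliefs = rng.dirichlet(np.ones(3), size=1000)

# Fit an affine map from activations to belief states by least squares (the "linear probe").
X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)

# If the belief-state geometry is linearly embedded, the projected points X @ W
# reproduce it; with random stand-in data this fit will of course be poor.
pred = X @ W
print("mean squared error:", float(np.mean((pred - beliefs) ** 2)))
```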
It’s surprising for a few reasons:
The structure of the points in the simplex is NOT:
The next token prediction probabilities (i.e. the thing we explicitly train the transformer to do)
The structure of the data generating model (i.e. the thing the good regulator theorem talks about, if I understand the good regulator theorem, which I might not)
The first would not be surprising because it’s literally what our loss function asks for, and the second might not be that surprising since it’s the intuitive thing people often think of when we say “model of the world.” But the MSP structure is neither of those things. It’s the structure of inference over the model of the world, which is quite a different beast from the model of the world itself.
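To make the distinction concrete, here is a toy sketch with made-up numbers (not the post's actual process): the matrices T and E are the data-generating model, next_token_probs is the object the loss trains for, and belief is the object whose geometry the MSP describes.

```python
import numpy as np

# Toy 3-state HMM with illustrative parameters (not the post's process).
T = np.array([[0.8, 0.1, 0.1],    # T[i, j] = P(next hidden state j | hidden state i)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.05, 0.05],  # E[i, k] = P(emit token k | hidden state i)
              [0.05, 0.9, 0.05],
              [0.05, 0.05, 0.9]])

def update_belief(belief, token):
    """One step of Bayesian filtering: propagate through T, then condition on the observed token."""
    b = (belief @ T) * E[:, token]
    return b / b.sum()

belief = np.ones(3) / 3                    # prior over hidden states
for token in [0, 0, 1, 2]:                 # some observed sequence
    next_token_probs = (belief @ T) @ E    # what the loss explicitly asks the network to output
    belief = update_belief(belief, token)  # the belief state; the MSP is the geometry of these points
    print(token, next_token_probs.round(3), belief.round(3))
```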
Others might not find it as surprising as I did—everyone is working off their own intuitions.
edit: also I agree with what Kave said about the linear representation.
I guess my question would be ‘how else did you think a well-generalising sequence model would achieve this?’ Like, what is a sufficient world model but a posterior over HMM states in this case? This is what GR theorem asks. (Of course, a poorly-fit model might track extraneous detail or have a bad posterior.)
From your preamble and your experiment design, it looks like you correctly anticipated the result, so this should not have been a surprise (to you). In general I object to being sold something as surprising which isn’t (it strikes me as a lesser-noticed and perhaps oft-inadvertent rhetorical dark art and I see it on the rise on LW, which is sad).
That said, since I’m the only one objecting here, you appear to be more right about the surprisingness of this!
The linear probe is new news (but not surprising?) on top of GR, I agree. But the OP presents the other aspects as the surprises, and not this.
I agree with you that the new/surprising thing is the linearity of the probe. I also agree that it’s not entirely clear how surprising & new the linearity of the probe is.
If you understand how the causal state construction & the MSP work in computational mechanics, the experimental result isn’t surprising. Indeed, it can’t be any other way! That’s exactly the magic of the definition of causal states.
What one person might find surprising or new, another thinks trivial. The subtle magic of the right theoretical framework is that it makes the complex simple and surprising phenomena apparent.
Before learning about causal states I would not even have considered that there is a unique (!) optimal minimal predictor canonically constructible from the data, nor that the geometry of synchronizing belief states is generically a fractal. Of course, once one has properly internalized the definitions this is almost immediate. Pretty pictures can be helpful in building that intuition!
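For readers without the background, here is my paraphrase of the standard construction being invoked (nothing new from the post): two histories are lumped together exactly when they induce the same distribution over futures, and the resulting equivalence classes are the causal states.

```latex
% Causal-state equivalence over histories (my paraphrase of the standard definition):
% two pasts belong to the same causal state iff they predict the same distribution over futures.
\overleftarrow{x} \sim_\varepsilon \overleftarrow{x}\,'
  \quad\Longleftrightarrow\quad
  P\!\left(\overrightarrow{X} \mid \overleftarrow{X}=\overleftarrow{x}\right)
  = P\!\left(\overrightarrow{X} \mid \overleftarrow{X}=\overleftarrow{x}\,'\right),
\qquad
\varepsilon(\overleftarrow{x}) \;=\; \bigl\{\,\overleftarrow{x}\,' : \overleftarrow{x}\,' \sim_\varepsilon \overleftarrow{x}\,\bigr\}.
```

The MSP's belief states are then the posteriors over these causal states reached after finite observation prefixes, which is where the fractal geometry shows up.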
Adam and I (and many others) have been preaching the gospel of computational mechanics for a while now. Most of it has fallen on deaf ears before. Like you, I have been (positively!) surprised and amused by the sudden outpouring of interest. No doubt it’s in part a testimony to the Power of the Visual! Never look a gift horse in the mouth!
I would say the parts of computational mechanics I am really excited about are a little deeper, downstream of causal states & the MSP. This is just a taster.
I’m confused & intrigued by your insistence that this follows from the good regulator theorem. Like Adam, I don’t understand it. My understanding is that the original ‘theorem’ was wordcelled nonsense, but that John has been able to formulate a nontrivial version of the theorem. My experience is that the theorem is often invoked in a handwavey way that leaves me no less confused than before. No doubt due to my own ignorance!
I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.
Lol! I guess if there was a more precise theorem statement in the vicinity of what was gestured at, it wasn’t nonsense? But in any case, I agree the original presentation is dreadful. John’s is much better.
A quick go at it, might have typos.
Suppose we have
X (hidden) state
Y output/observation
and a predictor
S (predictor) state
^Y predictor output
R the reward or goal or what have you (some way of scoring ‘was ^Y right?’)
with structure
X → Y
X → R
Y → S → ^Y → R
Then GR trivially says S (predictor state) should model the posterior P(X|Y).
Now if these are all instead processes (time-indexed), we have an HMM
Xt (hidden) states
Yt observations
and predictor process
St (predictor) states
^Yt predictions
Rt rewards
with structure
Xt → Xt+1
Xt → Yt
St−1 → St
Yt → St → ^Yt+1 → Rt+1
Yt+1 → Rt+1
Drawing together (Xt+1, Yt+1, ^Yt+1, Rt+1) as Gt, the ‘goal’, we have a GR motif
Xt → Yt
Yt → St → Gt
St−1 → St
Xt → Gt
so St must model P(Xt|St−1,Yt); by induction that is P(Xt|S0,Y1,...,Yt).
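To spell out the inductive step, here is the standard recursive Bayes filter for the HMM (my rendering of the claim, in the notation above):

```latex
% One step of Bayesian filtering: the new posterior over hidden states depends on the
% observation history only through the previous posterior and the latest observation.
P\!\left(X_{t+1} \mid Y_1,\dots,Y_{t+1}\right)
  \;\propto\;
  P\!\left(Y_{t+1} \mid X_{t+1}\right)
  \sum_{x_t} P\!\left(X_{t+1} \mid X_t = x_t\right)\, P\!\left(x_t \mid Y_1,\dots,Y_t\right).
```

So a state St that carries P(Xt|Y1,...,Yt) suffices to produce the next posterior from Yt+1 alone, which is exactly the inductive claim.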