This is very interesting work, showing that the fractal graph is a good way to visualize the predictive model being learned. I’ve had many conversations with folks who struggle with the idea: ‘the model is just predicting the next token, how can it be doing anything interesting?’ My standard response had been that, conceptually, the transformer matches up tokens at the first layer (using the key and query vectors), then matches up sentences a few layers up, and then paragraphs a few layers above that; hence the model, when presented with an input, is not just responding with ‘the next most likely token’, but more accurately with ‘the best token to start the best sentence to start the best paragraph to answer the question’. That usually helped get the complexity across, but I like the learned fractal of the belief state and will see how well I can use it in the future.
For future work, I think it would be interesting to tease out how the system learns 2 interacting state machines (this may give hints about its ability to generalize to different actors in the world). For example, consider another 3-state HMM with the same transition probabilities but evolving independently of the 1st HMM. Then let the probability of outputting A, B, or C be the average of the emission distributions of the arcs taken by the 2 HMMs each step. For example, if the 1st HMM is in H0 and stays in H0 it gives a 60% chance of generating A and a 20% chance each for B and C, while if the 2nd HMM is in H2 and stays in H2 it gives 20% each for A and B and 60% for C, so the overall output probability is 40% A, 20% B, 40% C in my example. Now certainly this is a 9-state HMM (3x3), but it is more simply represented as two 3-state HMMs; what would the neural network learn? What if you combined 3 HMMs this way, so the single HMM has 3x3x3 = 27 states, but the simpler representation is 3+3+3 = 9? Again, my goal here would be to understand how the system might model multiple agents in the world given limited direct visibility into the agents. Perhaps there is a cleaner way to explore the same question.
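To make the construction concrete, here is a minimal sketch in Python/NumPy of the sampler I have in mind. The transition matrix and per-arc emission probabilities below are placeholders I picked for illustration (only the 60/20/20 self-loop numbers come from my example above); the actual process parameters used in the post would differ.

```python
import numpy as np

# Hypothetical 3-state HMM: a shared transition matrix T and, for each
# (from_state, to_state) arc, a distribution over the tokens A, B, C.
# These numbers are placeholders, not the post's actual parameters.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])   # T[i, j] = P(next state j | current state i)

# emit[i, j] = P(token | arc i -> j); each arc into H_k favours the k-th token,
# so e.g. the self-loop on H0 emits A with 60% and B, C with 20% each.
emit = np.full((3, 3, 3), 0.2)
for j in range(3):
    emit[:, j, j] = 0.6

def step(rng, state):
    """Advance one component HMM by a step; return (new state, arc emission dist)."""
    nxt = rng.choice(3, p=T[state])
    return nxt, emit[state, nxt]

def sample_combined(rng, n_steps):
    """Run two independent copies; each step, emit a token drawn from the
    average of the two arcs' emission distributions."""
    s1, s2 = rng.integers(3), rng.integers(3)
    tokens = []
    for _ in range(n_steps):
        s1, d1 = step(rng, s1)
        s2, d2 = step(rng, s2)
        tokens.append("ABC"[rng.choice(3, p=(d1 + d2) / 2)])
    return "".join(tokens)

def product_hmm():
    """The same process written as a single 9-state HMM on pairs (s1, s2):
    transition probabilities multiply, arc emission distributions average."""
    T9 = np.kron(T, T)                    # 9 x 9 transition matrix
    emit9 = np.zeros((9, 9, 3))
    for s1 in range(3):
        for t1 in range(3):
            for s2 in range(3):
                for t2 in range(3):
                    emit9[3 * s1 + s2, 3 * t1 + t2] = (emit[s1, t1] + emit[s2, t2]) / 2
    return T9, emit9

rng = np.random.default_rng(0)
print(sample_combined(rng, 30))
```

The product_hmm helper is there to make the 3x3 = 9 equivalence explicit: transitions multiply across the two components while arc emissions average, so the same token stream can be described either way, and the question is which description the trained network's belief-state geometry ends up reflecting.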