One nice little prediction from this approach: you’d expect the first few tokens to have denser (in the SAE sense) features, since there is less context, so the “HMM” could be in a broad range of states. Whereas once you’ve seen more tokens, you have much more information, the state is pinned down more precisely, and you’d expect the features to be sparser.
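A minimal sketch of this prediction (my own toy example, not from the comment): a small HMM where the forward filtering posterior p(z_t | x_{1:t}) typically concentrates as more tokens arrive, so a readout of E[z_t | x_{1:t}] activates fewer “features” later in context. All parameters (state/observation counts, the stickiness, the 0.05 activity threshold) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs, T = 8, 5, 20

# Sticky transitions and peaked emissions keep states identifiable over time.
A = np.full((n_states, n_states), 0.02)
np.fill_diagonal(A, 1.0)
A /= A.sum(axis=1, keepdims=True)                      # p(z_t | z_{t-1})
B = rng.dirichlet(np.full(n_obs, 0.3), size=n_states)  # p(x_t | z_t)
pi = np.full(n_states, 1.0 / n_states)                 # uniform initial state

# Sample one trajectory of hidden states and observed tokens.
z = np.zeros(T, dtype=int)
x = np.zeros(T, dtype=int)
z[0] = rng.choice(n_states, p=pi)
x[0] = rng.choice(n_obs, p=B[z[0]])
for t in range(1, T):
    z[t] = rng.choice(n_states, p=A[z[t - 1]])
    x[t] = rng.choice(n_obs, p=B[z[t]])

# Forward pass: belief = p(z_t | x_{1:t}). With one-hot states this belief
# vector *is* E[z_t | x_{1:t}], i.e. the content of a mean parameter code.
belief = pi * B[:, x[0]]
belief /= belief.sum()
for t in range(T):
    entropy = -np.sum(belief * np.log(belief + 1e-12))
    n_active = int(np.sum(belief > 0.05))  # crude "active feature" count
    print(f"t={t:2d}  posterior entropy={entropy:.3f}  active states={n_active}")
    if t + 1 < T:
        belief = (A.T @ belief) * B[:, x[t + 1]]
        belief /= belief.sum()
```

Running this, the entropy and active-state count tend to fall with t, matching the denser-early / sparser-late prediction (stochastically, per run).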
There’s also a big literature in computational neuroscience on how to represent probabilities. This approach suggests a “mean parameter code”, where the LLM activations are a function of E[z | data]. But lots of other possibilities are available, e.g. see:
http://www.gatsby.ucl.ac.uk/teaching/courses/tn1-2021/slides/uncert-slides.pdf
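For contrast, a quick sketch (my illustration; the two codes are standard in that literature, and the numbers are made up) of how the same posterior could be exposed in activations under different codes:

```python
import numpy as np

# One belief vector p(z | data) over 8 hypothetical HMM states.
belief = np.array([0.70, 0.12, 0.08, 0.04, 0.03, 0.02, 0.005, 0.005])

# Mean parameter code: activations linear in E[z | data]; with one-hot z
# that expectation is just the belief vector itself.
mean_code = belief

# Log-probability code (as in probabilistic population codes): activations
# linear in log p(z | data), here shifted so the largest activation is 0.
log_code = np.log(belief) - np.log(belief).max()

print("mean parameter code: ", np.round(mean_code, 3))
print("log-probability code:", np.round(log_code, 3))
```

The two codes carry the same information but make different quantities linearly decodable, which matters for what a linear probe (or an SAE) would find.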