There is a 2-state generative HMM for which the optimal predictor of its output provably requires an infinite number of states. This holds for any model of computation and any architecture.
Huh, either I’m misunderstanding or this is wrong.
If you have Hidden Markov Models like in this post (so you have a finite number of states, fixed transition probabilities between them, and outputs depending on the transitions), then the optimal predictor is simple: do Bayesian updates on the distribution over the current hidden state based on the observations. For each new observation, you only need O(states^2) computations, a single matrix-vector product. Furthermore, this is very parallelizable across states, requiring only O(1) serial steps per observation.
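Concretely, here is a minimal sketch of that Bayes-filter update for a generic finite HMM (the transition matrices and observation sequence below are made up, purely for illustration):

```python
# A minimal sketch of the Bayes-filter update described above, for a generic
# finite HMM.  T[y][i, j] = P(next state = j, emit symbol y | current state = i);
# all parameters here are illustrative, not taken from the post.
import numpy as np

def update_belief(belief, T_y):
    """One Bayesian update of P(hidden state | observations so far) after seeing
    symbol y: a single matrix-vector product followed by renormalization."""
    unnormalized = belief @ T_y            # O(states^2) arithmetic, parallel across states
    return unnormalized / unnormalized.sum()

# An arbitrary 2-state HMM over symbols {0, 1}; rows of T[0] + T[1] sum to 1.
T = {
    0: np.array([[0.5, 0.0],
                 [0.3, 0.2]]),
    1: np.array([[0.0, 0.5],
                 [0.0, 0.5]]),
}
belief = np.array([1.0, 0.0])              # suppose we start knowing the hidden state
for y in [0, 1, 1, 0]:                     # any observation sequence
    belief = update_belief(belief, T[y])
print(belief)                              # belief state = sufficient statistic for optimal prediction
```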
You are absolutely right, and I am of course absolutely and embarrassingly wrong.
The minimal optimal predictor of the simple nonunifilar source, when represented as a Hidden Markov Model, is indeed infinite. This implies that any other architecture expressing the optimal predictor must be able to represent infinitely many belief states, but that is quite a weak statement: it is easy for a machine with finite memory to dynamically compute infinitely many such states, since each one is just a probability distribution over the two hidden states. In particular, a transformer should absolutely be able to learn the MSP (mixed-state presentation) of the epsilon machine of the simple nonunifilar source; indeed, it can even be solved analytically.
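To make the finite-memory point concrete, here is a small sketch with an illustrative nonunifilar 2-state HMM (the probabilities are made up and not necessarily the canonical simple nonunifilar source): the number of distinct reachable belief states keeps growing with prefix length, yet each belief is just two numbers updated with a constant amount of arithmetic per observation.

```python
# Illustrative nonunifilar 2-state HMM (made-up probabilities).  The set of
# reachable belief states ("mixed states") keeps growing with prefix length,
# yet tracking any one of them takes only two numbers.
import itertools
import numpy as np

# T[y][i, j] = P(next state = j, emit y | current state = i); T[0] + T[1] is stochastic.
# Nonunifilar: from state 0, emitting a 0 can lead to either state.
T = {
    0: np.array([[0.50, 0.25],
                 [0.25, 0.25]]),
    1: np.array([[0.00, 0.25],
                 [0.25, 0.25]]),
}
start = np.array([0.5, 0.5])  # stationary distribution of T[0] + T[1]

def belief_after(prefix):
    """Belief over hidden states after observing `prefix`, via Bayesian filtering."""
    b = start
    for y in prefix:
        b = b @ T[y]
        b = b / b.sum()
    return tuple(b.round(10))

for length in range(1, 9):
    beliefs = {belief_after(w) for w in itertools.product((0, 1), repeat=length)}
    print(length, len(beliefs))  # count of distinct belief states reached by length-`length` prefixes
```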
This was an embarrassing mistake I should not have made. I regret my rash overconfidence; I should have taken a moment to think it through, since the statement was obviously wrong. Thank you for pointing it out.
Behold