Depending on what one means by ‘learn’, this is provably impossible. The reason has nothing to do with the transformer architecture (which one shouldn’t think of as a canonical architecture in the grand scheme of things anyway).
There is a 2-state generative HMM such that the optimal predictor of the output of said generative model provably requires an infinite number of states. This is for any model of computation, any architecture.
Of course, that’s maybe not what you intend by ‘learn’. If by ‘learn’ you mean expressing the underlying function of an HMM, then the answer is yes by the Universal Approximation Theorem (a very fancy name for a trivial application of the Stone-Weierstrass theorem).
Hope this helped. 😄
Huh, either I’m misunderstanding or this is wrong.
If you have Hidden Markov Models like in this post (so you have a finite number of states, fixed transition probabilities between them, and outputs depending on the transitions), then the optimal predictor is simple: do Bayesian updates on the current hidden state based on the observations. For each new observation, you only need O(n^2) computations for n hidden states (a belief-vector times transition-matrix product, reweighted by the observation likelihood). Furthermore, this is very parallelizable, requiring only O(1) serial steps per observation.
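For concreteness, here is a minimal sketch of that update in Python (the 2-state transition and emission matrices are made-up illustrative numbers, not taken from the post):

```python
import numpy as np

# T[i, j] = P(next hidden state j | current hidden state i)
# E[j, y] = P(observation y | hidden state j)
# Both matrices are illustrative placeholders.
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
E = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def update_belief(belief, obs):
    """One Bayesian update of the belief over hidden states, given an observation."""
    predicted = belief @ T                 # propagate the belief through the transitions: O(n^2)
    unnormalized = predicted * E[:, obs]   # reweight by the likelihood of the observation
    return unnormalized / unnormalized.sum()

belief = np.array([0.5, 0.5])              # some prior over the 2 hidden states
for obs in [0, 1, 1, 0]:                   # an arbitrary observation sequence
    belief = update_belief(belief, obs)
    print(belief)
```

The belief vector is all the memory the predictor ever needs, regardless of how long the observation sequence gets.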
You are absolutely right and I am of course absolutely and embarrassingly wrong.
The minimal optimal predictor of the simple nonunifilar source, presented as a Hidden Markov Model, is indeed infinite. This implies that any other architecture must be capable of expressing infinitely many states, but that is quite a weak statement: a machine with a finite description and real-valued memory can easily range over infinitely many belief states dynamically. In particular, a transformer should absolutely be able to learn the mixed-state presentation (MSP) of the epsilon machine of the simple nonunifilar source; indeed, it can even be solved analytically.
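To make that point concrete, here is a sketch of the belief-state dynamics for a 2-state HMM in the spirit of the simple nonunifilar source (the particular transition probabilities are my own illustrative choice, not necessarily the canonical parameterization): a single fixed-size belief vector, updated online, keeps landing on new belief states as a run of 0s gets longer, so finite memory suffices to range over infinitely many predictor states.

```python
import numpy as np

# Labelled transition matrices: T_sym[y][i, j] = P(emit y, go to state j | state i).
# The probabilities are an illustrative guess at a simple-nonunifilar-style source.
T_sym = {
    0: np.array([[0.5, 0.5],
                 [0.0, 0.5]]),
    1: np.array([[0.0, 0.0],
                 [0.5, 0.0]]),
}

def update(belief, y):
    """Bayesian update of the belief over hidden states after observing symbol y."""
    unnormalized = belief @ T_sym[y]
    return unnormalized / unnormalized.sum()

belief = np.array([1.0, 0.0])      # start synchronized to the first hidden state
seen = set()
for y in [0] * 20:                 # a run of 0s keeps producing brand-new beliefs
    belief = update(belief, y)
    seen.add(tuple(np.round(belief, 6)))
print(len(seen), "distinct belief states after 20 symbols")
```

With these numbers the belief in the first hidden state after n consecutive 0s is 1/(n+1), so no two beliefs ever coincide; the predictor visits infinitely many states, yet each step only needs the current 2-dimensional belief vector.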
This was an embarrassing mistake I should not have made. I regret my rash overconfidence—I should have taken a moment to think it through since the statement was obviously wrong. Thank you for pointing it out.
Behold
Where can I read about this 2-state HMM? By ‘learn’ I just mean approximate via an algorithm. The UAT is not sufficient, as it only concerns approximating a known function. Baum-Welch is such an algorithm, but as far as I am aware it gives no guarantees on anything really.
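For what I mean by ‘approximate via an algorithm’, here is a minimal Baum-Welch (EM) sketch; it assumes the hmmlearn library, where the discrete-emission model is called CategoricalHMM in recent versions, and the data below is random placeholder noise, so the fit itself is meaningless beyond showing the procedure:

```python
import numpy as np
from hmmlearn import hmm  # assumes hmmlearn >= 0.3 (CategoricalHMM for discrete emissions)

rng = np.random.default_rng(0)
observations = rng.integers(0, 2, size=(1000, 1))   # placeholder symbols; use real samples here

# Baum-Welch is EM on the HMM likelihood: it converges to a local optimum,
# with no guarantee of recovering the true generative parameters.
model = hmm.CategoricalHMM(n_components=2, n_iter=100, random_state=0)
model.fit(observations)

print(model.transmat_)        # estimated transition probabilities
print(model.emissionprob_)    # estimated emission probabilities
```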