One thing I’m concerned about is that this seems most likely to work for rigid structures like CNNs and RNNs, rather than dynamic structures like transformers. Admittedly, the original proof of concept was done in a transformer, but that transformer was modelling a Markov model; in the general case, transformers can model non-Markov processes.
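To make the Markov/non-Markov distinction concrete, here’s a toy sketch (my own illustration, not from the original proof of concept): a Markov process's next state depends only on the current state, while an attention readout mixes over the entire context, so the prediction can depend on arbitrarily old tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Markov process: the next state depends ONLY on the current state.
P = np.array([[0.9, 0.1],   # 2-state transition matrix
              [0.3, 0.7]])

def markov_step(state: int) -> int:
    return rng.choice(2, p=P[state])

# Attention-style readout: the output at the last position is a
# weighted mix over ALL previous positions, so it can condition on
# the whole history, not just the most recent token.
def attention_readout(history: np.ndarray) -> np.ndarray:
    # history: (n, d) embeddings of the context so far
    q = history[-1]                       # query from the last position
    scores = history @ q                  # (n,) similarity to every position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ history              # mixes the entire context
```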
Well, sort of: they ultimately still have a fixed context window, but the difficulty of solving the quadratic bottleneck suggests that this window is an important distorting factor in how transformers work (though maybe Mamba will save us, idk).
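For reference, the quadratic bottleneck is just that naive self-attention materialises an n × n score matrix, so compute and memory grow quadratically with the context length. A minimal sketch, with the simplifying assumption Q = K = V = X:

```python
import numpy as np

def naive_attention(X: np.ndarray) -> np.ndarray:
    """Single-head causal self-attention with Q = K = V = X for brevity.

    X: (n, d). The score matrix S is (n, n), so both compute and memory
    scale as O(n^2) in the context length n -- the quadratic bottleneck.
    State-space models like Mamba replace this with a recurrence whose
    cost is linear in n.
    """
    n, d = X.shape
    S = X @ X.T / np.sqrt(d)              # (n, n) pairwise scores
    # Causal mask: position i may only attend to positions <= i.
    S = np.where(np.tril(np.ones((n, n), dtype=bool)), S, -np.inf)
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    W /= W.sum(axis=-1, keepdims=True)    # row-wise softmax
    return W @ X                          # (n, d) output
```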