I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I’d like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
Do we expect future model architectures to be biased toward out-of-context reasoning (reasoning internally rather than in a chain-of-thought)? As in, what kinds of capabilities would lead companies to build models that reason less and less in token-space?
I mean, the first obvious incentive is that training the model to internalize some of the reasoning means you don't have to pay for the extra tokens every time you want it to do complex reasoning.
The thing is, I expect we’ll eventually move away from just relying on transformers with scale. And so I’m trying to refine my understanding of the capabilities that are simply bottlenecked in this paradigm, and that model builders will need to resolve through architectural and algorithmic improvements. (Of course, based on my previous posts, I still think data is a big deal.)
Anyway, this kind of thinking eventually leads to the infohazardous area of, “okay then, what does the true AGI setup look like?” This is really annoying because it has alignment implications. If we start to move increasingly towards models that are reasoning outside of token-space, then alignment becomes harder. So, are there capability bottlenecks that eventually get resolved through something that requires out-of-context reasoning?
So far, it seems like the current paradigm will not be an issue on this front. Keep scaling transformers, and you don’t really get any big changes in the model’s likelihood of using out-of-context reasoning.
This is not limited to out-of-context reasoning. I'm trying to get a better understanding of the (dangerous) properties future models may develop simply as a result of needing to break a capability bottleneck. My worry is that many people end up over-indexing on the current transformer+scale paradigm (which may prove insufficient for ASI), and so they don't work on the right kinds of alignment or governance projects.
---
I’m unsure how big of a deal this architecture will end up being, but the rumoured xLSTM just dropped. According to the paper’s benchmarks, it outperforms other models at the same parameter count.
Maybe it ends up just being another drop in the bucket, but I think we will see more attempts in this direction.
Claude summary:
The key points of the paper are:
The authors introduce exponential gating with memory mixing in the new sLSTM variant. This allows the model to revise storage decisions and solve state tracking problems, which transformers and state space models without memory mixing cannot do.
They equip the mLSTM variant with a matrix memory and covariance update rule, greatly enhancing the storage capacity compared to the scalar memory cell of vanilla LSTMs. Experiments show this matrix memory provides a major boost.
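To make the matrix-memory idea concrete, here is a rough numpy sketch of one mLSTM memory step as I understand the paper's covariance update rule: the memory stores a value vector under a key via an outer product, and a query later retrieves it. Function and variable names are my own, and this omits the layer's projections and gate pre-activation details.

```python
import numpy as np

def mlstm_step(C, n, k, v, q, i_gate, f_gate):
    """One mLSTM memory step: covariance-style update of a matrix memory.

    C: (d, d) matrix memory; n: (d,) normalizer state;
    k, v, q: key/value/query vectors; i_gate/f_gate: scalar gates.
    Sketch only -- names and simplifications are mine, not the paper's code.
    """
    d = k.shape[0]
    k = k / np.sqrt(d)                        # scale keys, as in attention
    C = f_gate * C + i_gate * np.outer(v, k)  # covariance update: store v under key k
    n = f_gate * n + i_gate * k               # normalizer tracks gate-weighted keys
    h = C @ q / max(abs(n @ q), 1.0)          # retrieve the value addressed by q
    return C, n, h
```

The contrast with a vanilla LSTM is the memory's shape: a d-by-d matrix per head instead of a scalar per cell, which is where the claimed storage-capacity boost comes from.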
The sLSTM and mLSTM are integrated into residual blocks to form xLSTM blocks, which are then stacked into deep xLSTM architectures.
Extensive experiments demonstrate that xLSTMs outperform state-of-the-art transformers, state space models, and other LSTMs/RNNs on language modeling tasks, while also exhibiting strong scaling behavior to larger model sizes.
This work is important because it presents a path forward for scaling LSTMs to billions of parameters and beyond. By overcoming key limitations of vanilla LSTMs—the inability to revise storage, limited storage capacity, and lack of parallelizability—xLSTMs are positioned as a compelling alternative to transformers for large language modeling.
Instead of doing all computation step-by-step as tokens are processed, advanced models might need to store and manipulate information in a compressed latent space, and then “reason” over those latent representations in a non-sequential way.
The exponential gating with memory mixing introduced in the xLSTM paper directly addresses this need. Here’s how:
Exponential gating allows the model to strongly update or forget the contents of each memory cell based on the input. This is more powerful than the sigmoid gating in vanilla LSTMs, whose gates are bounded between 0 and 1. It means the model can decisively revise its stored knowledge as needed, rather than being constrained to incremental changes. This flexibility is crucial for reasoning, as it allows the model to rapidly adapt its latent state based on new information.
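Because a raw exp() gate can overflow, the paper stabilizes the gates by tracking a running maximum in log-space and rescaling both gates by it. A small numpy sketch of that trick (variable names are mine):

```python
import numpy as np

def stabilized_exp_gates(i_pre, f_pre, m_prev):
    """Exponential input/forget gates with log-space stabilization.

    i_pre, f_pre: gate pre-activations; m_prev: previous stabilizer state.
    Dividing both gates by exp(m) keeps values in a safe numeric range
    without changing the normalized cell output. (Sketch; names are mine.)
    """
    m = np.maximum(f_pre + m_prev, i_pre)   # new stabilizer state (running max)
    i_gate = np.exp(i_pre - m)              # stabilized input gate, <= 1
    f_gate = np.exp(f_pre + m_prev - m)     # stabilized forget gate, <= 1
    return i_gate, f_gate, m
```

Even with a pre-activation of 100, which would overflow a naive exp(), the stabilized gates stay finite.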
Memory mixing means that each memory cell is updated using a weighted combination of the previous values of all cells. This allows information to flow and be integrated between cells in a non-sequential way. Essentially, it relaxes the sequential constraint of traditional RNNs and allows for a more flexible, graph-like computation over the latent space.
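Concretely, the mixing lives in the recurrent matrices: each gate's pre-activation depends on the previous hidden states of all cells, not just its own. A minimal numpy sketch of one sLSTM step, assuming dense mixing matrices (the paper restricts them to block-diagonal per head; shapes and names here are my own, and gates are left unstabilized for brevity):

```python
import numpy as np

def slstm_step(c, n, h, x, W, R, b):
    """One sLSTM step with memory mixing.

    c, n, h: cell, normalizer, and hidden states, each (d,); x: input (d_in,).
    W, R, b: per-gate input weights, recurrent (mixing) weights, and biases.
    The R[g] @ h terms are the memory mixing: every cell's gates see every
    cell's previous hidden state. (Sketch; unstabilized exponential gates.)
    """
    pre = {g: W[g] @ x + R[g] @ h + b[g] for g in ("i", "f", "z", "o")}
    i = np.exp(pre["i"])             # exponential input gate
    f = np.exp(pre["f"])             # exponential forget gate
    z = np.tanh(pre["z"])            # candidate cell input
    o = 1 / (1 + np.exp(-pre["o"]))  # sigmoid output gate
    c = f * c + i * z                # cell update
    n = f * n + i                    # normalizer update
    h = o * (c / n)                  # normalized hidden state
    return c, n, h
```

Setting every R[g] to zero recovers an LSTM variant with no mixing, which makes the structural role of those matrices easy to see.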
Together, these two components endow the xLSTM with a dynamic, updateable memory that can be accessed and manipulated “outside” the main token-by-token processing flow. The model can compress information into this memory, “reason” over it by mixing and gating cells, then produce outputs guided by the updated memory state.
In this way, the xLSTM takes a significant step towards the kind of “reasoning outside token-space” that I suggested would be important for highly capable models. The memory acts as a workspace for flexible computation that isn’t strictly tied to the input token sequence.
Now, this doesn’t mean the xLSTM is doing all the kinds of reasoning we might eventually want from an advanced AI system. But it demonstrates a powerful architecture for models to store and manipulate information in a latent space, at a more abstract level than individual tokens. As we scale up this approach, we can expect models to perform more and more “reasoning” in this compressed space rather than via explicit token-level computation.