Do we expect future model architectures to be biased toward out-of-context reasoning (reasoning internally rather than in a chain-of-thought)? As in, what kinds of capabilities would lead companies to build models that reason less and less in token-space?
I mean, the first obvious thing would be that you are training the model to internalize some of the reasoning rather than having to pay for the additional tokens each time you want to do complex reasoning.
The thing is, I expect we’ll eventually move away from just relying on transformers with scale. So I’m trying to refine my understanding of which capabilities are simply bottlenecked in this paradigm, and which bottlenecks model builders will need to resolve through architectural and algorithmic improvements. (Of course, based on my previous posts, I still think data is a big deal.)
Anyway, this kind of thinking eventually leads to the infohazardous area of, “okay then, what does the true AGI setup look like?” This is really annoying because it has alignment implications. If we start to move increasingly towards models that are reasoning outside of token-space, then alignment becomes harder. So, are there capability bottlenecks that eventually get resolved through something that requires out-of-context reasoning?
So far, it seems like the current paradigm will not be an issue on this front. Keep scaling transformers, and you don’t really get any big changes in the model’s likelihood of using out-of-context reasoning.
This is not limited to out-of-context reasoning. I’m trying to get a better understanding of the (dangerous) properties future models may develop simply as a result of needing to break a capability bottleneck. My worry is that many people end up over-indexing on the current transformer+scale paradigm (which may prove insufficient for ASI), and so don’t work on the right kinds of alignment or governance projects.
---
I’m unsure how big of a deal this architecture will end up being, but the rumoured xLSTM paper just dropped. It seemingly outperforms other models at the same parameter count.
Maybe it ends up just being another drop in the bucket, but I think we will see more attempts in this direction.
Claude summary:
The key points of the paper are:

- The authors introduce exponential gating with memory mixing in the new sLSTM variant. This allows the model to revise storage decisions and solve state-tracking problems, which transformers and state space models without memory mixing cannot do.
- They equip the mLSTM variant with a matrix memory and a covariance update rule, greatly enhancing the storage capacity compared to the scalar memory cell of vanilla LSTMs. Experiments show this matrix memory provides a major boost. (A rough sketch of this update follows the summary below.)
- The sLSTM and mLSTM are integrated into residual blocks to form xLSTM blocks, which are then stacked into deep xLSTM architectures.
- Extensive experiments demonstrate that xLSTMs outperform state-of-the-art transformers, state space models, and other LSTMs/RNNs on language modeling tasks, while also exhibiting strong scaling behavior to larger model sizes.
This work is important because it presents a path forward for scaling LSTMs to billions of parameters and beyond. By overcoming key limitations of vanilla LSTMs—the inability to revise storage, limited storage capacity, and lack of parallelizability—xLSTMs are positioned as a compelling alternative to transformers for large language modeling.
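To make the matrix-memory bullet a bit more concrete, here is a rough numpy sketch of the covariance-style mLSTM update as I understand it. The toy projections, scalar gates, single head, and the missing stabilization trick are all my simplifications, so treat it as an illustration rather than the paper’s exact recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy key/value/hidden dimension

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random stand-ins for learned projections and gate weights.
Wq, Wk, Wv, Wo = (rng.normal(scale=0.3, size=(d, d)) for _ in range(4))
wi, wf = rng.normal(scale=0.3, size=d), rng.normal(scale=0.3, size=d)

def mlstm_step(C, n, x):
    """One simplified mLSTM step: matrix memory C plus a normalizer vector n."""
    q, k, v = Wq @ x, (Wk @ x) / np.sqrt(d), Wv @ x
    i = np.exp(wi @ x)       # exponential input gate (scalar, unstabilized here)
    f = sigmoid(wf @ x)      # forget gate (scalar)
    o = sigmoid(Wo @ x)      # output gate (vector)

    C = f * C + i * np.outer(v, k)           # covariance update: write v under key k
    n = f * n + i * k                        # normalizer accumulates weighted keys
    h = o * (C @ q) / max(abs(n @ q), 1.0)   # read the memory with query q
    return C, n, h

C, n = np.zeros((d, d)), np.zeros(d)
for x in rng.normal(size=(5, d)):            # run a short toy sequence
    C, n, h = mlstm_step(C, n, x)
print(h)
```

The point of the d×d memory is capacity: instead of squeezing everything into a scalar cell, each step can store a whole key-value association and later retrieve it with a query, which is where the claimed storage boost comes from.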
Instead of doing all computation step-by-step as tokens are processed, advanced models might need to store and manipulate information in a compressed latent space, and then “reason” over those latent representations in a non-sequential way.
The exponential gating with memory mixing introduced in the xLSTM paper directly addresses this need. Here’s how:
Exponential gating allows the model to strongly update or forget the contents of each memory cell based on the input. This is more powerful than the standard sigmoid gating in vanilla LSTMs. It means the model can decisively revise its stored knowledge as needed, rather than being constrained to incremental changes. This flexibility is crucial for reasoning, as it allows the model to rapidly adapt its latent state based on new information.
Memory mixing means that each memory cell is updated using a weighted combination of the previous values of all cells. This allows information to flow and be integrated between cells in a non-sequential way. Essentially, it relaxes the sequential constraint of traditional RNNs and allows for a more flexible, graph-like computation over the latent space.
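To make these two mechanisms concrete, here is a toy numpy sketch of an sLSTM-style step: exponential input/forget gates (with the log-domain stabilizer described in the paper) let the cell state be sharply overwritten, and the recurrent matrices mix information across cells through the previous hidden state. The weights, sizes, and the fully dense mixing (the paper restricts mixing to within heads) are my own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # number of memory cells (toy size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input weights W and recurrent weights R for the cell input (z) and gates (i, f, o).
# The R matrices are what "mix" memory: every gate sees all previous cells via h.
W = {g: rng.normal(scale=0.3, size=(d, d)) for g in "zifo"}
R = {g: rng.normal(scale=0.3, size=(d, d)) for g in "zifo"}

def slstm_step(c, n, m, h, x):
    """One simplified sLSTM step: exponential gates plus a log-domain stabilizer."""
    z = np.tanh(W["z"] @ x + R["z"] @ h)    # candidate cell input
    i_log = W["i"] @ x + R["i"] @ h         # log of exponential input gate
    f_log = W["f"] @ x + R["f"] @ h         # log of exponential forget gate
    o = sigmoid(W["o"] @ x + R["o"] @ h)    # output gate

    m_new = np.maximum(f_log + m, i_log)    # stabilizer keeps the exponentials bounded
    i = np.exp(i_log - m_new)               # a large input gate can effectively
    f = np.exp(f_log + m - m_new)           # overwrite what was stored before

    c = f * c + i * z                       # cell state
    n = f * n + i                           # normalizer state
    h = o * (c / n)                         # hidden state, fed back into all gates
    return c, n, m_new, h

c, n = np.zeros(d), np.zeros(d)
m, h = np.full(d, -np.inf), np.zeros(d)
for x in rng.normal(size=(5, d)):           # short toy sequence
    c, n, m, h = slstm_step(c, n, m, h, x)
print(h)
```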
Together, these two components endow the xLSTM with a dynamic, updateable memory that can be accessed and manipulated “outside” the main token-by-token processing flow. The model can compress information into this memory, “reason” over it by mixing and gating cells, then produce outputs guided by the updated memory state.
In this way, the xLSTM takes a significant step towards the kind of “reasoning outside token-space” that I suggested would be important for highly capable models. The memory acts as a workspace for flexible computation that isn’t strictly tied to the input token sequence.
Now, this doesn’t mean the xLSTM is doing all the kinds of reasoning we might eventually want from an advanced AI system. But it demonstrates a powerful architecture for models to store and manipulate information in a latent space, at a more abstract level than individual tokens. As we scale up this approach, we can expect models to perform more and more “reasoning” in this compressed space rather than via explicit token-level computation.
This is an excellent point.
While LLMs seem (relatively) safe, we may very well blow right on by them soon.
I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values). Those would be retained in improved architectures that still predict human language use. If such a system’s thought process was entirely opaque, we could no longer perform externalized reasoning oversight by “reading its thoughts”.
But I think it might be possible to build a reliable agent from unreliable parts. I think humans are such an agent, and evolution made us this way because it’s a way to squeeze extra capability out of a set of base cognitive capacities.
Imagine an agentic set of scaffolding that merely calls the super-LLM for individual cognitive acts. Such an agent would use a hand-coded “System 2” thinking approach to solve problems, like humans do. That involves breaking a problem into cognitive steps. We also use System 2 for our biggest ethical decisions; we predict consequences of our major decisions, and compare them to our goals, including ethical goals. Such a synthetic agent would use System 2 for problem-solving capabilities, and also for checking plans for how well they achieve goals. This would be done for efficiency; spending a lot of compute or external resources on a bad plan would be quite costly. Having implemented it for efficiency, you might as well use it for safety.
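For what it’s worth, here is a minimal sketch of the scaffolding loop I have in mind. Everything in it is hypothetical: call_llm is just a stub for whatever model API you’d actually use, and the decompose/predict/check prompts are placeholders rather than a worked-out design.

```python
from typing import Optional

def call_llm(prompt: str) -> str:
    """Hypothetical stub: one call = one 'cognitive act' by the underlying model."""
    raise NotImplementedError("wire this up to a real model API")

def system2_solve(problem: str, goals: list[str], max_attempts: int = 3) -> Optional[str]:
    """Toy System-2 loop: decompose, plan, predict consequences, check against goals."""
    steps = call_llm(f"Break this problem into discrete cognitive steps:\n{problem}")
    for _ in range(max_attempts):
        plan = call_llm(f"Propose a plan for:\n{problem}\nFollowing these steps:\n{steps}")
        predicted = call_llm(f"Predict the likely consequences of executing:\n{plan}")
        # The same consequence-prediction used for efficiency doubles as the
        # safety check: compare predicted outcomes against all goals before
        # committing compute or external resources to the plan.
        verdict = call_llm(
            "Do these predicted consequences satisfy ALL of the following goals, "
            "including the ethical ones? Answer PASS or FAIL with reasons.\n"
            f"Goals: {goals}\nPredicted consequences: {predicted}"
        )
        if verdict.strip().upper().startswith("PASS"):
            return plan
    return None  # no acceptable plan found; escalate to a human or give up
```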
This is just restating stuff I’ve said elsewhere, but I’m trying to refine the model, and work through how well it might work if you couldn’t apply any external reasoning oversight, and little to no interpretability. It’s definitely bad for the odds of success, but not necessarily crippling. I think.
This needs more thought. I’m working on a post on System 2 alignment, as sketched out briefly (and probably incomprehensibly) above.
Did you mean something different from “AIs understand our intentions” (e.g. maybe you meant that humans can understand the AI’s intentions)?
I think future more powerful AIs will surely be strictly better at understanding what humans intend.
I think future more powerful/useful AIs will understand our intentions better IF they are trained to predict language. Text corpora contain rich semantics about human intentions.
I can imagine other AI systems that are trained differently, and I would be more worried about those.
That’s what I meant by current AI understanding our intentions possibly better than future AI.