This is a very low-quality paper.
Basically, the paper does the following:
A 1-layer LSTM gets inputs of the form [operand 1][operator][operand 2], e.g. 1+2 or 3*5
It is trained (I think with a regression loss? but it’s not clear[1]) to predict the numerical result of the binary operation
The paper proposes an auxiliary loss that is supposed to improve “compositionality.”
As described in the paper, this loss is the average squared difference between successive LSTM hidden states
But in the actual code, what gets penalized is instead the average squared difference between successive input embeddings (see the sketch just after this list)
The paper finds (unsurprisingly) that this extra loss doesn’t help on the main task[2], while making various other errors and infelicities along the way
e.g. there’s train-test leakage, and (hilariously) it doesn’t cite the right source for the LSTM architecture[3]
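To make the discrepancy between the described loss and the implemented loss concrete, here is a minimal sketch of the two penalties; the function names and tensor shapes are my own guesses, not anything taken from the paper’s released code.

```python
# Hedged reconstruction of the two auxiliary losses being conflated; names and
# shapes are illustrative, not the paper's actual identifiers.
import torch

def aux_loss_as_described(hidden: torch.Tensor) -> torch.Tensor:
    # hidden: (batch, seq_len, hidden_dim), the LSTM's per-step hidden states.
    # These do depend on earlier inputs, so penalizing successive differences
    # at least constrains how the sequence is integrated.
    return (hidden[:, 1:] - hidden[:, :-1]).pow(2).mean()

def aux_loss_as_implemented(embedded: torch.Tensor) -> torch.Tensor:
    # embedded: (batch, seq_len, emb_dim), the raw input embeddings. These never
    # see one another, so this term just pulls the embeddings of co-occurring
    # tokens toward each other.
    return (embedded[:, 1:] - embedded[:, :-1]).pow(2).mean()
```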
The theoretical justification presented for the “compositional loss” is very brief and unconvincing. But if I read into it a bit, I can see why it might make sense for the loss described in the paper (on hidden states).
This could regularize an LSTM to produce something closer to a simple sum or average of the input embeddings, which is “compositional” in the sense that inputs for different timesteps might end up in different subspaces. It’s still not very clear why you would want this property (it seems like at this point you’re just saying you don’t really want an LSTM, as this is trying to regularize away some of the LSTM’s flexibility), nor why an LSTM was chosen in the first place (in 2025!), but I can at least see where the idea came from.
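To spell that reading out a little (this is my gloss, not notation from the paper): the described penalty keeps each hidden-state increment small, and by a telescoping identity the final state is then the first state plus a sum of small per-step contributions, which is roughly the additive-pooling picture above.

```latex
% My gloss, not the paper's notation: the penalty on successive hidden states,
% plus the telescoping identity that makes the "sum of small per-step
% contributions" reading explicit.
\[
\mathcal{L}_{\mathrm{aux}} \;=\; \frac{1}{T-1}\sum_{t=2}^{T}\bigl\lVert h_t - h_{t-1}\bigr\rVert_2^2,
\qquad
h_T \;=\; h_1 + \sum_{t=2}^{T}\bigl(h_t - h_{t-1}\bigr).
\]
```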
However, the loss actually used in the code makes no sense at all. The embeddings can’t see one another, and the inputs are sampled independently from one another in data generation, so the code’s auxiliary loss is effectively just trying to make the input embeddings for all vocab tokens closer to one another in L2 norm. This has nothing to do with compositionality, and anyway, I suspect that the rest of the network can always “compensate for it” in principle by scaling up the input weights of the LSTM layer.[4]
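That “compensate for it” point is easy to check concretely. Here is a hedged sketch using a stock nn.LSTM (standard PyTorch, nothing from the paper’s code): shrink the embedding table by a factor α and scale the LSTM’s input-to-hidden weights by 1/α, and the network computes exactly the same function while the embedding-difference penalty shrinks by α².

```python
# Sketch of the compensation argument with a stock nn.LSTM (not the paper's model):
# rescale embeddings down and the input-to-hidden weights up by the same factor,
# and the computed function is unchanged while the embedding penalty shrinks.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim, alpha = 14, 32, 0.125     # alpha chosen as a power of two so the check is exact

emb = nn.Embedding(vocab, dim)
lstm = nn.LSTM(dim, dim, batch_first=True)
tokens = torch.randint(0, vocab, (8, 3))        # [operand][operator][operand]

out_before, _ = lstm(emb(tokens))

with torch.no_grad():
    emb.weight *= alpha                          # embedding-difference penalty drops by alpha^2
    lstm.weight_ih_l0 /= alpha                   # pre-activations W_ih @ (alpha * x) are unchanged

out_after, _ = lstm(emb(tokens))
print(torch.allclose(out_before, out_after, atol=1e-6))  # True: same function, smaller penalty
```

(Whether gradient descent actually finds this rescaling during training is a separate question; that is what footnote [4] is about.)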
If it had actually used the loss on hidden states as described, this would still be a bad paper: it reports a negative result and under-motivates the idea so that it’s not clear why the negative result might be noteworthy. (Plus: LSTM, weird arithmetic regression toy task, etc.)
Once you take into account the nonsensical loss that was actually used, it’s just… nothing. The idea makes no sense, was not motivated at all, was inaccurately described in the paper, and does not work in practice.
To Sakana’s credit, they did document all of these problems in their notes on the paper – although they were less critical than I am. In principle they could have hidden away the code issue rather than mentioning it, and the paper would have seemed less obviously bad… I guess this is a really low bar, but still, it’s something.
[1] The Sakana code review shows that it’s evaluated with a regression loss, but it’s not clear out of context what `criterion` (the training loss function) is, AFAICT.
(Edit after looking over the code again: the model returns a number, and the data generation code also returns the targets as numbers. So the loss function is comparing a number to another one. Unless it’s doing some cursed re-tokenization thing internally, it’s regression.)
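For what it’s worth, the setup that footnote describes would presumably look something like the following; the class, names, and vocab size here are illustrative, not the paper’s actual code.

```python
# Hypothetical scalar-regression setup consistent with the footnote: the model emits
# a number, the targets are numbers, and the criterion compares them directly.
import torch
import torch.nn as nn

class TinyArithmeticRegressor(nn.Module):
    def __init__(self, vocab: int = 14, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 1)             # a single scalar out, not a token distribution

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.emb(tokens))
        return self.head(h[:, -1]).squeeze(-1)    # predict the numeric result of the expression

model = TinyArithmeticRegressor()
tokens = torch.randint(0, 14, (8, 3))             # [operand][operator][operand]
targets = torch.randn(8)                          # numeric results as plain floats
criterion = nn.MSELoss()                          # a regression criterion: number vs. number
loss = criterion(model(tokens), targets)
```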
[2] Note that it never directly tests for “compositionality,” just for test set performance on the main task.
Although in a few places it conflates its so-called “compositional loss” with compositionality itself, e.g. claims that the regularization “effectively enforces compositionality” when in fact the evidence just shows that it decreases the auxiliary loss, which of course it does – that’s what happens when you minimize something, it goes down.
[3] Hochreiter & Schmidhuber 1997 has over 100k citations and is one of the most familiar references in all of ML; you’d think an LLM would have memorized that at least!
[4] Although in practice this is limited by the learning rate and the duration of training, which may explain why the paper got worse main-task performance with lower regularization even though the regularization is “conceptually” a no-op.
The first thing that comes to mind is to ask what proportion of human-generated papers would be any more worthy of publication (since a lot of them are slop), but let’s not forget that publication matters little for catastrophic risk; what would actually be important is getting results.
So I recommend not updating at all on AI risk based on Sakana’s results (or updating negatively if you expected that R&D automation would come faster, or that this might slow down human augmentation).