Fascinating work!
As a potential follow-up direction[1], I’m curious what might happen if you trained on a mixture of tasks, all expressed in similar notation, but not all requiring backward chaining[2].
Iterative algorithms in feedforward NNs have the interesting property that the NN can potentially “spend” a very large fraction of its weights on the algorithm, and will only reach peak performance on the task solved by the algorithm when it is doing this. If you have some layers that don’t implement the iterative step, you’re missing out on the ability to solve certain deeper task instances; on the other hand, if you do implement the step in every single layer, a large fraction of your weights (and possibly your runtime compute budget) is allocated to this one task, and the presence of other tasks would push back against doing this.
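Just to make the layer-budget point concrete, here is a toy sketch (plain Python, not a claim about the trained model’s actual circuit) in which each “layer” performs exactly one backward-chaining step, so a stack of k such layers can only recover paths of depth at most k:

```python
# Toy model of the layer budget: each "layer" extends a partial backward
# chain by exactly one edge, so `num_layers` layers can only solve instances
# of depth <= num_layers. Purely illustrative; assumes `start` is an
# ancestor of `goal` in the tree.

def backward_chain_fixed_depth(parent, start, goal, num_layers):
    """Follow parent pointers from `goal` toward `start`, one hop per "layer".

    Returns the path start -> ... -> goal if it is reachable within
    `num_layers` hops, else None.
    """
    path = [goal]
    for _ in range(num_layers):        # one iterative step per layer
        if path[-1] == start:
            break
        path.append(parent[path[-1]])
    return list(reversed(path)) if path[-1] == start else None

# Depth-3 instance: path 0 -> 1 -> 2 -> 3, query (start=0, goal=3).
parent = {3: 2, 2: 1, 1: 0}
print(backward_chain_fixed_depth(parent, 0, 3, num_layers=2))  # None: too deep
print(backward_chain_fixed_depth(parent, 0, 3, num_layers=3))  # [0, 1, 2, 3]
```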
So I’m curious how a transformer might do this budgeting across tasks in practice. For instance:
What happens to the iterative algorithm as we increase the amount of competition from other tasks?
Would we see a series of discrete transitions, where the model gives up on solving instances of depth N but still solves depth N−1 ~perfectly, and likewise for N−2, etc.?
Or, would the model use superposition to combine the iterative algorithm with computations for other tasks, in such a way that the iterative algorithm gets noisier and more error-prone without changing the max depth it can solve?
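For concreteness, here is a minimal sketch of how those two outcomes would show up in per-depth accuracy curves measured at several mixture proportions. The accuracy numbers below are invented purely to illustrate the two signatures, not real results:

```python
# Illustration only: invented per-depth accuracies showing what the two
# hypotheses would look like; in a real experiment these would come from
# evaluating models trained at each mixture proportion.

def max_reliable_depth(acc_by_depth, threshold=0.75):
    """Largest depth solved at >= threshold accuracy (0 if none)."""
    solved = [d for d, acc in acc_by_depth.items() if acc >= threshold]
    return max(solved, default=0)

# Hypothesis A (discrete transitions): the deepest reliably solved depth
# drops in integer steps as the other-task fraction grows.
discrete = {
    0.0: {1: 1.00, 2: 1.00, 3: 1.00, 4: 0.99},
    0.5: {1: 1.00, 2: 1.00, 3: 0.99, 4: 0.40},
    0.9: {1: 1.00, 2: 0.99, 3: 0.35, 4: 0.20},
}

# Hypothesis B (superposition): every depth gets uniformly noisier, but the
# maximum depth the circuit handles stays the same.
superposed = {
    0.0: {1: 1.00, 2: 1.00, 3: 0.99, 4: 0.98},
    0.5: {1: 0.93, 2: 0.92, 3: 0.91, 4: 0.90},
    0.9: {1: 0.81, 2: 0.80, 3: 0.79, 4: 0.78},
}

for name, results in [("discrete", discrete), ("superposed", superposed)]:
    print(name, {frac: max_reliable_depth(acc) for frac, acc in results.items()})
# discrete   -> max depth falls: {0.0: 4, 0.5: 3, 0.9: 2}
# superposed -> max depth fixed: {0.0: 4, 0.5: 4, 0.9: 4}, accuracy just degrades
```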
How are the task-specific algorithms, and the computation that “figures out the current task,” laid out across the depth of the model?
For example, we could imagine a model that allocates 1 or more layers at the start to “figuring out the current task,” and only starts backward chaining (if appropriate) after these layers. But of course this would reduce the max depth it can solve.
Or, we could imagine a model that starts backward chaining immediately, in parallel with “figuring out the current task,” and suppresses backward chaining later on if it turns out it’s not needed.
Or, a model that does backward chaining in parallel with other tasks the whole way through, and then reads off the answer to the current task at the very end.
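One hedged way to tell these layouts apart empirically: score every layer for “backward-chaining activity” on prompts where chaining is and isn’t the right strategy, and compare the two per-layer profiles. In the sketch below, `chain_score` and `get_layer_activations` are placeholders for whatever probe (say, a linear probe for the next node on the goal path, or an attention-pattern statistic) and activation-caching hook the actual setup provides:

```python
from typing import Callable, Sequence

def layer_profiles(
    chaining_prompts: Sequence[str],
    other_task_prompts: Sequence[str],
    get_layer_activations: Callable[[str], list],  # prompt -> [activation per layer]
    chain_score: Callable[[object], float],        # one layer's activation -> score
):
    """Mean backward-chaining score per layer, on chaining vs. other-task prompts."""
    def profile(prompts):
        scores = [[chain_score(act) for act in get_layer_activations(p)] for p in prompts]
        n_layers = len(scores[0])
        return [sum(s[i] for s in scores) / len(scores) for i in range(n_layers)]
    return profile(chaining_prompts), profile(other_task_prompts)

# Informally, the three layouts predict different profile pairs:
#  * task-ID layers first: near-zero scores in the earliest layer(s) on both
#    prompt sets, rising afterwards only on the chaining prompts;
#  * chain immediately, suppress later: scores rise from layer 0 on BOTH sets,
#    then fall on the other-task prompts in later layers;
#  * chain all the way through: high scores on both sets until the final
#    read-out layers.
```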
Obviously these things would depend to some extent on details like model capacity and number of attention heads, as well as the nature of the tasks in the mixture.
In part I’m curious about this because it might shed light on where to look for iterative computation inside real, powerful LLMs. For instance, if we observed the “start backward chaining immediately, suppress later if not needed” pattern, that might suggest looking at very early layers of powerful models.
[1] Since you’re looking for collaborators, I should clarify that I’m not currently interested in doing this kind of experiment myself.
[2] I include the “similar notation” requirement so that the model has to do some work to figure out whether or not backward chaining is appropriate—it can’t just learn a rule as simple as “start backward chaining immediately at positions containing certain tokens used to encode the path finding task, otherwise don’t do backward chaining.”
One way to ensure that the other tasks are sufficiently difficult but mostly “non-iterative” would be to just use a mixture of [the path finding problem] and [ordinary language modeling on a diverse corpus], except with the language modeling data transformed somehow to satisfy the “similar notation” requirement. Maybe replace common punctuation and whitespace tokens with notational tokens from the path finding task (or equivalently, notate the path finding task using punctuation and whitespace)?
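For what it’s worth, that remapping could be as simple as a character-level substitution. The notational tokens below (“>”, “|”) are made-up placeholders and would need to match whatever encoding the path finding dataset actually uses:

```python
# Hypothetical notation-sharing transform: remap common punctuation and
# whitespace in ordinary text onto (placeholder) structural tokens from the
# path finding task, so both tasks are "expressed in similar notation".
PUNCT_TO_NOTATION = {
    ",": ">",   # e.g. whatever separates edges in the graph description
    ".": "|",   # e.g. whatever delimits the edge list from the query
    ";": ">",
    ":": "|",
    "\n": "|",
}

def share_notation(text: str) -> str:
    return "".join(PUNCT_TO_NOTATION.get(ch, ch) for ch in text)

print(share_notation("The cat sat, then left.\nIt came back later."))
# -> "The cat sat> then left||It came back later|"
```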