[Note] On self-repair in LLMs
A collection of empirical evidence
Do language models exhibit self-repair?
One notion of self-repair is redundancy: having “backup” components which do the same thing, should the original component fail for some reason. Some examples:
In the IOI circuit in GPT-2 small, there are primary “name mover heads” but also “backup name mover heads” which fire if the primary name movers are ablated. This is partially explained via copy suppression. (A rough ablation sketch follows this list.)
More generally, the Hydra effect: ablating one attention head leads to other attention heads compensating for it.
Some other mechanisms for self-repair include “LayerNorm scaling” and “anti-erasure”, as described in Rushing and Nanda, 2024.
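A minimal, untested sketch of the kind of experiment behind these observations, using TransformerLens: zero-ablate one primary name mover head on an IOI-style prompt and check how little the logit difference drops. The head index (9.9 as a name mover) comes from the IOI paper; the prompt and the exact measurement are illustrative.

```python
# Sketch (untested): zero-ablate a primary name mover head in GPT-2 small on an
# IOI-style prompt and compare the logit difference before and after.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
io_tok = model.to_single_token(" Mary")   # indirect object (correct answer)
s_tok = model.to_single_token(" John")    # subject (incorrect answer)

def logit_diff(logits):
    # Logit difference at the final position: correct name minus incorrect name.
    return (logits[0, -1, io_tok] - logits[0, -1, s_tok]).item()

clean_logits = model(tokens)
print("clean logit diff:", logit_diff(clean_logits))

def ablate_head(z, hook, head=9):
    # blocks.9.attn.hook_z has shape [batch, pos, head_index, d_head]; zero out one head.
    z[:, :, head, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.9.attn.hook_z", ablate_head)],
)
print("logit diff with name mover 9.9 ablated:", logit_diff(ablated_logits))
# If backup name movers compensate, the drop in logit diff is much smaller than
# head 9.9's direct contribution would suggest.
```

A fuller version would also measure each backup head’s direct contribution to the logit difference before and after the ablation, which is where the compensation shows up most clearly.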
Another notion of self-repair is “regulation”: suppressing an overstimulated component.
“Entropy neurons” reduce the model’s confidence by squeezing the logit distribution. (A rough probe for this is sketched after this list.)
“Token prediction neurons” function similarly.
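A rough, untested probe for this kind of regulation: zero-ablate a single final-layer MLP neuron and watch the entropy of the next-token distribution. The neuron index below is a placeholder, not a known entropy neuron; in practice one would sweep over all neurons and also check that the ranking of the top tokens stays roughly fixed.

```python
# Sketch (untested): compare the entropy of GPT-2 small's next-token distribution
# with and without zero-ablating one candidate final-layer MLP neuron.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy")

def entropy(logits):
    # Entropy (in nats) of the next-token distribution at the final position.
    logprobs = torch.log_softmax(logits[0, -1], dim=-1)
    return -(logprobs.exp() * logprobs).sum().item()

NEURON = 584  # placeholder index; sweep over all neurons in practice

def ablate_neuron(post, hook):
    # blocks.11.mlp.hook_post has shape [batch, pos, d_mlp]
    post[:, :, NEURON] = 0.0
    return post

clean = model(tokens)
ablated = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.11.mlp.hook_post", ablate_neuron)]
)
print(f"entropy: clean={entropy(clean):.3f}, ablated={entropy(ablated):.3f}")
# An entropy-neuron-like unit is one whose ablation noticeably shifts this entropy
# while leaving the ranking of the top tokens mostly unchanged.
```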
A third notion of self-repair is “error correction”.
Toy Models of Superposition suggests that NNs use the ReLU nonlinearity to suppress small errors in computation (see the sketch after this list).
Error correction is predicted by Computation in Superposition.
Empirically, it has been found that models tolerate errors well along certain directions in activation space.
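A toy illustration (my own construction, not code from either paper) of the ReLU-plus-negative-bias mechanism: features stored in superposition produce small interference terms when read back out, and a negative bias followed by ReLU clips most of them to exactly zero while the genuinely active feature survives.

```python
# Minimal illustration of error correction via ReLU and a negative bias.
import torch

torch.manual_seed(0)
d_hidden, n_features = 100, 400
# Random feature directions crammed into a lower-dimensional space (superposition).
W = torch.nn.functional.normalize(torch.randn(n_features, d_hidden), dim=-1)
bias = -0.2  # negative bias, analogous to what the toy models learn

active = 7
h = W[active]  # hidden state with only feature 7 active, at strength 1.0

readout = torch.relu(W @ h + bias)  # try to read every feature back out
spurious = readout[torch.arange(n_features) != active]
print("active feature readout:", round(readout[active].item(), 3))
print("fraction of spurious readouts clipped to zero:",
      round((spurious == 0).float().mean().item(), 3))
# The interference terms (W[i] @ W[active] for i != active) are mostly small, so
# the negative bias pushes them below zero and ReLU clips them away, while the
# active feature (dot product 1.0) comes through. Small errors along off-feature
# directions get eaten; that is the error-correction story.
```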
Self-repair is annoying from the interpretability perspective.
It creates an interpretability illusion: a component may actually be playing an important role in a task, but because other components compensate when it is ablated, activation patching shows an abnormally small effect.
A related thought: Grokked models probably do not exhibit self-repair.
In the “circuit cleanup” phase of grokking, redundant circuits are removed because the L2 weight penalty incentivizes the model to shed unused parameters.
I expect regulation not to occur either, because there is always a single correct answer; hence a model that predicts this answer is incentivized to be as confident as possible.
Error correction still probably does occur, because it is largely a consequence of superposition.
Taken together, I guess this means that self-repair is a coping mechanism for the “noisiness” / “messiness” of real data like language.
It would be interesting to study whether introducing noise into synthetic data (that is normally grokkable by models) also breaks grokking (and thereby induces self-repair); a rough setup for this experiment is sketched below.
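A rough sketch of how that experiment could be set up: modular addition (the standard grokking task) with a knob for randomizing a fraction of the training labels. The architecture and hyperparameters are illustrative guesses, not tuned; the interesting part would be probing the resulting (grokked or non-grokked) network for redundancy afterwards.

```python
# Sketch (untested, hyperparameters illustrative): modular addition with a
# noise_frac knob that corrupts a fraction of the training labels.
import torch
import torch.nn as nn

P, NOISE_FRAC, TRAIN_FRAC = 97, 0.1, 0.3
torch.manual_seed(0)

# Full dataset: all pairs (a, b) with label (a + b) mod P.
a, b = torch.cartesian_prod(torch.arange(P), torch.arange(P)).T
labels = (a + b) % P
perm = torch.randperm(P * P)
n_train = int(TRAIN_FRAC * P * P)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Corrupt a fraction of the *training* labels with uniform noise.
train_labels = labels[train_idx].clone()
noisy = torch.rand(n_train) < NOISE_FRAC
train_labels[noisy] = torch.randint(0, P, (int(noisy.sum()),))

def one_hot_pair(aa, bb):
    return torch.cat([nn.functional.one_hot(aa, P),
                      nn.functional.one_hot(bb, P)], dim=-1).float()

x_train = one_hot_pair(a[train_idx], b[train_idx])
x_test, y_test = one_hot_pair(a[test_idx], b[test_idx]), labels[test_idx]

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # strong L2, as in grokking setups

for step in range(50_000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x_train), train_labels)
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            test_acc = (model(x_test).argmax(-1) == y_test).float().mean().item()
        print(f"step {step:>6}  train loss {loss.item():.3f}  test acc {test_acc:.3f}")
# After (if) test accuracy jumps, one could probe for redundancy, e.g. by ablating
# individual neurons and checking whether other components compensate.
```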
It’s a fascinating phenomenon. If I had to bet I would say it isn’t a coping mechanism but rather a particular manifestation of a deeper inductive bias of the learning process.
That’s a really interesting blogpost, thanks for sharing! I skimmed it but I didn’t really grasp the point you were making here. Can you explain what you think specifically causes self-repair?
I think self-repair might have lower free energy, in the sense that if you had two configurations of the weights which “compute the same thing”, but one of them has self-repair for a given behaviour and one doesn’t, then the one with self-repair will have lower free energy (which is just a way of saying that if you integrate the Bayesian posterior in a neighbourhood of each, the one with self-repair gives you a higher number, i.e. it’s preferred).
That intuition is based on some understanding of what controls the asymptotic (in the dataset size) behaviour of the free energy (which is -log(integral of posterior over region)) and the example in that post. But to be clear it’s just intuition. It should be possible to empirically check this somehow but it hasn’t been done.
Basically the argument is self-repair ⇒ robustness of behaviour to small variations in the weights ⇒ low local learning coefficient ⇒ low free energy ⇒ preferred
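For reference, the free energy in question and its large-n asymptotics from singular learning theory (a standard statement, written here in my own notation rather than quoted from the comment):

```latex
% Free energy of a neighbourhood W of a parameter w^*, and Watanabe's asymptotic
% expansion in the dataset size n. L_n is the empirical negative log-likelihood,
% \varphi a prior, and \lambda(w^*) the local learning coefficient.
\[
  F_n(W) \;=\; -\log \int_{W} e^{-n L_n(w)}\,\varphi(w)\,dw
  \;\approx\; n L_n(w^*) \;+\; \lambda(w^*)\log n .
\]
% Among regions achieving the same loss L_n(w^*), a smaller \lambda (for instance
% because behaviour is robust to perturbations of the weights) means lower free
% energy, hence more posterior mass, hence "preferred".
```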
I think by “specifically” you might be asking for a mechanism which causes the self-repair to develop? I have no idea.