One important step in replacing shallow circuits with general circuits is removing the shallow circuit. I think many forms of regularization help with this removal step. That’s why weight decay and minibatch SGD are both helpful for grokking.
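As a minimal sketch of how weight decay acts as that removal pressure (my own toy illustration, not anything from the original comment; the sgd_step helper and the numbers are mine): in a plain SGD-with-weight-decay update, a weight that the task gradient no longer defends gets pulled toward zero, while a weight the gradient still supports survives:

import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.1):
    # Standard SGD update with L2 weight decay: each step shrinks every
    # weight a little; only weights backed by the task gradient persist.
    return w - lr * (grad + weight_decay * w)

# Toy setup: w[0] is a "shallow circuit" weight the loss no longer needs,
# w[1] is a "general circuit" weight the loss still rewards.
w = np.array([1.0, 1.0])
for _ in range(500):
    grad = np.array([0.0, -0.2])  # hypothetical gradients, for illustration only
    w = sgd_step(w, grad)
print(w)  # roughly [0.007, 1.99]: the undefended weight has decayed away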
As for finding a problem or architecture where such a combination is difficult, I also don’t know. I suspect that most architectures where it’s difficult to combine shallow circuits are just not effective at learning.
One important step in replacing shallow circuits with general circuits is removing the shallow circuit
Many metals do not coalesce into one single giant crystal, even in geological timescales, even when it is energetically feasible to do so. Why? Because the activation barrier is too high for it to occur in any “reasonable” timeframe—the speed of grain growth decreases exponentially with the activation energy.
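(A quick sketch of that exponential dependence, from standard grain-growth kinetics rather than anything in this thread: the grain-boundary mobility follows an Arrhenius law, M ∝ exp(−Q / RT), where Q is the activation energy, R the gas constant, and T the temperature, so even a modest increase in Q slows coarsening by orders of magnitude.)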
(Remember, the hypothesis here implies that the shallow circuit is currently directly contributing to the accuracy of the output. Removing the shallow circuit carries an accuracy cost, just as removing an atom from a metal grain carries an energy cost.)
I suspect you could draw a fairly clear analogy with thermodynamics and metal grain growth here.
I think there is likely to be a path from a model with shallow circuits to a model with deeper circuits which doesn’t need any ‘activation energy’ (it’s not stuck in a local minimum). For a model with many parameters, there are unlikely to be many places where all the derivatives of the loss with respect to all the parameters are zero. There will almost always be at least one direction to move in which decreases the shallow circuit while increasing the general one, and hence doesn’t really hurt the loss.
Hm. This may be a case where this domain is very different than the one I know, and my heuristics are all wrong.
In RTL I can see incrementally moving from one implementation of an algorithm to another implementation of the same algorithm, sure. I don’t see how you could incrementally move from, say, a large cascaded-adder multiplier to a Karatsuba multiplier, without passing through an intermediate state of higher loss.
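For concreteness, here is a minimal software sketch of the two algorithms (my own illustration; the thread itself is talking about RTL, not Python, and the function names are mine). A shift-and-add loop stands in for the cascaded-adder array, and Karatsuba recursively splits the operands. The point is that they differ in structure, not just in a few coefficients, so there is no obvious continuous deformation from one into the other:

def cascade_multiply(x, y):
    # Shift-and-add multiplication: the software analogue of a cascaded
    # adder array, accumulating one partial product per bit of y.
    acc, shift = 0, 0
    while y:
        if y & 1:
            acc += x << shift
        y >>= 1
        shift += 1
    return acc

def karatsuba(x, y):
    # Karatsuba multiplication: split each operand and recurse with three
    # sub-multiplications instead of four.
    if x < 16 or y < 16:
        return x * y
    n = max(x.bit_length(), y.bit_length()) // 2
    hi_x, lo_x = x >> n, x & ((1 << n) - 1)
    hi_y, lo_y = y >> n, y & ((1 << n) - 1)
    a = karatsuba(hi_x, hi_y)
    b = karatsuba(lo_x, lo_y)
    c = karatsuba(hi_x + lo_x, hi_y + lo_y) - a - b
    return (a << (2 * n)) + (c << n) + b

assert cascade_multiply(1234, 5678) == karatsuba(1234, 5678) == 1234 * 5678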
In other words:
For a model with many parameters, there are unlikely to be many places where all the derivatives of the loss with respect to all the parameters are zero.
This is a strong assertion. I see the statistical argument that this holds for independent parameters, but the very fact that you’ve been training the network rather implies the parameters are no longer independent.
The connectivity of the transistors in the storage array of a 1 MiB SRAM has millions of parameters (or, depending on how exactly you count, significantly more). Nevertheless, changing any one connection decreases the fitness. (And hence the derivative is ~0, with a negative second derivative.) Changing a bunch of connections, say by adding a row and removing a column, may improve fitness. But there’s a definite activation energy there.
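A two-parameter caricature of that kind of landscape (entirely my own toy, not a model of an SRAM): every single change from the current configuration hurts, yet a coordinated change helps, so single-step hill climbing stays stuck:

# Hypothetical fitness over two binary "connections".
fitness = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}

current = (0, 0)
print([fitness[(1, 0)], fitness[(0, 1)]])  # [0.0, 0.0] -- every single flip is worse than 1.0
print(fitness[(1, 1)])                     # 2.0 -- better, but only reachable by changing both at once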
Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms, up-regulating one and down-regulating the other:
a ⋅ cascade_multiplier(x, y) + (1 − a) ⋅ Karatsuba(x, y)
where a is gradually changed from 1 to 0.
The network needs to be large enough that the algorithms don’t share parameters, so changing one doesn’t affect the performance of the other. I do think this is just a toy example; it seems pretty unlikely that two multiplication algorithms would spontaneously develop simultaneously.
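A minimal numerical sketch of that path (my own illustration, with exact stand-ins for the two multipliers): because both algorithms compute the same product, the loss stays at zero for every value of a along the way, so there is no activation barrier on this particular path:

import numpy as np

def cascade_multiplier(x, y):
    return x * y  # stand-in: assume an exact shift-and-add implementation

def karatsuba(x, y):
    return x * y  # stand-in: assume an exact Karatsuba implementation

x, y = 37, 91
target = x * y

# Walk a from 1 (pure cascaded-adder output) to 0 (pure Karatsuba output).
for a in np.linspace(1.0, 0.0, 6):
    out = a * cascade_multiplier(x, y) + (1 - a) * karatsuba(x, y)
    print(f"a = {a:.1f}, loss = {(out - target) ** 2:.1f}")  # loss is 0.0 at every step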
As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms, up-regulating one and down-regulating the other:
Sure, but that’s not moving from A to B. That’s pruning from A+B to B. …which, now that I think about it, is effectively just a restatement of the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635v1).
Hm. I wonder if the Lottery Ticket Hypothesis holds for grok’d networks?
Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
I can see how the initial parameters are independent. After a significant amount of training though...?