TLW comments on Hypothesis: gradient descent prefers general circuits

TLW 14 Feb 2022 2:10 UTC
2 points
Hm. This may be a case where this domain is very different than the one I know, and my heuristics are all wrong.
In RTL I can see incrementally moving from one implementation of an algorithm to another implementation of the same algorithm, sure. I don’t see how you could incrementally move from, say, a large cascaded-adder multiplier to a Karatsuba multiplier, without passing through an intermediate state of higher loss.
In other words:
For a model with many parameters, there are unlikely to be many places where all the derivatives of the loss wrt all the parameters are zero.
This is a strong assertion. I see the statistical argument for it when the parameters are independent, but the very fact that you’ve been training the network rather implies the parameters are no longer independent.
The connectivity of the transistors in a storage array of a 1MiB SRAM has millions^[1] of parameters. Nevertheless, changing any one connection decreases the fitness. (And hence the derivative is ~0, with a negative 2nd derivative.) Changing a bunch of connections—say, by adding a row and removing a column—may improve fitness. But there’s a definite activation energy there.
1. ^
  Or, depending on how exactly you count, significantly more than millions of parameters.
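The "activation energy" point can be illustrated with a deliberately artificial sketch. The `toy_loss` below is a hypothetical "trap" function over a handful of bits, not a model of an SRAM or of a neural network; the names `toy_loss`, `flip`, and `best_loss_within` are made up for this example. It only shows that a configuration can be worse under every small change and still be beaten by a large coordinated change.

```python
from itertools import combinations


def toy_loss(bits):
    """Hypothetical trap loss (an assumption for illustration only).

    All ones is the global optimum (loss 0). Every other configuration
    gets worse as more bits are set, so all zeros is a local optimum.
    """
    ones = sum(bits)
    if ones == len(bits):
        return 0.0
    return 1.0 + ones


def flip(bits, idxs):
    """Return a copy of bits with the positions in idxs inverted."""
    out = list(bits)
    for i in idxs:
        out[i] = 1 - out[i]
    return tuple(out)


def best_loss_within(bits, max_flips):
    """Lowest loss reachable by flipping at most max_flips bits at once."""
    best = toy_loss(bits)
    for r in range(1, max_flips + 1):
        for idxs in combinations(range(len(bits)), r):
            best = min(best, toy_loss(flip(bits, idxs)))
    return best


start = (0,) * 6
single_flip_losses = [toy_loss(flip(start, (i,))) for i in range(len(start))]

print("loss at start:", toy_loss(start))
print("losses after any single flip:", single_flip_losses)
print("best loss with up to 5 simultaneous flips:", best_loss_within(start, 5))
print("best loss with up to 6 simultaneous flips:", best_loss_within(start, 6))
```

Whether trained networks contain barriers of this kind is exactly what is in question here; the sketch only makes the shape of the worry concrete.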
- peterbarnett 14 Feb 2022 10:08 UTC
  2 points
  Parent
  Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
  As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms and up-regulate one, and down-regulate the other:
  $a \cdot \mathrm{cascade\_multiplier}(x, y) + (1 - a) \cdot \mathrm{Karatsuba}(x, y)$
  And $a$ is changed from 1 to 0.
  The network needs to be large enough so the algorithms don’t share parameters, so changing one doesn’t affect the performance of the other. I do think this is just a toy example, and it seems pretty unlikely to spontaneously have two multiplication algorithms develop simultaneously.
  - TLW 15 Feb 2022 3:06 UTC
    2 points
    Parent
    As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms and up-regulate one, and down-regulate the other:
    Sure, but that’s not moving from A to B. That’s pruning from A+B to B. …which, now that I think about it, is effectively just a restatement of the Lottery Ticket Hypothesis^[1].
    Hm. I wonder if the Lottery Ticket Hypothesis holds for grok’d networks?
    1. ^
      https://arxiv.org/abs/1803.03635v1 etc.
  - TLW 15 Feb 2022 3:08 UTC
    1 point
    Parent
    Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
    I can see how the initial parameters are independent. After a significant amount of training though...?
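The linear-combination toy example from this thread can be sketched in plain Python. This is a minimal sketch under strong assumptions: both multipliers already exist, are exact, and share no parameters, and the only "parameter" being moved is the mixing weight `a`. The helper names `blend` and `loss` are hypothetical, and the shift-and-add routine is only a software stand-in for a cascaded-adder multiplier. `Fraction` keeps the arithmetic exact, so no rounding noise shows up in the loss.

```python
from fractions import Fraction
import random


def cascade_multiplier(x, y):
    """Shift-and-add product, standing in for a cascaded-adder multiplier."""
    total, shift = 0, 0
    while y:
        if y & 1:
            total += x << shift
        y >>= 1
        shift += 1
    return total


def karatsuba(x, y):
    """Karatsuba product of two non-negative integers."""
    if x < 16 or y < 16:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    xh, xl = x >> half, x & mask
    yh, yl = y >> half, y & mask
    hi = karatsuba(xh, yh)
    lo = karatsuba(xl, yl)
    mid = karatsuba(xh + xl, yh + yl) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo


def blend(a, x, y):
    """a * cascade_multiplier(x, y) + (1 - a) * Karatsuba(x, y)."""
    return a * cascade_multiplier(x, y) + (1 - a) * karatsuba(x, y)


def loss(a, samples):
    """Mean squared error of the blended output against the true product."""
    return sum((blend(a, x, y) - x * y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)
samples = [(rng.randrange(1 << 32), rng.randrange(1 << 32)) for _ in range(50)]

# Sweep a from 1 (all cascade) to 0 (all Karatsuba) in steps of 1/10.
alphas = [Fraction(k, 10) for k in range(10, -1, -1)]
losses = [loss(a, samples) for a in alphas]
print("max loss along the path:", max(losses))
```

Because both branches return the same product, the loss is flat along the whole sweep, so there is no barrier on this particular path. The sketch starts with both algorithms fully formed, though, so it illustrates the "pruning from A+B to B" reading rather than a network discovering Karatsuba from a cascade multiplier.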