Here’s intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there’s multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.
That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically “what’s your relative contribution to good subcircuits vs. bad subcircuits?”
Plausible.
Here’s intuition pump to consider: suppose our net is a complete multigraph: not only is there an edge between every pair of nodes, there’s multiple edges with base-2-exponentially-spaced weights, so we can always pick out a subset of them to get any total weight we please between the two nodes. Clearly, masking can turn this into any circuit we please (with the same number of nodes). But it seems wrong to say that this initial circuit has anything useful in it at all.
That seems right, but also reminds me of the point that you need to randomly initialize your neural nets for gradient descent to work (because otherwise the gradients everywhere are the same). Like, in the randomly initialized net, each edge is going to be part of many subcircuits, both good and bad, and the gradient is basically “what’s your relative contribution to good subcircuits vs. bad subcircuits?”