Swap and Scale
Produced as part of the SERI MATS program under John Wentworth
Thanks to Garrett Barker for pointing out that you can think of the Swap Symmetry as relabelling, and to Matthias G. Mayer for catching my mistake of not multiplying the bias in the Scaling Symmetry.
Here are two ways you can change the weights of any network using ReLUs without changing the behaviour of the network. The first is a discrete symmetry and involves swapping nodes around; the second is a continuous symmetry and involves scaling the weights. (Astute readers will notice that this is where the title comes from.)
I’m currently investigating the behaviour of non-trivial symmetries in networks but that is going to be a much longer post and probably a month or two away. I thought describing the trivial ones would make for a fun bite-sized post to write on a Friday afternoon.
Trivial Symmetries
There are two types of symmetries that occur for any point in parameter space. I refer to these as the trivial symmetries. I’m unsure who the first person to point these out for ReLU neural networks was, but equivalent symmetries for networks using tanh activation functions were mentioned in Bishop’s 2006 “Pattern Recognition and Machine Learning”, which cites “Functionally Equivalent Feedforward Neural Networks” (Kurkova and Kainen, 1994) as proving the general case. In terms of Machine Learning literature this is ancient, and I suspect the authors haven’t even heard of LessWrong.
Swap Symmetry
For any hidden layer, we can permute the weights and biases in such a way that the output to the next layer remains unchanged. Specifically, we apply a permutation to the rows of the layer’s incoming weight matrix and to its biases, then apply the inverse permutation to the columns of the next layer’s weight matrix. The second permutation exactly undoes the reordering introduced by the first, so every later layer sees the same input as before.
In the simplest example where our layer only has two nodes, we can imagine swapping two neurons around whilst retaining their original connections.
It may also be helpful to think of this symmetry as simply relabelling the nodes in a single layer. For every hidden layer, there are N! such symmetries, where N is the number of nodes in the layer.
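Here is a minimal numerical sketch of the swap symmetry, assuming a single-hidden-layer ReLU network written as W2 @ relu(W1 @ x + b1) + b2 (the variable names and sizes are just illustrative, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ReLU network with one hidden layer of 4 units:
# output = W2 @ relu(W1 @ x + b1) + b2
n_in, n_hidden, n_out = 3, 4, 2
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# Pick a random permutation of the hidden units.
perm = rng.permutation(n_hidden)

# Permute the rows of the incoming weight matrix and the biases,
# and apply the inverse permutation to the columns of the outgoing
# weight matrix (indexing columns by `perm` is exactly W2 @ P^{-1}).
W1_p = W1[perm]
b1_p = b1[perm]
W2_p = W2[:, perm]

x = rng.normal(size=n_in)
print(forward(W1, b1, W2, b2, x))
print(forward(W1_p, b1_p, W2_p, b2, x))  # identical output
```

Both print statements give the same vector, since the relabelled hidden units still feed into the same outgoing weights they started with.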
Scaling Symmetry
The second symmetry occurs when we multiply the weights leading into a node and its bias by some positive real number α, and multiply the weights leading out of the node by 1/α. This leaves the network’s output unchanged, because ReLU is positively homogeneous: ReLU(αx) = α·ReLU(x) for any α > 0, so the factor of α introduced before the activation is exactly cancelled by the factor of 1/α after it.
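A matching sketch for the scaling symmetry, under the same assumed single-hidden-layer setup as above: scaling one hidden unit’s incoming weights and bias by α and its outgoing weights by 1/α leaves the output unchanged (up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_hidden, n_out = 3, 4, 2
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# Scale hidden unit 0 by alpha > 0: multiply its incoming weights and
# bias by alpha, and its outgoing weights by 1/alpha.
alpha = 2.5
W1_s, b1_s, W2_s = W1.copy(), b1.copy(), W2.copy()
W1_s[0] *= alpha
b1_s[0] *= alpha
W2_s[:, 0] /= alpha

x = rng.normal(size=n_in)
print(forward(W1, b1, W2, b2, x))
print(forward(W1_s, b1_s, W2_s, b2, x))  # identical up to float error
```

Note that α must be positive: a negative α would flip the sign of the pre-activation, and ReLU does not commute with sign flips.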
Comments
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem to be several years ahead of this.
I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.
Opinion:
Symmetries in NNs are a mainstream ML research area with lots of papers, and I don’t think doing research “from first principles” here will be productive. This also holds for many other alignment projects.
However, I do think it makes sense as an alignment-positive research direction in general.
Thank you, I hadn’t seen those papers; they are both fantastic.
See also A Primer on Matrix Calculus, Part 2: Jacobians and other fun by Matthew Barnett.