More insightful than what is conserved under the scaling symmetry of ReLU networks is what is not conserved: the gradient. Scaling a neuron's incoming weights by α and its outgoing weights by 1/α scales the corresponding gradients by 1/α and α respectively, which means that we can obtain arbitrarily large gradient norms simply by choosing α small enough. And in general bad initializations can induce large imbalances in how quickly the parameters on either side of the neuron learn.
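For concreteness, here is a small PyTorch sketch of this on a toy two-layer ReLU network without biases (the setup and names like `w_in`/`w_out` are just illustrative): the rescaled network computes exactly the same function, but the gradient norms change by the inverse factors.

```python
import torch

torch.manual_seed(0)
x, y = torch.randn(32, 10), torch.randn(32, 1)
w_in = torch.randn(10, 64, requires_grad=True)   # incoming weights of the hidden layer
w_out = torch.randn(64, 1, requires_grad=True)   # outgoing weights

def loss(w_in, w_out):
    return (torch.relu(x @ w_in) @ w_out - y).pow(2).mean()

g_in, g_out = torch.autograd.grad(loss(w_in, w_out), (w_in, w_out))

# Rescale every hidden unit by a small alpha: ReLU is positively homogeneous,
# so the function (and loss) is unchanged, but the gradient w.r.t. the incoming
# weights is multiplied by 1/alpha and the gradient w.r.t. the outgoing weights by alpha.
alpha = 1e-3
w_in2 = (w_in * alpha).detach().requires_grad_(True)
w_out2 = (w_out / alpha).detach().requires_grad_(True)
g_in2, g_out2 = torch.autograd.grad(loss(w_in2, w_out2), (w_in2, w_out2))

print(loss(w_in, w_out).item(), loss(w_in2, w_out2).item())  # same loss (up to float error)
print(g_in.norm().item(), g_in2.norm().item())    # incoming gradient ~1000x larger
print(g_out.norm().item(), g_out2.norm().item())  # outgoing gradient ~1000x smaller
```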
Some time ago I tried training some networks while setting the scale factors of these symmetries to the values that minimize the total gradient norm, effectively distributing the gradient norm as equally as possible throughout the network. This significantly accelerated learning, and allowed extremely deep (100+ layers) networks to be trained without residual connections. It isn't that useful for modern networks, because batchnorm/layernorm seems to do effectively the same thing, and unlike this trick it isn't dependent on having ReLU as the activation function.
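A minimal sketch of one way to implement this for a single ReLU layer (PyTorch, ignoring biases; the function and variable names are made up for illustration, not the original code): choosing the per-unit scale factor that minimizes the rescaled gradient norm works out to equalizing the incoming and outgoing gradient norms of each unit.

```python
import torch

@torch.no_grad()
def rebalance_(w_in, w_out, g_in, g_out, eps=1e-12):
    """Per-unit rescaling that leaves the network's function unchanged.

    w_in: (d, h) incoming weights (columns = hidden units); w_out: (h, k) outgoing
    weights (rows = hidden units); g_in, g_out: gradients of the loss w.r.t. them.

    Scaling unit j's column of w_in by alpha_j and its row of w_out by 1/alpha_j
    turns its squared gradient norm into ||g_in_j||^2/alpha_j^2 + alpha_j^2*||g_out_j||^2,
    which is minimized at alpha_j = sqrt(||g_in_j|| / ||g_out_j||), i.e. where the
    incoming and outgoing gradient norms become equal.
    """
    alpha = ((g_in.norm(dim=0) + eps) / (g_out.norm(dim=1) + eps)).sqrt()
    w_in.mul_(alpha)            # scale column j by alpha_j
    w_out.div_(alpha[:, None])  # scale row j by 1/alpha_j
    # Gradients must be recomputed (or rescaled by 1/alpha and alpha)
    # before taking an optimizer step with the rebalanced weights.
```

Applied to every hidden layer, with gradients recomputed afterwards, this spreads the gradient norm much more evenly through the network, which is the effect described above.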
Thus, the γ value is a “conserved quantity” associated with the symmetry: it is preserved under gradient descent. If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the γ value will still be conserved under gradient descent so long as we’re inside that region.
Minor detail, but the conservation is only approximate in practice, because we are doing gradient descent with a non-zero learning rate: there will be some diffusion between different hyperbolas in weight space as we take gradient steps of finite size.
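A quick numerical check of both points, assuming γ here is the standard per-unit conserved quantity for this symmetry, the difference of squared incoming and outgoing weight norms (the toy setup and names are illustrative):

```python
import torch

def max_gamma_drift(lr, steps=1000):
    torch.manual_seed(0)
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    w_in = (torch.randn(10, 32) / 10**0.5).requires_grad_(True)
    w_out = (torch.randn(32, 1) / 32**0.5).requires_grad_(True)

    def gamma():  # per-unit ||w_in_j||^2 - ||w_out_j||^2 (assumed definition of γ)
        with torch.no_grad():
            return (w_in**2).sum(0) - (w_out**2).sum(1)

    gamma0 = gamma()
    for _ in range(steps):
        loss = (torch.relu(x @ w_in) @ w_out - y).pow(2).mean()
        g_in, g_out = torch.autograd.grad(loss, (w_in, w_out))
        with torch.no_grad():  # one step of plain full-batch gradient descent
            w_in -= lr * g_in
            w_out -= lr * g_out
    return (gamma() - gamma0).abs().max().item()

for lr in (1e-2, 3e-3, 1e-3):
    print(lr, max_gamma_drift(lr))  # drift off the initial hyperbolas shrinks with lr
```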
The learning rates used in modern optimisers are so large that the piecewise-linear loss landscape is effectively indistinguishable from a smooth function. The lr you’d need to ensure that the next step lands in the same linear patch is ridiculously small, so in practice the true “felt” landscape is something like a smoothed average of the exact landscape.
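A rough way to see this (a sketch with an assumed toy setup): count how many ReLU pre-activation signs on a batch flip after a single full-batch gradient step; zero flips would mean the step stayed inside the current linear patch.

```python
import torch

torch.manual_seed(0)
x, y = torch.randn(256, 10), torch.randn(256, 1)
w_in = (torch.randn(10, 64) / 10**0.5).requires_grad_(True)
w_out = (torch.randn(64, 1) / 64**0.5).requires_grad_(True)

loss = (torch.relu(x @ w_in) @ w_out - y).pow(2).mean()
g_in, g_out = torch.autograd.grad(loss, (w_in, w_out))
pattern = (x @ w_in > 0)  # the batch's activation pattern defines the current linear patch

for lr in (1e-1, 1e-2, 1e-3, 1e-4, 1e-6, 1e-8):
    with torch.no_grad():
        # Only w_in affects the pre-activation signs, so the step on w_out is irrelevant here.
        flips = ((x @ (w_in - lr * g_in) > 0) != pattern).sum().item()
    print(f"lr={lr:g}: {flips} of {pattern.numel()} pre-activation signs flip")
```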