Is there any sort of regularization in the training process, favouring parameters that aren’t particularly large in magnitude? I suspect that even a very shallow gradient toward parameters with smaller absolute magnitude would favour more compact representations that retain symmetries.
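To make the suspicion concrete: the standard mechanism of this kind is weight decay (an L2 penalty), which adds exactly such a shallow gradient pulling every parameter toward zero. A minimal illustrative sketch (the function names and constants here are my own, not from any particular training setup):

```python
# Weight decay / L2 regularization: the penalty lambda * w^2 contributes
# a gradient term 2 * lambda * w, i.e. a gentle pull toward smaller
# absolute magnitude, proportional to the parameter's current size.

def step(w, grad_loss, lr=0.1, weight_decay=0.01):
    """One gradient-descent step on a scalar parameter with an L2 penalty."""
    return w - lr * (grad_loss + 2 * weight_decay * w)

# With the task gradient held at zero, decay alone shrinks w geometrically:
w = 5.0
for _ in range(100):
    w = step(w, grad_loss=0.0)
print(w)  # strictly between 0 and 5: decayed, but not collapsed
```

Because the pull is proportional to magnitude, among parameter settings that fit the data equally well, the regularized optimum is the smallest-norm one, which is one plausible route to the more compact, symmetry-retaining representations suggested above.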