It doesn’t matter that there are multiple networks with the same performance but different L2 norms. Instead, it suffices that the optimal network differs for different L2 norms, or that the gradient updates during training point in different directions when the network’s L2 norm is constrained. Both are indeed true.
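To make that second point concrete, here’s a minimal sketch (with a made-up toy model and synthetic data, not anything from the post) showing that adding an L2 penalty changes the *direction* of the gradient update, not just its size:

```python
# Minimal sketch (hypothetical toy setup): compare the gradient of the plain
# training loss with the gradient of the same loss plus an L2 penalty.
import torch

torch.manual_seed(0)

# Tiny linear model and synthetic data, just for illustration.
w = torch.randn(10, requires_grad=True)
x = torch.randn(32, 10)
y = torch.randn(32)

def train_loss(w):
    return ((x @ w - y) ** 2).mean()

# Gradient of the unregularized loss.
g_plain = torch.autograd.grad(train_loss(w), w)[0]

# Gradient with an L2 penalty added; lambda is chosen arbitrarily here.
lam = 0.1
g_l2 = torch.autograd.grad(train_loss(w) + lam * (w ** 2).sum(), w)[0]

# Unless w happens to be parallel to the plain gradient, the two updates
# point in different directions, so the cosine similarity is generally < 1.
cos = torch.nn.functional.cosine_similarity(g_plain, g_l2, dim=0)
print(f"cosine similarity between updates: {cos.item():.4f}")
```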
It also makes a lot of sense if you think about it in terms of ordinary statistical learning theory. Assuming for a second that we’re randomly sampling neural networks that achieve a certain train loss at a certain weight norm, there’s some amount of regularization (i.e., some small weight norm) that leads to the lowest test loss.
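As a toy illustration of that (using synthetic ridge-regression data I’m making up here, not anything from the post), sweeping the strength of the L2 penalty typically gives a U-shaped test loss, minimized at some intermediate amount of regularization:

```python
# Minimal sketch (hypothetical data): sweep the L2 penalty for ridge
# regression and look at held-out loss as a function of penalty strength.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 50, 60, 1000
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(scale=2.0, size=n)  # noisy labels
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
    w_hat = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    test_mse = np.mean((X_te @ w_hat - y_te) ** 2)
    print(f"lambda={lam:7.2f}  test MSE={test_mse:.3f}")
```

With few training points relative to the dimension and noisy labels, the unregularized fit tends to overfit, while very strong regularization underfits, so the held-out loss is usually lowest somewhere in between.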