P. comments on Paper+Summary: OMNIGROK: GROKKING BEYOND ALGORITHMIC DATA

P. 4 Oct 2022 16:51 UTC
2 points
0
If the optimal norm is below the minimum you can achieve just by re-scaling, you are trading-off training set accuracy for weights with a smaller norm within each layer. It’s not that weird that the best known way of making this trade-off is by constrained optimization.