There’s no general theoretical reason that I am aware of to expect a relation between the L2 norm and the LLC. The LLC is the coefficient of the $\log n$ term in the asymptotic expansion of the free energy (the negative logarithm of the integral of the posterior over a local region, as a function of sample size $n$), while the L2 norm of the parameter shows up in the constant-order term of that same expansion, if you’re taking a Gaussian prior.
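For reference, the expansion I have in mind, stated schematically (here $U$ is a small neighborhood of the parameter $w^*$, $L_n$ the empirical negative log-likelihood, $\varphi$ the prior, and $\lambda$ the LLC):

$$F_n(U) \;=\; -\log \int_U \varphi(w)\, e^{-n L_n(w)}\, dw \;=\; n L_n(w^*) \;+\; \lambda \log n \;+\; O(\log\log n),$$

and with a Gaussian prior $\varphi(w) \propto e^{-\lVert w \rVert^2 / 2\sigma^2}$ the prior only enters at constant order, contributing a term on the order of $\lVert w^* \rVert^2 / 2\sigma^2$, whereas the LLC $\lambda$ is what multiplies $\log n$.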
It might be that in particular classes of neural networks there is some architecture-specific correlation between the L2 norm and the LLC, but I am not aware of any experimental or theoretical evidence for that.
For example, in the figure below from Hoogland et al. (2024), we see that there are later stages of training in a transformer trained to do in-context linear regression (blue shaded regions) where the LLC is decreasing while the L2 norm is increasing. That is, the model is moving towards a “simpler” parameter with a larger weight norm.
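As a rough illustration of the kind of comparison in that figure, here is a sketch of how one might track both quantities over training checkpoints, using the standard SGLD-based LLC estimator $\hat\lambda = n\beta\,(\mathbb{E}_w[L_n(w)] - L_n(w^*))$ with $\beta = 1/\log n$. The `model`, `loss_fn`, and `data_loader` names are placeholders, and the hyperparameters are illustrative, not the ones used in the paper.

```python
import copy
import math

import torch


def l2_norm(model):
    """Euclidean norm of the flattened parameter vector."""
    return torch.sqrt(sum((p.detach() ** 2).sum() for p in model.parameters())).item()


@torch.no_grad()
def full_loss(model, loss_fn, data_loader):
    """Average loss L_n over the whole dataset."""
    total, count = 0.0, 0
    for x, y in data_loader:
        total += loss_fn(model(x), y).item() * len(x)
        count += len(x)
    return total / count


def iter_cycle(loader):
    """Cycle through a finite data loader indefinitely."""
    while True:
        for batch in loader:
            yield batch


def estimate_llc(model, loss_fn, data_loader, n,
                 num_steps=500, step_size=1e-6, gamma=100.0, beta=None):
    """Crude SGLD-based LLC estimate around the current parameters w*.

    Uses lambda_hat = n * beta * (E_w[L_n(w)] - L_n(w*)), with w sampled from a
    localized, tempered posterior via SGLD (beta = 1/log n by default).
    """
    if beta is None:
        beta = 1.0 / math.log(n)

    loss_star = full_loss(model, loss_fn, data_loader)

    sampler = copy.deepcopy(model)  # SGLD chain starts at w*
    w_star = [p.detach().clone() for p in model.parameters()]

    batches = iter_cycle(data_loader)
    sampled_losses = []
    for _ in range(num_steps):
        x, y = next(batches)
        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # Localization term gamma * (w - w*) keeps the chain near w*.
                drift = n * beta * p.grad + gamma * (p - p0)
                noise = torch.randn_like(p) * math.sqrt(step_size)
                p.add_(-0.5 * step_size * drift + noise)
        sampled_losses.append(loss.item())

    return n * beta * (sum(sampled_losses) / len(sampled_losses) - loss_star)
```

One would then call `l2_norm` and `estimate_llc` on a sequence of checkpoints and plot both curves against training time; in the figure above, the two curves move in opposite directions during the later stages.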
My best current guess is that, in the grokking example, it just happens that the simpler solution also has a smaller weight norm. This could be true in many synthetic settings, for all I know; but in general it is not the case that complexity (at least as far as SLT is concerned) and weight norm are correlated.