kave comments on Grokking, memorization, and generalization — a discussion

kave 31 Oct 2023 1:55 UTC
2 points
just note that at a noisy configuration, you would expect “learnable directions” to be very noisy, and largely cancel each other out, so the gradient will be predominantly noise from the perspective of the circuits that are eventually learned
I think this is saying something like “each parameter participates in multiple circuits, and the value that parameter needs to take is randomly distributed across those circuits”. Is that right?
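To make the cancellation intuition concrete, here is a toy sketch in NumPy. It is my own illustration, not anything from the post: it assumes each circuit contributes a unit-norm "pull" on a shared parameter vector, and the function name, circuit count, and dimension are all hypothetical. When the pulls point in independent random directions their sum has norm on the order of sqrt(K); when they agree, the norm is K.

```python
import numpy as np

def summed_pull_norm(num_circuits, dim, aligned, seed=0):
    """Norm of the sum of unit-norm per-circuit pulls (toy assumption)."""
    rng = np.random.default_rng(seed)
    if aligned:
        # Every circuit wants the same change to the parameters.
        direction = rng.standard_normal(dim)
        direction /= np.linalg.norm(direction)
        pulls = np.tile(direction, (num_circuits, 1))
    else:
        # Each circuit wants an independent random change.
        pulls = rng.standard_normal((num_circuits, dim))
        pulls /= np.linalg.norm(pulls, axis=1, keepdims=True)
    return float(np.linalg.norm(pulls.sum(axis=0)))

# Hypothetical sizes: 1000 circuits sharing a 50-dimensional parameter vector.
random_norm = summed_pull_norm(1000, 50, aligned=False)
aligned_norm = summed_pull_norm(1000, 50, aligned=True)
print(f"random pulls:  {random_norm:.1f}")
print(f"aligned pulls: {aligned_norm:.1f}")
```

In this toy picture, most of each circuit's pull is cancelled by the others, which is roughly what I take "largely cancel each other out" to mean. It says nothing about how real gradients are distributed.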