I’m curious how well test loss is predicted by unprincipled metrics in this setup. For instance, how well is it predicted by the L2 norm of the weights? What about average_log_probability_on_train?
(Average log prob on train is test loss, up to sign, if you assume that test labels are unrelated to the model’s train predictions and that train log probs have the same distribution as test log probs. You could also compute average_log_probability_on_test, a metric you can get without needing test labels as long as you have test inputs.)
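(To make this concrete, here’s a minimal PyTorch-style sketch of what I have in mind. The function names are mine, the model is assumed to output class logits, and I’m reading “average log probability” as the mean of log p(class | input) over inputs and over all classes, which is the reading under which its negative tracks test loss given the assumptions above; other readings are possible.)

```python
import torch
import torch.nn.functional as F

def weight_l2_norm(model: torch.nn.Module) -> torch.Tensor:
    # L2 norm of all parameters, treated as one flattened vector.
    return torch.cat([p.detach().flatten() for p in model.parameters()]).norm()

@torch.no_grad()
def average_log_probability(model: torch.nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    # Mean of log p(class | input) over inputs AND over all classes.
    # Under the assumptions above (labels uninformative about the model's
    # outputs, train/test log-prob distributions matching), the negative of
    # this quantity on train inputs estimates test loss. No labels are
    # needed, so it can equally be computed on test inputs alone.
    log_probs = F.log_softmax(model(inputs), dim=-1)
    return log_probs.mean()
```

average_log_probability(model, x_test) is then the label-free test-input version.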
Using almost the same training parameters as above (I used full batch and train_frac=0.5 to get faster & more consistent grokking, but I don’t think this matters here), I did a few runs and the results all looked more or less like this. The training process of such toy models doesn’t contain so many bits of interesting information, so I wouldn’t be surprised if a variety of different metrics captured this process in this case. (E.g. the training dynamics can also be modelled by an HMM, see here.)
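(Schematically, the runs amounted to something like the self-contained toy sketch below. The modulus, architecture, optimizer, and hyperparameters are placeholder choices rather than the exact setup from the post; the relevant parts are just the full-batch updates, the train_frac=0.5 split, and the metrics being logged, with average log prob read the same way as in the parent comment.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for the setup: modular addition, small MLP, full-batch
# training, train_frac = 0.5. All specifics here are placeholders.
p = 53
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = len(pairs) // 2                     # train_frac = 0.5
x_train, y_train = pairs[perm[:n_train]], labels[perm[:n_train]]
x_test, y_test = pairs[perm[n_train:]], labels[perm[n_train:]]

embed = torch.nn.Embedding(p, 64)
head = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU(),
                           torch.nn.Linear(256, p))
params = list(embed.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)

def forward(x):
    # Embed both operands, concatenate, and classify the sum mod p.
    return head(embed(x).flatten(start_dim=1))

history = []
for step in range(20_000):
    optimizer.zero_grad()
    train_loss = F.cross_entropy(forward(x_train), y_train)   # full batch
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        history.append({
            "step": step,
            "train_loss": train_loss.item(),
            "test_loss": F.cross_entropy(forward(x_test), y_test).item(),
            "weight_l2_norm": torch.cat([q.flatten() for q in params]).norm().item(),
            "avg_log_prob_train": F.log_softmax(forward(x_train), dim=-1).mean().item(),
            "avg_log_prob_test": F.log_softmax(forward(x_test), dim=-1).mean().item(),
        })
```

Plotting the logged series against test loss gives curves like the plot below.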
Here’s the plot, which is very similar to Experience Machine’s:
My conclusion from this is that the LLC and the L2 norm measure basically the same thing in this setup. They don’t always coincide: for further comparison of the LLC with more unprincipled metrics in more complex setups, see the comparisons with weight norm / Hessians in figs. 22, 23, and 25 here, and the comparisons with Hessian-based methods and ablations here.