I used almost the same training parameters as above. The only changes were full-batch training and train_frac=0.5, which gave faster and more consistent grokking, but I don’t think this matters here.
I did a few runs and the results all looked more or less like this. The training process of such toy models doesn’t contain many bits of interesting information, so I wouldn’t be surprised if a variety of different metrics captured it in this case. (E.g., the training dynamics can also be modelled by an HMM, see here.)