We also investigate how the distribution of the gradients changes throughout training. Here we plot the histogram of the gradients of the output MLP weights for the Pythia 125m model.
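As a rough illustration of what such a histogram involves, here is a minimal sketch. It is not the code used for the plot: it assumes a toy linear layer with a squared-error loss as a stand-in for the real model, and the helper names `weight_gradient` and `histogram` are hypothetical. The idea is to recompute the gradient of one weight matrix on a batch of data, flatten it, and count the entries into fixed bins.

```python
import random


def weight_gradient(W, inputs, targets):
    """Batch-mean gradient of 0.5 * ||W x - t||^2 with respect to W.

    Assumption: a toy linear layer stands in for the real MLP weights.
    """
    rows, cols = len(W), len(W[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for x, t in zip(inputs, targets):
        # Forward pass: y = W x
        y = [sum(W[i][j] * x[j] for j in range(cols)) for i in range(rows)]
        # Backward pass: dL/dW[i][j] = (y[i] - t[i]) * x[j], averaged over the batch
        for i in range(rows):
            err = y[i] - t[i]
            for j in range(cols):
                grad[i][j] += err * x[j] / len(inputs)
    return grad


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins on [lo, hi].

    Values outside the range are clipped into the first or last bin.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    return counts


# Hypothetical toy data: 4x8 weight matrix, batch of 16 random examples.
rng = random.Random(0)
rows, cols, batch = 4, 8, 16
W = [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
inputs = [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(batch)]
targets = [[rng.gauss(0, 1) for _ in range(rows)] for _ in range(batch)]

grad = weight_gradient(W, inputs, targets)
flat = [g for row in grad for g in row]  # one value per weight
counts = histogram(flat, -2.0, 2.0, 20)
print(counts)
```

Repeating this at each saved checkpoint, with the same batch and the same bin edges, would give one histogram per training step that can be compared across training.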
Very interesting post! How are you estimating the gradients for the animation? I noticed that the parameter gradients are not saved in the checkpoints.