The Pythia models is an amazing source. This is a great tool and work.
One experiment that could maybe help disentangle idiosyncrasies from robust behaviors would be to run these experiments with a pair of seeds on each model size. With the currently trained models this could maybe just involve plotting the exact same curves comparing the “deduplicated” versus “non deduplicated” trained models since dataset deduplication likely has a limited impact on the model averaged training dynamic of the weights as investigated here (there are obviously countless more experiments that could be added but this one is maybe an easy one).
The Pythia models is an amazing source. This is a great tool and work.
One experiment that could maybe help disentangle idiosyncrasies from robust behaviors would be to run these experiments with a pair of seeds on each model size. With the currently trained models this could maybe just involve plotting the exact same curves comparing the “deduplicated” versus “non deduplicated” trained models since dataset deduplication likely has a limited impact on the model averaged training dynamic of the weights as investigated here (there are obviously countless more experiments that could be added but this one is maybe an easy one).