TLDR: I’m scared Figure 3 is wrong (the one with training loss/parameters).
WHY:
From page 2: "… we perform our analysis on the smoothed training loss which is an unbiased estimate of the test loss"
The claim is true, but the smoothed loss is an average over the whole training run. For a fixed compute budget, larger models take fewer gradient steps and thus sit at higher loss for a larger fraction of training time. If Figure 3 estimates training loss this way, I would expect the plotted loss of the larger models to be biased upward, i.e., overestimated relative to their final loss.
EXPERIMENT:
If anyone has access to the training-loss .csv files, we can reproduce Figure 3 using only the loss from the last 100 iterations of each run. All my concerns go away if we get the same plot.
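The recomputation I have in mind is tiny; here is a sketch, assuming per-run CSVs with a `loss` column (the file layout and column name are my guesses, adjust to whatever the logs actually contain):

```python
# Sketch: estimate each run's training loss as the mean over its last 100
# logged iterations, instead of a smoothed average over all of training.
# The CSV layout (one row per logged step, a "loss" column) is an assumption.
import csv
from statistics import mean

def final_loss(csv_path, last_k=100):
    """Mean training loss over the last `last_k` logged iterations of a run."""
    with open(csv_path, newline="") as f:
        losses = [float(row["loss"]) for row in csv.DictReader(f)]
    return mean(losses[-last_k:])
```

Running `final_loss` over every run and re-plotting loss against parameter count for each compute budget would give a Figure 3 that is free of the averaging bias; if it matches the published plot, the concern is moot.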