Sorry if this is obvious, but where does the “irreducible” loss come from? Wouldn’t that also be a function of the data, or I guess the data’s predictability?
Yes, it’s a function of the data, as well as the model architecture / training routine. See my reply in this thread.
Also, the value of the irreducible loss isn’t important for the conclusions discussed in this post. What we care about is how loss varies with data and parameter count.
Those, too, are functions of the data, but different groups training large LMs use qualitatively similar datasets, so I would expect the conclusions here to apply across the board.
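For concreteness, a sketch of the kind of fit I have in mind, assuming the standard Chinchilla-style parameterization (which may differ in detail from the exact fit used in the post):

L(N, D) = E + A / N^α + B / D^β

where N is parameter count, D is dataset size in tokens, and E is the irreducible term. The conclusions turn on the A / N^α and B / D^β terms, i.e. on how loss falls as N and D grow, not on the particular value of E.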