Yes, it’s a function of the data, as well as the model architecture / training routine. See my reply in this thread.
Also, the value of the irreducible loss isn’t important for the conclusions discussed in this post. What we care about is how loss varies with data and parameter count.
Those, too, are functions of the data, but different groups training large LMs use qualitatively similar datasets, so I would expect the conclusions here to apply across the board.
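For concreteness, here's a minimal sketch of the kind of scaling form I have in mind. The functional shape follows the Kaplan et al. (2020) L(N, D) power law, with an explicit irreducible term added as in Henighan et al. (2020); all the constants below are illustrative placeholders, not fitted values:

```python
def loss(n_params: float, n_tokens: float,
         l_irreducible: float = 1.7,   # hypothetical offset; depends on the data
         n_c: float = 8.8e13, alpha_n: float = 0.076,   # illustrative constants
         d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Predicted cross-entropy loss as a function of parameter and token count.

    The irreducible term only shifts the whole curve up or down; the
    conclusions in the post depend on the power-law part, i.e. on how
    loss *changes* with n_params and n_tokens.
    """
    reducible = ((n_c / n_params) ** (alpha_n / alpha_d)
                 + d_c / n_tokens) ** alpha_d
    return l_irreducible + reducible

# Scaling up parameters or data changes only the reducible part:
print(loss(1e9, 1e11), loss(2e9, 1e11), loss(2e9, 2e11))
```

Note that swapping in a different dataset would change l_irreducible (and, in principle, the constants), but the qualitative shape of the curve is the same, which is why I'd expect the conclusions to carry over.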