Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.
I’ve fine-tuned GPT models on a variety of datasets of widely varying sizes, though not this particular dataset (which doesn’t exist yet).
Below I list some key things to note. Also see here for related discussion. These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.
GPT performance tends to scale smoothly and gradually with data/model size, over multiple orders of magnitude (see the curve-fitting sketch after this list).
In terms of subjective response, you don’t need much data to get GPTs to the level of “hey, it kinda gets it!”.
You may need several orders of magnitude more data to reach the point of saturation where the model can’t improve with additional data.
Incomplete mastery usually looks more like “randomly failing X% of the time” than “understanding X% of the content of the task,” which can make it difficult to assess quality (or quality differences) at a glance.
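To give a feel for what “smooth and gradual” means in practice, here’s a minimal sketch of fitting the usual power-law-plus-offset form to validation losses measured at several dataset sizes. The loss values below are made-up placeholders, not numbers from my runs; the point is only the shape of the curve and how you’d extrapolate from it.

```python
# Sketch: fit L(D) = a * D**(-b) + c to val losses from a subsampling sweep.
# The numbers below are hypothetical placeholders, NOT results from my experiments.
import numpy as np
from scipy.optimize import curve_fit

# (dataset size in tokens, val loss) pairs -- placeholders for illustration only
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.9, 3.6, 3.3, 3.1, 2.95])

def power_law(d, a, b, c):
    # c is the irreducible loss you'd approach with unlimited data
    return a * d ** (-b) + c

params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.2, 2.5], maxfev=10000)
a, b, c = params
print(f"fit: loss ~= {a:.2f} * D^(-{b:.3f}) + {c:.2f}")

# Extrapolate: how much data until we're within 0.05 nats of the fitted asymptote?
d_needed = (a / 0.05) ** (1 / b)
print(f"~{d_needed:.2e} tokens to get within 0.05 of the asymptote")
```

The takeaway is that each extra increment of quality tends to cost a multiplicative increase in data, which is why “kinda gets it” arrives early and saturation arrives orders of magnitude later.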
For a concrete example, here is a data scaling experiment I did with GPT-J (6.1B params) on the tumblr post dataset I use for my tumblr bot. My full dataset is roughly 4 times as large as the 30M word dataset proposed here, i.e. the 30M word dataset would be roughly as big as the 25% subsample shown in the report.
The linked report only shows val loss, which is not very interpretable, but at least conveys that I haven’t reached diminishing returns yet. This seems plausible from subjective evidence, as the model still sometimes misunderstands tumblr lingo / the conversational structure of the data / etc.
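For reference, here’s roughly what a subsampling sweep like that looks like in code. This is a sketch, not my actual training setup: I’m substituting a small stand-in model (“gpt2”) and a public corpus (wikitext) so it runs on modest hardware, whereas the real runs used GPT-J 6.1B on my own tumblr data with a different pipeline.

```python
# Sketch of a data-scaling sweep: fine-tune on 25% / 50% / 100% subsamples of the
# training set and compare val loss. Model and corpus are stand-ins.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "gpt2"  # stand-in; the real experiment used EleutherAI/gpt-j-6B
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("wikitext", "wikitext-2-raw-v1")  # stand-in corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

results = {}
for frac in (0.25, 0.5, 1.0):
    train = tokenized["train"].shuffle(seed=0)
    train = train.select(range(int(len(train) * frac)))

    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    args = TrainingArguments(
        output_dir=f"scaling-{int(frac * 100)}pct",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        report_to="none",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train,
                      eval_dataset=tokenized["validation"], data_collator=collator)
    trainer.train()
    results[frac] = trainer.evaluate()["eval_loss"]

print(results)  # val loss per subsample fraction
```

If the val losses from a sweep like this are still dropping noticeably from the 50% to the 100% run, that’s the “haven’t reached diminishing returns yet” signal I’m describing above.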