Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.
I’ve fine-tuned GPT models on a variety of datasets of widely varying sizes, though not this particular dataset (which doesn’t exist yet).
Below I list some key things to note. Also see here for related discussion. These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.
GPT performance tends to scale smoothly and gradually with data/model size, over multiple orders of magnitude (see the curve-fitting sketch after this list).
In terms of subjective response, you don’t need much data to get GPTs to the level of “hey, it kinda gets it!”.
You may need several orders of magnitude more data to reach the point of saturation where the model can’t improve with additional data.
Incomplete mastery usually looks more like “randomly failing X% of the time” than “understanding X% of the content of the task,” which can make it difficult to assess quality (or quality differences) at a glance.
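To give a feel for what “smooth and gradual” means in practice, here’s a minimal sketch of fitting the usual power-law-plus-offset form to validation losses measured at several dataset sizes. The loss values below are made-up placeholders, not numbers from my runs; the point is only the shape of the curve and how you’d extrapolate from it.

```python
# Sketch: fit L(D) = a * D**(-b) + c to val losses from a subsampling sweep.
# The numbers below are hypothetical placeholders, NOT results from my experiments.
import numpy as np
from scipy.optimize import curve_fit

# (dataset size in tokens, val loss) pairs -- placeholders for illustration only
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([3.9, 3.6, 3.3, 3.1, 2.95])

def power_law(d, a, b, c):
    # c is the irreducible loss you'd approach with unlimited data
    return a * d ** (-b) + c

params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.2, 2.5], maxfev=10000)
a, b, c = params
print(f"fit: loss ~= {a:.2f} * D^(-{b:.3f}) + {c:.2f}")

# Extrapolate: how much data until we're within 0.05 nats of the fitted asymptote?
d_needed = (a / 0.05) ** (1 / b)
print(f"~{d_needed:.2e} tokens to get within 0.05 of the asymptote")
```

The takeaway is that each extra increment of quality tends to cost a multiplicative increase in data, which is why “kinda gets it” arrives early and saturation arrives orders of magnitude later.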
For a concrete example, here is a data scaling experiment I did with GPT-J (6.1B params) on the tumblr post dataset I use for my tumblr bot. My full dataset is roughly 4 times as large as the 30M word dataset proposed here, i.e. the 30M word dataset would be roughly as big as the 25% subsample shown in the report.
The linked report only shows val loss, which is not very interpretable, but at least conveys that I haven’t reached diminishing returns yet. This seems plausible from subjective evidence, as the model still sometimes misunderstands tumblr lingo / the conversational structure of the data / etc.
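For reference, here’s roughly what a subsampling sweep like that looks like in code. This is a sketch, not my actual training setup: I’m substituting a small stand-in model (“gpt2”) and a public corpus (wikitext) so it runs on modest hardware, whereas the real runs used GPT-J 6.1B on my own tumblr data with a different pipeline.

```python
# Sketch of a data-scaling sweep: fine-tune on 25% / 50% / 100% subsamples of the
# training set and compare val loss. Model and corpus are stand-ins.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "gpt2"  # stand-in; the real experiment used EleutherAI/gpt-j-6B
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("wikitext", "wikitext-2-raw-v1")  # stand-in corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

results = {}
for frac in (0.25, 0.5, 1.0):
    train = tokenized["train"].shuffle(seed=0)
    train = train.select(range(int(len(train) * frac)))

    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    args = TrainingArguments(
        output_dir=f"scaling-{int(frac * 100)}pct",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        report_to="none",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train,
                      eval_dataset=tokenized["validation"], data_collator=collator)
    trainer.train()
    results[frac] = trainer.evaluate()["eval_loss"]

print(results)  # val loss per subsample fraction
```

If the val losses from a sweep like this are still dropping noticeably from the 50% to the 100% run, that’s the “haven’t reached diminishing returns yet” signal I’m describing above.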