(cross-posting this comment from E. S. Yudkowsky’s Facebook with some edits / elaboration)
Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.
Note this is similar to the “self-explaining AI” idea I explored in early 2020 and threw together a paper on (I’m hesitant to link to it because it’s not that great a paper and much of the discussion there is CNN-specific, but here it is). I can see how producing “thoughts” could help us trust, or determine, how much a model really understands what’s going on or how to make a good story.
However, I could also see the “thoughts” output misleading people, who might mistake the model’s explanations as mapping onto the calculations going on inside the model to produce an output. The way GPT-3 works is, I suspect, very far from how humans think. GPT-3 is very bad at a lot of common-sense and physics-based reasoning, for instance, yet based on the thoughts output a user might be misled into thinking the model understands common-sense notions or physics, when it’s really spouting off a version of some stuff it got from its training data.
Any work along these lines would definitely need empirical testing / studies to show that the extra “thoughts” output is useful to end-users in some way (like predicting failure modes or helping debug failures).
Also, I’m unclear on what constitutes a “run”… roughly how long does the text have to be, in words, to have a chance at getting $20,000?
We’re guessing 1,000 steps per reasonably completed run (more or less, it doesn’t have to be exact) and maybe 300 words per step, mostly ‘thought’. The ‘thoughts’ can be relatively stream-of-consciousness once the author is accustomed to writing them (we hope), and the dungeon run doesn’t have to be Hugo-quality in its plotting, so it’s not like we’re asking for a 300,000-word edited novel.
The sample Nate linked is 30 pages and 12,267 words, i.e. roughly 410 words per page. So a 300,000-word run works out to ~730 pages.
$20,000/300,000 words = $1 per 15 words. If an author writing it manually could average 15 wpm, that would be $60/hour.
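For anyone who wants it spelled out, here is the arithmetic behind those figures (a quick sketch using only the numbers stated above; the 15 wpm writing speed is the assumption from the previous sentence, not an official figure):

```python
# Arithmetic from the estimates in the comments above.
words_per_run = 1_000 * 300                      # ~300,000 words per run (the stated guess)
words_per_page = 12_267 / 30                     # ~409 words/page in the linked sample
pages_per_run = words_per_run / words_per_page   # ~730 pages

bounty_per_run = 20_000                          # dollars
dollars_per_word = bounty_per_run / words_per_run              # ~$0.067, i.e. $1 per ~15 words
words_per_minute = 15                            # assumed sustained writing speed
dollars_per_hour = dollars_per_word * words_per_minute * 60    # $60/hour

print(round(pages_per_run), round(dollars_per_hour))  # roughly 730 and 60
```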
Sorry, I missed that somehow. Thanks.
I think the key point on avoiding this is the intervening-on-the-thoughts part:
“An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts.”
So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.
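As a toy illustration of what “intervening on the thoughts” could look like at inference time (my own sketch, using a stand-in model and an invented <Thought>/<Story> tag format, not the project’s actual spec): condition the story continuation on an explicit, human-visible thought, then swap the thought and check that the story changes in the corresponding way.

```python
from transformers import pipeline

# Stand-in model; a real setup would use a model fine-tuned on
# prompt -> thought -> story transcripts (the tag format below is invented).
generator = pipeline("text-generation", model="gpt2")

def next_story_text(prompt: str, thought: str) -> str:
    """Generate the next bit of story text conditioned on an explicit thought."""
    conditioned = f"{prompt}\n<Thought>: {thought}\n<Story>:"
    return generator(conditioned, max_new_tokens=80)[0]["generated_text"]

prompt = "You are standing at the mouth of a dark cave."

# Same prompt, two different thoughts: if the thoughts really drive the output,
# the two continuations should differ in the corresponding, sensible ways.
story_a = next_story_text(prompt, "Make the cave feel menacing; foreshadow the dragon.")
story_b = next_story_text(prompt, "A friendly merchant appears and offers the player a lantern.")
```

The training-side counterpart is that the thoughts sit in the context the story text is generated from, so they constrain the computation rather than being purely post-hoc explanations.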
I’ve fine-tuned GPT models on a bunch of different datasets of different sizes, although not this particular dataset (which doesn’t exist yet).
Below I list some key things to note. Also see here for related discussion. These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.
GPT performance tends to scale smoothly and gradually with data/model size, over multiple orders of magnitude.
In terms of subjective response, you don’t need much data to get GPTs to the level of “hey, it kinda gets it!”.
You may need several orders of magnitude more data to reach the point of saturation where the model can’t improve with additional data.
Incomplete mastery usually looks more like “randomly failing X% of the time” than “understanding X% of the content of the task,” which can make it difficult to assess quality (or quality differences) at a glance.
For a concrete example, here is a data scaling experiment I did with GPT-J (6.1B params) on the tumblr post dataset I use for my tumblr bot. My full dataset is roughly 4 times as large as the 30M word dataset proposed here, i.e. the 30M word dataset would be roughly as big as the 25% subsample shown in the report.
The linked report only shows val loss, which is not very interpretable, but at least conveys that I haven’t reached diminishing returns yet. This seems plausible from subjective evidence, as the model still sometimes misunderstands tumblr lingo / the conversational structure of the data / etc.
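The skeleton of that kind of data-scaling experiment is simple. Here is a rough sketch (my reconstruction, not the actual GPT-J/tumblr setup; the training and eval step is left as a placeholder for whatever stack you use, e.g. the HF Trainer or mesh-transformer-jax for GPT-J):

```python
import random

def subsample(examples, fraction, seed=0):
    """Reproducible random subset, so smaller subsets nest inside larger ones."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[: int(len(shuffled) * fraction)]

def finetune_and_eval(train_subset):
    """Placeholder: fine-tune the base model on `train_subset` with your usual
    training stack and return validation loss on a held-out set."""
    raise NotImplementedError("plug in your fine-tuning + eval code here")

def data_scaling_curve(examples, fractions=(0.1, 0.25, 0.5, 1.0)):
    """Val loss at each dataset fraction; if loss is still clearly dropping
    between the last two points, the full dataset hasn't hit diminishing returns."""
    return {f: finetune_and_eval(subsample(examples, f)) for f in fractions}
```

Recording subjective checks (does the model stop misreading the lingo or the conversational structure?) alongside the loss numbers helps, since val loss alone is hard to interpret.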
Using the stated length estimates per step (1,000 steps at ~300 words each), a single run would constitute approximately 600 pages of single-spaced text. This is a lot of writing.