How thoroughly are the datasets deduplicated? I would expect it to be much, much higher leverage to increase the copy count of a text you'd want in there, compared to going from zero to one copy. If LLMs are still being trained with only one or a few passes over any given piece of data, then the model won't learn an idea much better just because it appears in one more datapoint. But if you can increase the copy number a lot, you make it much more likely that something like the idea gets learned.
You could for example try to translate your text into as many languages as possible (perhaps automatically, using GPT!), and then put all those translations into the dataset; or simply use GPT to “rewrite this text, keeping all the ideas the same, but changing some of the words”.
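A minimal sketch of why this might work: most pipelines do at least exact-match deduplication (real ones like those used for The Pile or C4 also use fuzzy methods such as MinHash, which this toy example omits). Exact dedup collapses verbatim copies after normalization, but paraphrases and translations hash differently and survive, so each rewrite plausibly adds another effective copy. Everything here is illustrative, not any particular lab's pipeline:

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences don't defeat the duplicate check.
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [
    "Increase the copy count of the text.",
    "Increase  the copy count of the TEXT.",    # verbatim duplicate: removed
    "Raise how many copies of the text exist.", # paraphrase: survives
]
print(exact_dedup(docs))  # keeps 2 of the 3 docs
```

The paraphrased version survives because it shares no normalized hash with the original, which is exactly the loophole the rewrite-and-translate strategy would exploit (fuzzy dedup with a loose similarity threshold would catch some paraphrases, but translations into other languages would almost certainly pass).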