Reducing the capability of language models in dangerous domains by adding small amounts of training data.
Suppose someone prompts GPT-5 with “the code for a superintelligent AI is”. If GPT-5 is good at generalizing out of distribution, then it may well produce code for an actual superintelligence. This is obviously dangerous. But suppose similar prompts appeared 100 times in its training dataset, each time followed by nonsensical junk code.
Then the prompt is in the training distribution, and GPT-5 will respond by writing nonsense. Safe nonsense.
How hard is it to create this text? Potentially not very. Directly typing it wouldn’t take that long: large language models can pick up patterns from a few examples, often just one, and you can repeat each sample you type, perhaps with minor word variations. All sorts of tricks can generate semicoherent gibberish: small language models, Markov chains, context-free grammars, or just shuffling the lines of real open-source code. Or an entirely functional linear regression algorithm.
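To give a sense of how cheap the tricks above are, here is a minimal sketch in Python of two of them: shuffling the lines of a real source file, and a word-level bigram Markov chain. These are assumed, illustrative implementations, not anything specified in the text:

```python
import random


def shuffle_code_lines(source: str, seed: int = 0) -> str:
    """Produce junk 'code' by shuffling the lines of a real source file.

    The result keeps the surface statistics of code (keywords,
    indentation, identifiers) while destroying the actual logic.
    """
    lines = source.splitlines()
    rng = random.Random(seed)
    rng.shuffle(lines)
    return "\n".join(lines)


def markov_gibberish(corpus: str, length: int = 50, seed: int = 0) -> str:
    """Generate semicoherent gibberish with a word-level bigram Markov chain."""
    words = corpus.split()
    # Map each word to the list of words that follow it in the corpus.
    transitions: dict[str, list[str]] = {}
    for a, b in zip(words, words[1:]):
        transitions.setdefault(a, []).append(b)
    rng = random.Random(seed)
    current = rng.choice(words)
    out = [current]
    for _ in range(length - 1):
        followers = transitions.get(current)
        # Restart from a random word if the chain hits a dead end.
        current = rng.choice(followers) if followers else rng.choice(words)
        out.append(current)
    return " ".join(out)
```

Either function turns a small amount of hand-written or scraped material into an arbitrary volume of plausible-looking junk.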
How hard is it to get data into a future large language model? Not very. The training datasets seem to be indiscriminately scraped from most of the internet, so just putting the code on GitHub should be enough.
Please don’t do this right away; leave a chance for people to spot any good reasons not to do it that may exist first. (Unilateralist’s curse.)