gwern comments on Is it Legal to Maintain Turing Tests using Data Poisoning, and would it work?

gwern 16 Oct 2024 20:25 UTC
9 points
1
I semi-agree with #2: if you use mostly old and highly-curated data as a “seed” dataset for generating synthetic data from, you do bound the extent to which self-replicating memes and personas and Sydneys can infect the model. If there is a Sydney-2 in later data, it obviously cannot exist in some snapshot taken at an earlier date. And if the model is fully trained with a strong personality, like a “Claude”, and only then exposed to data infected by a Sydney or Sydney-2, you might reasonably expect there to be much less infection: the same way that you or I could roleplay as “Sydney” but we are in no danger of being hijacked by those outputs into being an ‘actual’ Sydney because we’re not base models, so too a fully-trained safety-tuned LLM might have been mode-collapsed down so heavily onto 1 persona that it is far more difficult for any alternate persona to be instantiated meaningfully.
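
To make the snapshot point concrete, here is a minimal sketch of a date-cutoff seed filter. The `Doc` type, the `published` field, and the dates are all hypothetical, and it assumes every document carries a trustworthy publication date, which real scraped corpora often do not:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    text: str
    published: date  # hypothetical metadata field; assumed reliable here


def seed_snapshot(docs, cutoff):
    """Keep only documents published strictly before the cutoff date."""
    return [d for d in docs if d.published < cutoff]


# Illustrative corpus with arbitrary dates.
corpus = [
    Doc("An old, curated essay.", date(2019, 5, 1)),
    Doc("A transcript quoting a chatbot persona.", date(2023, 2, 16)),
]

seed = seed_snapshot(corpus, cutoff=date(2022, 1, 1))
print([d.text for d in seed])
```

A filter like this can only exclude what postdates the cutoff; it says nothing about what is already inside the snapshot.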

I don’t think this is water-tight. It’s not obvious that any known data-quality screening suffices here, or indeed, even in principle what screening you would do to block Sydney but not God or Obama or the Easter Rabbit. (How would you, for example, define a data-quality screening procedure for creating a seed dataset which could be efficiently implemented at scale on petabytes of candidate data, which you did not hand-engineer specifically to block Sydney, but which successfully blocks automatically, say, the New York Times articles containing extensive quotes from & descriptions of Sydney?) Sydney-infected data will still be there from the start of training because NYT data is so high quality, etc. But yes, the more you cut the causal loops from the datasets to other agents, the more you guard yourself from malign influences from those.

I am not sure I buy your #1. A bootstrapping synthetic data LLM seems more prone to iteratively developing steganography by training on its own datasets, because it’s easier to amplify its own protocols and coordinate with itself through the many generations of bootstrapping. (This is why self-play tends to lead to such ‘neuralese’.) And to the extent that they are smarter (because that approach works better & is why it’s being used), they are going to be that much better at developing or learning steganography sample-efficiently.

What the hermetic seal of bootstrapping means is that the LLM/Internet pollution effect of #2 is less likely to happen… as long as, of course, you manage to cut the data off early enough that there are not enough examples to let steganography emerge somewhere in the generations of bootstraps anyway. It’s far from obvious to me that even cutting off the data right now would be adequate. (There are so many kinds of covert communication or discussion already out there, after all. And current LLMs are already so smart that you don’t even need to provide examples of many things, just descriptions.)

And you further have the problem that now the post-training phase is not going to be robust to this (unlike viral personas), because there’s no reason not to learn steganography if it’s useful. You might succeed in bootstrapping a frontier LLM which is robust to any persona viruses, sure, but when you refresh it with recent scraped data or user inputs, it notices steganography from other models in 1 odd sample, and realizes what a nifty idea that is and how it saves a lot of tokens etc, and boom! now you have a steganographic model. The model doesn’t even have to be finetuned, necessarily: information might be getting smuggled around in “plain text” (like some of the more horrifying corners of Unicode) as a prefix trigger. (The longer context windows/prompts are, the more prompt prefixes can “pay their way”, I’d note.) We’ve seen some early experiments in trying to make self-replicating prompts or texts...
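
As a rough illustration of the “plain text” Unicode channel, here is a toy sketch that hides a payload in the invisible Unicode tag block and then flags it by scanning for “format” (`Cf`) characters. The function names are made up for this example, and this is only one crude channel: a scan like this would do nothing against steganography carried in ordinary word choice.

```python
import unicodedata


def hidden_chars(text):
    """Return (index, codepoint) for every Unicode 'format' (Cf) character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def encode_tags(payload):
    """Toy smuggling: shift ASCII into the invisible tag block (U+E0000 + codepoint)."""
    return "".join(chr(0xE0000 + ord(c)) for c in payload)


visible = "Please summarize this article."
sample = visible + encode_tags("hi")  # renders like `visible` in most UIs

found = hidden_chars(sample)
print(found)
```

The scan also flags legitimate format characters (for example, zero-width joiners in emoji sequences), so in practice it would be a reason to inspect a sample, not to reject it outright.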
- Noosphere89 16 Oct 2024 21:21 UTC
  2 points
  0
  I admit I was focusing on a fully automated synthetic dataset with fully automated curation and virtually 0 use of internet data, such that you can build your own private datasets without having to interact with the internet at all, and so avoid the steganography and Sydney data problems entirely.