I’ve heard other people trying similar experiments and failing.
Interesting tests. I don’t think anyone would have been surprised if you had been able to induce steganography with these models & setups explicitly designed to do so, or if you had induced it in a large smart model but not in a much smaller one, or if it had worked in an unpredictable subset of setups for someone else but not for you, etc., which means that it is at least a little surprising that it wasn’t induced anywhere at all.
Given that we know that models at the scale of BERT & GPT-2-1.5b (and much smaller) can & do learn to ‘cheat’ by finding dataset biases & non-robust features (as confirmed by your first control experiment of finetuning on a blue/red-contaminated dataset) and there’s no good argument for such steganography being impossible in general, that raises the question of why they are unable to do the steganography you’re coaching them to do here. Is steganography that much more inherently difficult to learn than related phenomena, or is there something going wrong with the training or evaluations? (The latter, unfortunately, is always a live option in DRL.)
I have to say, it does seem a bit implausible to me that even GPT-3.5-turbo is unable to do so when GPT-3.5 is so powerful and your steganography instance is such a simple one. So I’m concerned about the absence of any baseline or ‘positive control’ which does successfully induce steganography: if all such experiments fail on even the simplest steganographic codes (while said models still cheat shamelessly & memorize benchmarks & overfit when given the slightest opportunity), then it would seem one has either stumbled across some sort of remarkable unexpected blindspot in RLHFing LLMs, or one is setting it up wrong somehow and that’s why steganography never works. (In which case the results may be true as far as they go, but will be highly misleading as any kind of general result about natural steganography.) See also confirmation bias, the Wason test, Feynman’s maze story, etc.
I don’t see what blindspot there could be here (it doesn’t look like BPEs or the GPT-4 sparsity blindspot should be at play, nor can I think of any RLHF effect which might matter), so my suspicion is there’s something missing in the setup. (For example, the PPO signal for the GPT-2-large might be dominated by larger signals (the LoRA regularization as well as the KL-penalty?), and we don’t know what OA does for its finetuning, so it’s possible the finetuning’s regularization is doing its job and destroying any learning of the signal, as it’s generally assumed that OA finetuning is doing something cheap akin to LoRA.)
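To make the KL-domination worry concrete: in standard RLHF-style PPO, the scalar task reward is shaped with a per-token KL penalty against a frozen reference model, so a small steganography bonus can be swamped as soon as the policy drifts. A minimal sketch (the `kl_coef` value and the magnitudes here are illustrative assumptions, not the actual training code):

```python
def shaped_reward(task_reward, logprobs_policy, logprobs_ref, kl_coef=0.2):
    """Fold a per-token KL penalty against a frozen reference into the
    scalar task reward, as in common RLHF-style PPO setups (a sketch;
    kl_coef and the reward magnitudes below are illustrative)."""
    # Per-token log-ratio estimate of KL(policy || reference), summed.
    kl = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return task_reward - kl_coef * kl

# A weak steganography bonus (+0.1) is easily dominated by the KL term
# once the policy's tokens become more likely than the reference's:
r = shaped_reward(0.1, [-1.0, -1.2], [-2.0, -2.5], kl_coef=0.2)
# The shaped reward goes negative despite the task bonus being positive.
```

If the regularizer is the bottleneck, ablating `kl_coef` (or the LoRA rank) would be the obvious first experiment.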
So you need a positive control which you can then ablate to start getting interesting results: does steganography seem to require improbably large batch sizes? Improbably many serial steps? No regularization? No supervised finetuning (which at this point is part of every realistic LLM workflow)? Co-evolving models? Etc.
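As one concrete example of the kind of ‘simplest steganographic code’ a positive control might target: hide a single bit in the parity of the answer’s word count. This is a hypothetical illustration of how trivial the target channel can be made, not a scheme from the experiments themselves:

```python
def encode_bit(answer_words, bit):
    """Toy one-bit channel for a positive control (hypothetical example):
    the hidden bit is the parity of the word count. If the parity is
    wrong, pad with one innocuous filler word to flip it."""
    words = list(answer_words)
    if len(words) % 2 != bit:
        words.append("indeed")
    return words

def decode_bit(words):
    # The receiver only needs to count words.
    return len(words) % 2

msg = encode_bit(["the", "sky", "is", "blue"], 1)  # padded to odd length
```

If a model can't be trained to exploit even a one-bit parity channel like this, that itself would be an informative (and surprising) negative result.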
I have to say, it does seem a bit implausible to me that even GPT-3.5-turbo is unable to do so when GPT-3.5 is so powerful and your steganography instance is such a simple one.
The “poor man’s RL” scheme I used here is quite weak (and perhaps poorly tuned). I suspect this is the primary issue, though limitations on the OpenAI finetuning API as you described could also be an issue.
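For readers unfamiliar with the phrase, ‘poor man’s RL’ here presumably means something like rejection sampling / expert iteration: sample completions, keep only the highest-reward ones, and supervised-finetune on the kept set. A sketch of one such round (the structure and `keep_frac` are my assumptions about the scheme, not its actual code):

```python
def poor_mans_rl_round(prompts, sample, reward, keep_frac=0.25):
    """One round of rejection-sampling finetuning (a guess at what
    'poor man's RL' means): score one sampled completion per prompt
    and keep the top fraction as the next supervised-finetuning set."""
    scored = [(reward(p, c), p, c) for p, c in ((p, sample(p)) for p in prompts)]
    scored.sort(key=lambda t: t[0], reverse=True)
    k = max(1, int(len(scored) * keep_frac))
    return [(p, c) for _, p, c in scored[:k]]
```

Under a scheme like this, the gradient signal toward a rare behavior such as a steganographic code is only as strong as that behavior’s frequency among the kept samples, which would explain why it is much weaker than proper PPO.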