Yep, I think the thing you propose (only training next-token-prediction on human-written text) would probably work great to prevent sophisticated steganography. Let’s try to calculate how much it would cost though.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
I guess it’s doable. At least for something like one trillion tokens; you probably just can’t do fifty trillion tokens even if you were the CCP.
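Here is the back-of-envelope arithmetic above as a rough sketch. The one number that is not in my comment is the pay per expert: I assume $100,000 total per expert, which is just what you get from dividing a hundred billion dollars by a million experts. It ignores the oversight overhead entirely, so treat the outputs as lower bounds under those assumptions, not as real estimates.

```python
def expert_text_cost(total_tokens, tokens_per_expert, pay_per_expert_usd):
    """Return (experts needed, total pay in USD) for a human-written corpus.

    All inputs are integers. pay_per_expert_usd is an assumed figure,
    not something stated in the discussion.
    """
    # Ceiling division: a partial expert still has to be hired.
    experts = -(-total_tokens // tokens_per_expert)
    return experts, experts * pay_per_expert_usd


TOKENS_PER_EXPERT = 10**6      # "a million tokens each"
PAY_PER_EXPERT_USD = 100_000   # assumption: $100B / 1M experts

for label, total in [("1 trillion", 10**12), ("50 trillion", 50 * 10**12)]:
    experts, cost = expert_text_cost(total, TOKENS_PER_EXPERT, PAY_PER_EXPERT_USD)
    print(f"{label} tokens: {experts:,} experts, ${cost:,} in pay")
```

Under these assumptions the one-trillion case comes out to a million experts and a hundred billion dollars, and the fifty-trillion case to fifty million experts and five trillion dollars, which is why I think the latter is out of reach.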
...but anyhow good luck convincing labs to do this. :/ In general I think there are lots of promising strategies for solving technical alignment and control problems. And people might even be willing to pay the costs to implement these strategies, if it were uncontroversially clear that the strategy would work and the default (which is much cheaper and faster) would not. But instead it’ll be muddled and controversial, the costly workable strategy will be untested, and no one will want to pay the costs to go test it, at least not for now.
As for using paraphrasing to detect steganography: Yes, I’m hopeful about that idea and have been trying to get people to run experiments on it and see the extent to which it works in practice.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I’m betraying my ignorance here and this idea doesn’t make sense for some reason?
I was editing my comment a fair amount; perhaps you read an old version of it?
And you don’t need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it’s nearly as good as the original ChatGPT, I think you should be good to go.
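The dataset-building step of that experiment could look something like the sketch below. Everything specific here is an assumption for illustration: the `rating` field standing in for "known to be high quality", the threshold, and the prompt/completion JSONL layout. It doesn't do any training or evaluation; it only shows how the stand-in "expert" data might be assembled.

```python
import json


def build_sft_records(samples, min_rating):
    """Keep responses rated at least min_rating as prompt/completion pairs.

    `samples` is a list of dicts with hypothetical keys
    "prompt", "response", and "rating".
    """
    return [
        {"prompt": s["prompt"], "completion": s["response"]}
        for s in samples
        if s["rating"] >= min_rating
    ]


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


# Toy stand-ins for collected ChatGPT responses with quality ratings.
samples = [
    {"prompt": "Explain X.", "response": "Step by step: ...", "rating": 5},
    {"prompt": "Explain Y.", "response": "Not sure.", "rating": 2},
]
records = build_sft_records(samples, min_rating=4)
print(to_jsonl(records))
```

The resulting file would then be used for ordinary next-token-prediction fine-tuning in place of RLHF, and the fine-tuned model compared against the original.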
I said “if you want to do the equivalent of a modern large training run.” If your intervention is just a smaller fine-tuning run on top of a standard LLM, then that’ll be proportionately cheaper. And that might be good enough. But maybe we won’t be able to get to AGI that way.
Worth a shot though.