Sorry, I think I must have misunderstood your comment. When you wrote:
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we’d be in pretty damn good shape from an AI control perspective.
I interpreted this to mean that in addition to weak forward passes, there was another thing called “robust faithful CoT properties” that would also need to be achieved.
I now think you meant to indicate that “weak forward passes” was a particular method for achieving “robust faithful CoT properties”.
Ah, I think I see now how I was being unclear; thanks for that question. It may help for me to go through the original quote and say what I do and don’t agree with:
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can’t do invisible non-trivial consequentialist reasoning, most misalignment threat models don’t apply. (Scheming/deceptive alignment and cleverly playing the training game both don’t apply.)
I agree with everything in this quote; however, I think that “assuming that we can also rule out steganography and similar concerns” is a big and unwarranted assumption, for reasons mentioned above. Our training schemes will probably incentivize steganography and subtle strategic bias.
If I understand correctly, sophisticated steganography will not be incentivized in a language model that’s just trained on next-token prediction (no RLHF). Is that right? (Since the model didn’t generate any of the text in its context window, there’s no method it could use to pass info to itself. The best it could do is pick up on weak-to-nonexistent human-created steganography in its dataset.)
That suggests a simple method to prevent steganography: Instead of having human raters rate RLHF-generated responses, have humans write responses themselves, then finetune on the human-written responses.
That might sound labor-intensive, but I don’t think it has to be. Instead of scoring your AI’s response, you check if the AI-written response is adequate. If it’s adequate, you do nothing. If it’s inadequate, you send the question to an expert who writes an expert-level response, then add that response to the finetuning dataset.
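To make the triage loop above concrete, here is a minimal sketch. Every name in it (`collect_finetuning_examples`, `model_answer`, `is_adequate`, `expert_answer`) is hypothetical, and the callables stand in for whatever model, reviewer, and expert workflow you actually have; it only illustrates the control flow, not a real pipeline.

```python
from typing import Callable, Dict, Iterable, List


def collect_finetuning_examples(
    prompts: Iterable[str],
    model_answer: Callable[[str], str],       # hypothetical: queries the current model
    is_adequate: Callable[[str, str], bool],  # hypothetical: a cheap adequacy check
    expert_answer: Callable[[str], str],      # hypothetical: an expert writes a response
) -> List[Dict[str, str]]:
    """Collect expert-written examples only for prompts the model handled badly."""
    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        draft = model_answer(prompt)
        if is_adequate(prompt, draft):
            # Adequate response: do nothing, so no model-written text is trained on.
            continue
        # Inadequate response: discard the draft and store the expert's own answer.
        dataset.append({"prompt": prompt, "response": expert_answer(prompt)})
    return dataset
```

The point of the design is that the model's own text never enters the finetuning set; the adequacy check only decides where expert effort gets spent.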
Sure, experts are expensive, but so are massive training runs. We know data quality is important at the language corpus stage. Why wouldn’t it be equally important at the finetuning stage? (Also, I’m not sure experts are that expensive, given the glut of PhDs in many fields. And expertise in one field could transfer to others, if you split the expert task into 2 steps: identification of high-quality sources, and use of those sources to answer the user’s query.)
In any case, I imagine it should be possible to detect steganography to an extent, by adding noise to the contents of the context window and checking if that causes unexpectedly high performance degradation. You could have a second AI that creates a paraphrased view of everything in the context window except the most recent 20 words or so—I expect that’s a massive blow to steganography in practice. Note that the cost of such added-noise schemes only needs to be paid at training time, so they can be expensive—e.g. ensembling across the next-token generated from many different distortions of the current text.
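A rough sketch of that check, under loud assumptions: `paraphrase` is a hypothetical stand-in for the second AI, `score` is a hypothetical task metric for the model under test, and `expected_drop` and `tolerance` are baselines you would have to establish yourself (for example on text known to be human-written). None of these come from an existing library.

```python
from statistics import mean
from typing import Callable, Sequence


def paraphrase_context(
    context: str, paraphrase: Callable[[str], str], keep_last_words: int = 20
) -> str:
    """Paraphrase everything except the most recent `keep_last_words` words."""
    words = context.split()  # note: this normalizes whitespace
    split = len(words) - keep_last_words
    if split <= 0:
        return context  # nothing older than the protected tail
    head = " ".join(words[:split])
    tail = " ".join(words[split:])
    parts = [paraphrase(head)] + ([tail] if tail else [])
    return " ".join(parts)


def degradation(
    contexts: Sequence[str],  # must be non-empty
    score: Callable[[str], float],
    paraphrase: Callable[[str], str],
    keep_last_words: int = 20,
) -> float:
    """Mean drop in task score when the older context is paraphrased."""
    return mean(
        score(c) - score(paraphrase_context(c, paraphrase, keep_last_words))
        for c in contexts
    )


def looks_steganographic(drop: float, expected_drop: float, tolerance: float) -> bool:
    """Flag a drop that exceeds the assumed baseline by more than the tolerance."""
    return drop > expected_drop + tolerance
```

A faithful paraphrase should cost little if the model only relies on the meaning of its context; a large drop relative to the baseline is weak evidence that it was relying on something the paraphrase destroyed.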
Yep, I think the thing you propose (only training next-token-prediction on human-written text) would probably work great to prevent sophisticated steganography. Let’s try to calculate how much it would cost though.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
I guess it’s doable. At least for something like one trillion tokens; you probably just can’t do fifty trillion tokens even if you were the CCP.
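For what it's worth, here is the back-of-the-envelope arithmetic spelled out, using only the round numbers above. The function names are made up for illustration, the fifty-trillion figure assumes cost scales linearly with tokens, and the bureaucratic overhead is left out entirely.

```python
EXPERTS = 1_000_000
TOKENS_PER_EXPERT = 1_000_000
TOTAL_PAY_USD = 100_000_000_000  # "a hundred billion dollars minimum"


def total_tokens() -> int:
    """Tokens produced by all experts together."""
    return EXPERTS * TOKENS_PER_EXPERT


def pay_per_expert_usd() -> int:
    """Implied pay per expert over the whole project."""
    return TOTAL_PAY_USD // EXPERTS


def cents_per_token() -> int:
    """Implied cost per expert-written token, in cents."""
    return TOTAL_PAY_USD * 100 // total_tokens()


def pay_for_tokens_usd(tokens: int) -> int:
    """Pay for a given token count, assuming cost scales linearly (an assumption)."""
    return tokens * TOTAL_PAY_USD // total_tokens()


if __name__ == "__main__":
    print(total_tokens())          # one trillion tokens
    print(pay_per_expert_usd())    # dollars per expert
    print(cents_per_token())       # cents per token
    print(pay_for_tokens_usd(50 * 10**12))  # fifty trillion tokens
```

So the estimate amounts to roughly $100k per expert and about ten cents per token, and fifty trillion tokens at the same rate would be on the order of five trillion dollars.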
...but anyhow good luck convincing labs to do this. :/ In general I think there are lots of promising strategies for solving technical alignment and control problems. And people might even be willing to pay the costs to implement these strategies, if it was uncontroversially clear that the strategy would work and the default (which is much cheaper and faster) would not work. But instead it’ll be muddled and controversial and the costly workable strategy will be untested, and no one will want to pay the costs to go test it, at least not for now.
As for using paraphrasing to detect steganography: Yes, I’m hopeful about that idea and have been trying to get people to run experiments on it and see the extent to which it works in practice.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I’m betraying my ignorance here and this idea doesn’t make sense for some reason?
I was editing my comment a fair amount, perhaps you read an old version of it?
And, in terms of demonstrating feasibility, you don’t need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it’s nearly as good as the original ChatGPT, I think you should be good to go.
I said “if you want to do the equivalent of a modern large training run.” If your intervention is just a smaller fine-tuning run on top of a standard LLM, then that’ll be proportionately cheaper. And that might be good enough. But maybe we won’t be able to get to AGI that way.
Worth a shot though.