I think those objections are important to mention and discuss, but they don’t undermine the conclusion significantly.
AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP’s argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we’d be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don’t think it undermines the OP’s points though? We are not currently on a path to have robust faithful CoT properties by default.
This post seemed overconfident in a number of places, so I was quickly pushing back in those places.
I also think the conclusion of “Nearly No Data” is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong, and the best way to obtain data might look more like running careful model organism experiments than observing properties of ChatGPT). But it didn’t seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.
If this post argued “the fact that current chat bots trained normally don’t seem to exhibit catastrophic misalignment isn’t much evidence about catastrophic misalignment in more powerful systems”, then I wouldn’t think this was overstated (though this also wouldn’t be very original). But, it makes stronger claims which seem false to me.
Mm, I concede that this might not have been the most accurate title. I might’ve let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.
My core point is something like “the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI’s cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for”.
I agree that the current training setups provide some data on how e.g. optimization pressures / reinforcement schedules / SGD biases work, and I even think that shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.
What kind of citation are you looking for? Are you basically just asking me to provide evidence, or are you asking me to make an object-level argument (as opposed to e.g. an appeal to authority)? Or something else entirely, e.g. a publication?
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.
If the argument hasn’t been made somewhere, perhaps you could write a short post making that argument. Could be a good way to either catalyze research in the area (you stated that you wish to encourage such research), or else convince people that the challenge is insurmountable and a different approach is needed.
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.
You may have interpreted it that way, but I certainly don’t follow a policy of prefacing everything with “in my opinion” unless I have a citation ready. I bet you don’t either. Claims are by default just claims, not claims-presented-as-established-facts. If I wanted to present it as an established fact I would have done something to indicate that, e.g. cited something or said “it is well-known that...”
Anyhow, I’m happy to defend the claim here. It would help if I knew where to begin. I’ll just briefly link a couple things here to get started and then we can zoom in on whatever bits you are most interested in.
First: There are probably going to be incentives for AIs to conceal their thoughts sometimes. Sometimes this will allow them to perform better in training, for example. Link to example post making this point, though many others have made it also.
Second: Some AI designs involve a natural language bottleneck; the only way for the system to communicate with its future self is via outputting tokens that the future self then reads as input. This gives us hope that it’ll conveniently-for-us put all the important thoughts in natural language, so that e.g. it won’t leave out anything important such as “the reason I’m telling you this is not because I’m trying to help you but rather because I think this’ll cause me to score more highly” etc. Here’s a post articulating this hope.
Third: Alas, the incentives mentioned above will probably crush this hope. Maybe if we didn’t do RLHF for example, and just used pre-trained models with clever prompts, we’d be fine. But by default we’ll apply RLHF to the whole chain of thought, and not leave any secret scratchpad, for example. The post above talks about this briefly.
Fourth: Even if that wasn’t true, the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck. My guess is that they’ll come into play right around the time we hit AGI.
My positive hope is that we can basically draw attention to this issue and figure out what sorts of training schemes create & preserve faithful CoT properties, and which don’t (seems pretty doable to me; it just takes some engineering and scientific legwork), and then coordinate governments and labs to resist the competitive pressure to train using schemes that don’t have those properties (also doable, if we had a well-established science of those properties, which we don’t yet).
the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck
This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)
What existing alternative architectures do you have in mind? I guess mamba would be one?
Do you think it’s realistic to regulate this? E.g. requiring that above a certain size, models can’t have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)
I think it would be realistic to regulate this if the science of faithful CoT were better developed: for example, if there were lots of impressive papers to cite about CoT faithfulness, and lots of whitepapers arguing for the importance of faithfulness to alignment and safety.
As it is, it seems unlikely to be politically viable… but maybe it’s still worth a shot?
Yeah, true. But it’s also easier to do early, when no one is that invested in the hidden-recurrence architectures, so there’s less resistance and it doesn’t break anyone’s plans.
Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won’t see clear differences yet.) The hard part would be finding the right tasks.
Agreed. I was working on this for six months and I’ve been trying to get more people to work on it.
As far as I know, we don’t have a general way of measuring CoT faithfulness. But your emphasis on “tasks where we can evaluate...” is intriguing to me: you are saying it may be feasible today, for some tasks at least. What tasks do you have in mind?
Now that I thought about it, for this particular transformers vs mamba experiment, I’d go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:
x = 5
x += 2
x *= 3
x **= 2
x -= 3
...
and then have a CoT:
after x = 5
5
after x += 2
7
...
And then we intervene on the CoT to introduce an error in one operation, but still ask the model to give the correct answer at the end (even though every CoT step after the error is now unreliable). We can go even further and train the models to give the correct answer after an inadequate CoT, with a curriculum where at first the model only needs to do one hidden operation, later two, and so on.
(It’s an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)
Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won’t have a limit.
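For concreteness, here is a rough sketch of how examples for this task could be generated. The function names, the fixed set of operations, and the choice to let the wrong value propagate through the later CoT steps are all assumptions made for illustration, not part of the proposal above. The `x **= 2` step is left out so that every operation is injective, which guarantees that the corrupted CoT ends on a value different from the true answer.

```python
import random

# Hypothetical operation set: each key is the line shown in the prompt.
OPS = {
    "x += 2": lambda x: x + 2,
    "x *= 3": lambda x: x * 3,
    "x -= 3": lambda x: x - 3,
}

def make_program(n_steps, rng, start=5):
    """Build 'x = start' followed by n_steps randomly chosen operations."""
    return [f"x = {start}"] + [rng.choice(list(OPS)) for _ in range(n_steps)]

def run(program):
    """Return the true value of x after every line of the program."""
    values = [int(program[0].split("=")[1])]
    for line in program[1:]:
        values.append(OPS[line](values[-1]))
    return values

def chain_of_thought(program, error_step=None, offset=1):
    """Render the CoT in the 'after <line>' / '<value>' format.

    If error_step is given, the value written at that step is off by
    `offset`, and later steps are computed from the wrong value
    (an assumption about how the intervention is done).
    """
    values = [int(program[0].split("=")[1])]
    for i, line in enumerate(program[1:], start=1):
        value = OPS[line](values[-1])
        if i == error_step:
            value += offset
        values.append(value)
    return "\n".join(f"after {line}\n{v}" for line, v in zip(program, values))

def make_example(n_steps, hidden_ops, rng):
    """One training example where the last `hidden_ops` CoT steps are unreliable."""
    if not 1 <= hidden_ops <= n_steps:
        raise ValueError("hidden_ops must be between 1 and n_steps")
    program = make_program(n_steps, rng)
    error_step = n_steps - hidden_ops + 1
    return {
        "prompt": "\n".join(program),
        "cot": chain_of_thought(program, error_step),
        "answer": run(program)[-1],  # target is the true final value
    }
```

A curriculum would then call `make_example(n_steps, hidden_ops, rng)` with `hidden_ops` increasing from 1, so the model has to redo progressively more operations without help from the visible CoT.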
I was working on this for six months
Can you say what you tried in these six months and how it went?
Sorry, I think I must have misunderstood your comment. When you wrote:
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we’d be in pretty damn good shape from an AI control perspective.
I interpreted this to mean that in addition to weak forward passes, there was another thing called “robust faithful CoT properties” that would also need to be achieved.
I now think you meant to indicate that “weak forward passes” was a particular method for achieving “robust faithful CoT properties”.
Ah, I think I see now how I was being unclear; thanks for that question. It may help for me to go through the original quote and say what I do and don’t agree with:
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can’t do invisible non-trivial consequentialist reasoning, most misalignment threat models don’t apply. (Scheming/deceptive alignment and cleverly playing the training game both don’t apply.)
I agree with everything in this quote; however, I think that “assuming that we can also rule out steganography and similar concerns” is a big and unwarranted assumption, for reasons mentioned above. Our training schemes will probably incentivize steganography and subtle strategic bias.
If I understand correctly, sophisticated steganography will not be incentivized in a language model that’s just trained on next-token prediction (no RLHF). Is that right? (Since the model didn’t generate any of the text in its context window, there’s no method it could use to pass info to itself. The best it could do is pick up on weak-to-nonexistent human-created steganography in its dataset.)
That suggests a simple method to prevent steganography: Instead of having human raters rate RLHF-generated responses, have humans write responses themselves, then finetune on the human-written responses.
That might sound labor-intensive, but I don’t think it has to be. Instead of scoring your AI’s response, you check if the AI-written response is adequate. If it’s adequate, you do nothing. If it’s inadequate, you send the question to an expert who writes an expert-level response, then add that response to the finetuning dataset.
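The adequacy-check loop described above could be sketched roughly as follows. The three callables (`generate`, `is_adequate`, `ask_expert`) are hypothetical placeholders for the model, the human check, and the expert; nothing here is a real library API.

```python
from typing import Callable, Dict, Iterable, List

def collect_finetuning_data(
    questions: Iterable[str],
    generate: Callable[[str], str],           # hypothetical: the AI's draft answer
    is_adequate: Callable[[str, str], bool],  # hypothetical: cheap human check
    ask_expert: Callable[[str], str],         # hypothetical: expert-written answer
) -> List[Dict[str, str]]:
    """Collect expert-written responses only for questions the AI answered badly."""
    dataset = []
    for question in questions:
        draft = generate(question)
        if is_adequate(question, draft):
            continue  # adequate: do nothing
        # Inadequate: the expert writes the response from scratch, so no
        # model-generated text enters the finetuning set.
        dataset.append({"prompt": question, "response": ask_expert(question)})
    return dataset
```

The point of the design is that the model's own output is only ever judged, never trained on, which is what removes the channel for passing information to itself.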
Sure, experts are expensive, but so are massive training runs. We know data quality is important at the language corpus stage. Why wouldn’t it be equally important at the finetuning stage? (Also, I’m not sure experts are that expensive, given the glut of PhDs in many fields. And expertise in one field could transfer to others, if you split the expert task into 2 steps: identification of high-quality sources, and use of those sources to answer the user’s query.)
In any case, I imagine it should be possible to detect steganography to an extent, by adding noise to the contents of the context window and checking if that causes unexpectedly high performance degradation. You could have a second AI that creates a paraphrased view of everything in the context window except the most recent 20 words or so—I expect that’s a massive blow to steganography in practice. Note that the cost of such added-noise schemes only needs to be paid at training time, so they can be expensive—e.g. ensembling across the next-token generated from many different distortions of the current text.
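A minimal sketch of the paraphrase check, assuming a hypothetical `paraphrase` function (the second AI) and a hypothetical `score` function that measures task performance given a context. What counts as "unexpectedly high" degradation would need a baseline from a model known not to use steganography, which this sketch does not provide.

```python
from typing import Callable, Iterable

def paraphrase_all_but_tail(
    context: str,
    paraphrase: Callable[[str], str],  # hypothetical: the second AI
    keep_words: int = 20,
) -> str:
    """Paraphrase everything except the most recent `keep_words` words."""
    words = context.split()
    if len(words) <= keep_words:
        return context  # nothing old enough to paraphrase
    head, tail = words[:-keep_words], words[-keep_words:]
    # Note: the untouched tail is re-joined with single spaces.
    return paraphrase(" ".join(head)) + " " + " ".join(tail)

def mean_degradation(
    contexts: Iterable[str],
    score: Callable[[str], float],     # hypothetical: task performance given a context
    paraphrase: Callable[[str], str],
    keep_words: int = 20,
) -> float:
    """Average drop in score when the older context is replaced by a paraphrase."""
    drops = [
        score(c) - score(paraphrase_all_but_tail(c, paraphrase, keep_words))
        for c in contexts
    ]
    return sum(drops) / len(drops)
```

If the model's performance depends on the exact wording of its earlier tokens rather than their meaning, `mean_degradation` should come out noticeably larger than for an honest baseline.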
Yep, I think the thing you propose (only training next-token-prediction on human-written text) would probably work great to prevent sophisticated steganography. Let’s try to calculate how much it would cost though.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
I guess it’s doable. At least for something like one trillion tokens; you probably just can’t do fifty trillion tokens even if you were the CCP.
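The back-of-envelope arithmetic can be written out as below. The token counts come from the estimate above; the $100,000 total pay per expert is an assumption chosen only because it reproduces the "hundred billion dollars" figure, and the function name is made up for illustration.

```python
def expert_corpus_cost(total_tokens, tokens_per_expert, cost_per_expert):
    """Return (number of experts needed, total pay in dollars)."""
    n_experts = total_tokens // tokens_per_expert
    return n_experts, n_experts * cost_per_expert

# One trillion tokens, a million tokens per expert.
# ASSUMPTION: $100,000 total pay per expert (not stated in the thread).
experts_1t, cost_1t = expert_corpus_cost(10**12, 10**6, 100_000)

# Fifty trillion tokens at the same rate per expert.
experts_50t, cost_50t = expert_corpus_cost(50 * 10**12, 10**6, 100_000)

print(experts_1t, cost_1t)    # 1000000 100000000000
print(experts_50t, cost_50t)  # 50000000 5000000000000
```

Under those assumptions, one trillion tokens needs a million experts and about $100 billion, while fifty trillion tokens needs fifty million experts and about $5 trillion, before any overhead for quality control.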
...but anyhow good luck convincing labs to do this. :/ In general I think there are lots of promising strategies for solving technical alignment and control problems. And people might even be willing to pay the costs to implement these strategies, if it was uncontroversially clear that the strategy would work and the default (which is much cheaper and faster) would not work. But instead it’ll be muddled and controversial and the costly workable strategy will be untested, and no one will want to pay the costs to go test it, at least not for now.
As for using paraphrasing to detect steganography: Yes, I’m hopeful about that idea and have been trying to get people to run experiments on it and see the extent to which it works in practice.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I’m betraying my ignorance here and this idea doesn’t make sense for some reason?
I was editing my comment a fair amount, perhaps you read an old version of it?
And, in terms of demonstrating feasibility, you don’t need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it’s nearly as good as the original ChatGPT, I think you should be good to go.
I said “if you want to do the equivalent of a modern large training run.” If your intervention is just a smaller fine-tuning run on top of a standard LLM, then that’ll be proportionately cheaper. And that might be good enough. But maybe we won’t be able to get to AGI that way.
OK, that seems reasonable to me.
Is there a citation for this?
Unfortunately I didn’t have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
https://arxiv.org/pdf/2305.04388.pdf
https://arxiv.org/pdf/2307.13702.pdf
Worth a shot though.