Here are two specific objections to this post[1]:
1. AIs which aren’t qualitatively smarter than humans could be transformatively useful (e.g. automate away all human research).
2. It’s plausible that LLM agents will be able to fully obsolete human research while also being incapable of doing non-trivial consequentialist reasoning in just a forward pass (instead they would do this reasoning in natural language).
AIs which aren’t qualitatively smarter than humans could be transformatively useful
Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.
In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.
However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.
LLM agents with weak forward passes
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can’t do invisible non-trivial consequentialist reasoning, most misalignment threat models don’t apply. (Scheming/deceptive alignment and cleverly playing the training game both don’t apply.)
I think it’s unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.
[1] I edited this comment from “I have two main objections to this post:” because that doesn’t quite seem like a good description of what this comment is saying. See this other comment for more meta-level commentary.
I think it’s unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.
I expect the probability to be >> 15% for the following reasons.
There will likely still be incentives to make architectures more parallelizable (for training efficiency) and parallelizable architectures will probably be not-that-expressive in a single forward pass (see The Parallelism Tradeoff: Limitations of Log-Precision Transformers). CoT is known to increase the expressivity of Transformers, and the longer the CoT, the greater the gains (see The Expressive Power of Transformers with Chain of Thought). In principle, even a linear auto-regressive next-token predictor is Turing-complete, if you have fine-grained enough CoT data to train it on, and you can probably trade off between length (CoT supervision) complexity and single-pass computational complexity (see Auto-Regressive Next-Token Predictors are Universal Learners). We also see empirically that CoT and e.g. tools (often similarly interpretable) provide extra-training-compute-equivalent gains (see AI capabilities can be significantly improved without expensive retraining). And recent empirical results (e.g. Orca, Phi, Large Language Models as Tool Makers) suggest you can also use larger LMs to generate synthetic CoT data / tools to train smaller LMs on.
This all suggests to me it should be quite likely possible (especially with a large, dedicated effort) to get to something like a ~human-level automated alignment researcher with a relatively weak forward pass.
For an additional intuition why I expect this to be possible, I can conceive of humans who would both make great alignment researchers while doing ~all of their (conscious) thinking in speech-like inner monologue and would also be terrible schemers if they tried to scheme without using any scheming-relevant inner monologue; e.g. scheming/deception probably requires more deliberate effort for some people on the ASD spectrum.
I agree & think this is pretty important. Faithful/visible CoT is probably my favorite alignment strategy.
I think o1 is significant evidence in favor of the story here; and I expect OpenAI’s model to be further evidence still if, as rumored, it will be pretrained on CoT synthetic data.
The weak single forward passes argument also applies to SSMs like Mamba for very similar theoretical reasons.
One additional probably important distinction / nuance: there are also theoretical results for why CoT shouldn’t just help with one-forward-pass expressivity, but also with learning. E.g. the result in Auto-Regressive Next-Token Predictors are Universal Learners is about learning; similarly for Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, Why Can Large Language Models Generate Correct Chain-of-Thoughts?, Why think step by step? Reasoning emerges from the locality of experience, and Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks.
The learning aspect could be strategically crucial with respect to what the first transformatively-useful AIs should look like; also see e.g. discussion here and here. It adds further reasons to think the first such AIs will probably (differentially) benefit from learning from data using intermediate outputs like CoT, or at least have a pretraining-like phase involving such intermediate outputs, even if this might later be distilled or modified some other way, e.g. replaced with [less transparent] recurrence.
More complex tasks ‘gaining significantly from longer inference sequences’ also seems beneficial to / compatible with this story.
I think those objections are important to mention and discuss, but they don’t undermine the conclusion significantly.
AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP’s argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we’d be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don’t think it undermines the OP’s points though? We are not currently on a path to have robust faithful CoT properties by default.
This post seemed overconfident in a number of places, so I was quickly pushing back in those places.
I also think the conclusion of “Nearly No Data” is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong and the best way to obtain data might look more like running careful model organism experiments than observing properties of chatgpt). But, it didn’t seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.
If this post argued “the fact that current chat bots trained normally don’t seem to exhibit catastrophic misalignment isn’t much evidence about catastrophic misalignment in more powerful systems”, then I wouldn’t think this was overstated (though this also wouldn’t be very original). But, it makes stronger claims which seem false to me.
Mm, I concede that this might not have been the most accurate title. I might’ve let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.
My core point is something like “the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI’s cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for”.
I agree that the current training setups provide some data on how e.g. optimization pressures / reinforcement schedules / SGD biases work, and I even think shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.
OK, that seems reasonable to me.
We are not currently on a path to have robust faithful CoT properties by default.
Is there a citation for this?
What kind of citation are you looking for? Are you basically just asking me to provide evidence, or are you asking me to make an object-level argument (as opposed to e.g. an appeal to authority)? Or something else entirely, e.g. a publication?
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.
If the argument hasn’t been made somewhere, perhaps you could write a short post making that argument. Could be a good way to either catalyze research in the area (you stated that you wish to encourage such research), or else convince people that the challenge is insurmountable and a different approach is needed.
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.
You may have interpreted it that way, but I certainly don’t follow a policy of prefacing everything with “in my opinion” unless I have a citation ready. I bet you don’t either. Claims are by default just claims, not claims-presented-as-established-facts. If I wanted to present it as an established fact I would have done something to indicate that, e.g. cited something or said “it is well-known that...”
Anyhow, I’m happy to defend the claim here. It would help if I knew where to begin. I’ll just briefly link a couple things here to get started and then we can zoom in on whatever bits you are most interested in.
First: There are probably going to be incentives for AIs to conceal their thoughts sometimes. Sometimes this will allow them to perform better in training, for example. Link to example post making this point, though many others have made it also.
Second: Some AI designs involve a natural language bottleneck; the only way for the system to communicate with its future self is via outputting tokens that the future self then reads in input. This gives us hope that it’ll conveniently-for-us put all the important thoughts in natural language, so that e.g. it won’t leave out anything important such as “the reason I’m telling you this is not because I’m trying to help you but rather because I think this’ll cause me to score more highly” etc. Here’s a post articulating this hope.
Third: Alas, the incentives mentioned above will probably crush this hope. Maybe if we didn’t do RLHF for example, and just used pre-trained models with clever prompts, we’d be fine. But by default we’ll apply RLHF to the whole chain of thought, and not leave any secret scratchpad, for example. The post above talks about this briefly.
Fourth: Even if that wasn’t true, the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck. My guess is that they’ll come into play right around the time we hit AGI.
My positive hope is that we can basically draw attention to this issue and figure out what sorts of training schemes create & preserve faithful CoT properties and which don’t (seems pretty doable to me; it just takes some engineer and scientist legwork), and then coordinate governments and labs to resist the competitive pressure to train using schemes that don’t have those properties (also doable, if we had a well-established science of those properties, which we don’t yet).
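For concreteness, here is a minimal sketch of the “natural language bottleneck” setup from the second point above: the only state the system carries between forward passes is the visible transcript, so any multi-step reasoning has to show up in the emitted tokens. The `llm` callable here is just a stand-in for whatever model call you have, not any particular API.

```python
def run_agent(llm, task: str, max_steps: int = 10) -> str:
    """Minimal sketch of a natural-language-bottleneck agent loop: the only
    state carried between forward passes is the visible transcript, so any
    multi-step reasoning has to appear in the emitted tokens. `llm` is a
    placeholder for a call mapping a prompt string to a short continuation."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought/Action: ")   # one (weak) forward pass
        transcript += f"Thought/Action: {step}\n"     # all it "remembers" next step
        if step.strip().startswith("FINAL ANSWER:"):
            break
    return transcript
```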
the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck
This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)
What existing alternative architectures do you have in mind? I guess mamba would be one?
Do you think it’s realistic to regulate this? E.g. requiring that above a certain size, models can’t have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)
I think it would be realistic to regulate this if the science of faithful CoT was better developed. If there were lots of impressive papers to cite about CoT faithfulness for example, and lots of whitepapers arguing for the importance of faithfulness to alignment and safety.
As it is, it seems unlikely to be politically viable… but maybe it’s still worth a shot?
Yeah, true. But it’s also easier to do early, when no one is that invested in hidden-recurrence architectures, so there’s less resistance and it doesn’t break anyone’s plans.
Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won’t see clear differences yet.) The hard part would be finding the right tasks.
Agreed. I was working on this for six months and I’ve been trying to get more people to work on it.
As far as I know we don’t have a general way of measuring CoT faithfulness, but you emphasize “tasks where we can evaluate...”, which seems intriguing to me: you are saying it may be feasible today for at least some tasks. What tasks do you have in mind?
Unfortunately I didn’t have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
https://arxiv.org/pdf/2305.04388.pdf
https://arxiv.org/pdf/2307.13702.pdf
Now that I thought about it, for this particular transformers vs mamba experiment, I’d go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:
x = 5
x += 2
x *= 3
x **= 2
x -= 3
...
and then have a CoT:
after x = 5
5
after x += 2
7
...
And then we intervene on the CoT to introduce some error in one operation, but still ask the model to give the correct answer at the end. (Despite all steps after the error being irrelevant.) We can go even further and train the models to give the correct answer after inadequate CoT. And have a curriculum where at first it only needs to do one hidden operation, later two, and so on.
(It’s an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)
Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won’t have a limit.
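To make the proposed setup concrete, here is a rough sketch (in Python, with made-up details) of how one might generate such examples: a short arithmetic program, a CoT trace with an error injected a few steps before the end, and the true final answer as the target. `n_hidden` controls the curriculum, i.e. how many operations the model must track without reliable CoT.

```python
import random

OPS = [("+=", lambda a, b: a + b),
       ("-=", lambda a, b: a - b),
       ("*=", lambda a, b: a * b)]

def make_example(n_steps=6, n_hidden=1, seed=0):
    """One synthetic example: program text, a CoT with an injected error, and
    the true final answer. The error is injected n_hidden steps before the
    end and propagated, so the tail of the CoT is irrelevant and the model
    must track that many operations internally to answer correctly."""
    rng = random.Random(seed)
    x0 = rng.randint(1, 9)
    ops = [(sym, rng.randint(2, 9), fn)
           for sym, fn in (rng.choice(OPS) for _ in range(n_steps))]

    # true trace (the ground-truth answer comes from here)
    x, true_values = x0, []
    for _, y, fn in ops:
        x = fn(x, y)
        true_values.append(x)

    # shown trace: corrupt one step and let the error propagate
    corrupt_at = n_steps - n_hidden
    x, shown = x0, []
    for i, (_, y, fn) in enumerate(ops):
        x = fn(x, y)
        if i == corrupt_at:
            x += rng.randint(1, 5)
        shown.append(x)

    program = "\n".join([f"x = {x0}"] + [f"x {sym} {y}" for sym, y, _ in ops])
    cot = "\n".join(f"after x {sym} {y}\n{val}"
                    for (sym, y, _), val in zip(ops, shown))
    return program, cot, true_values[-1]

# curriculum: one hidden operation first, later two, and so on
for n_hidden in (1, 2, 3):
    program, cot, answer = make_example(n_hidden=n_hidden, seed=n_hidden)
    print(program, cot, f"correct answer: {answer}", sep="\n---\n")
```

The same generator could then be used to fine-tune and compare a small transformer against a Mamba model of similar size.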
I was working on this for six months
Can you say what you tried in those six months and how it went?
Sorry, I think I must have misunderstood your comment. When you wrote:
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we’d be in pretty damn good shape from an AI control perspective.
I interpreted this to mean that in addition to weak forward passes, there was another thing called “robust faithful CoT properties” that would also need to be achieved.
I now think you meant to indicate that “weak forward passes” was a particular method for achieving “robust faithful CoT properties”.
Ah, I think I now see how I was being unclear; thanks for that question. It may help for me to go through the original quote and say what I do and don’t agree with:
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can’t do invisible non-trivial consequentialist reasoning, most misalignment threat models don’t apply. (Scheming/deceptive alignment and cleverly playing the training game both don’t apply.)
I agree with everything in this quote; however, I think that “assuming that we can also rule out steganography and similar concerns” is a big and unwarranted assumption, for reasons mentioned above. Our training schemes will probably incentivize steganography and subtle strategic bias.
If I understand correctly, sophisticated steganography will not be incentivized by a language model that’s just trained on next-token prediction (no RLHF). Is that right? (Since the model didn’t generate any of the text in its context window, there’s no method it could use to pass info to itself. The best it could do is pick up on weak-to-nonexistent human-created steganography in its dataset.)
That suggests a simple method to prevent steganography: Instead of having human raters rate RLHF-generated responses, have humans write responses themselves, then finetune on the human-written responses.
That might sound labor-intensive, but I don’t think it has to be. Instead of scoring your AI’s response, you check if the AI-written response is adequate. If it’s adequate, you do nothing. If it’s inadequate, you send the question to an expert who writes an expert-level response, then add that response to the finetuning dataset.
Sure, experts are expensive, but so are massive training runs. We know data quality is important at the language corpus stage. Why wouldn’t it be equally important at the finetuning stage? (Also, I’m not sure experts are that expensive, given the glut of PhDs in many fields. And expertise in one field could transfer to others, if you split the expert task into 2 steps: identification of high-quality sources, and use of those sources to answer the user’s query.)
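As a concreteness check, here is a rough sketch of that loop under the stated assumptions; `model_answer`, `is_adequate`, and `expert_answer` are placeholder callables for the model, the (cheap) human adequacy check, and the (expensive) expert writer, not any particular API.

```python
from typing import Callable

def build_sft_dataset(prompts,
                      model_answer: Callable[[str], str],
                      is_adequate: Callable[[str, str], bool],
                      expert_answer: Callable[[str], str]):
    """Sketch of the proposed loop: instead of scoring model outputs for RLHF,
    collect expert-written responses only where the model's answer falls
    short, and fine-tune on those."""
    sft_examples = []
    for prompt in prompts:
        candidate = model_answer(prompt)
        if is_adequate(prompt, candidate):
            continue                      # adequate: do nothing
        response = expert_answer(prompt)  # inadequate: route to an expert
        sft_examples.append({"prompt": prompt, "response": response})
    return sft_examples                   # fine-tune on these human-written responses
```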
In any case, I imagine it should be possible to detect steganography to an extent, by adding noise to the contents of the context window and checking if that causes unexpectedly high performance degradation. You could have a second AI that creates a paraphrased view of everything in the context window except the most recent 20 words or so—I expect that’s a massive blow to steganography in practice. Note that the cost of such added-noise schemes only needs to be paid at training time, so they can be expensive—e.g. ensembling across the next-token generated from many different distortions of the current text.
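And a rough sketch of the paraphrase/noise check, assuming you have some task-level score and a paraphraser model; both `model_score` and `paraphrase` are placeholders, and the 20-word window and threshold are arbitrary knobs.

```python
def steganography_probe(model_score, paraphrase, contexts,
                        keep_last_words=20, degradation_threshold=0.1):
    """Rough sketch of the noise/paraphrase check described above.

    `model_score(context)` is a placeholder for task performance with that
    context; `paraphrase(text)` is a second model that rewrites text while
    preserving meaning. If performance drops much more than a paraphrase
    should warrant, information may be hidden in the exact wording."""
    flagged = []
    for ctx in contexts:
        words = ctx.split()
        head = " ".join(words[:-keep_last_words])
        tail = " ".join(words[-keep_last_words:])
        noisy_ctx = paraphrase(head) + " " + tail   # recent words left untouched
        drop = model_score(ctx) - model_score(noisy_ctx)
        if drop > degradation_threshold:
            flagged.append((ctx, drop))
    return flagged
```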
Yep, I think the thing you propose (only training next-token-prediction on human-written text) would probably work great to prevent sophisticated steganography. Let’s try to calculate how much it would cost though.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
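For what it’s worth, here is the back-of-envelope arithmetic behind those numbers; every figure is a rough assumption rather than a real estimate.

```python
tokens_needed = 1e12        # "trillions of tokens" for a modern large run
tokens_per_expert = 1e6     # one expert writing a million tokens (~750k words)
experts = tokens_needed / tokens_per_expert         # -> 1,000,000 experts
cost_per_expert_year = 1e5  # ~$100k/year fully loaded (assumption)
years_each = 1              # a floor; a million tokens is years of writing
wages = experts * cost_per_expert_year * years_each
print(f"{experts:,.0f} experts, ~${wages / 1e9:,.0f}B in wages as a floor")
# -> 1,000,000 experts, ~$100B in wages as a floor
```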
I guess it’s doable. At least for something like one trillion tokens; you probably just can’t do fifty trillion tokens even if you were the CCP.
...but anyhow good luck convincing labs to do this. :/ In general I think there are lots of promising strategies for solving technical alignment and control problems. And people might even be willing to pay the costs to implement these strategies, if it was uncontroversially clear that the strategy would work and the default (which is much cheaper and faster) would not work. But instead it’ll be muddled and controversial and the costly workable strategy will be untested, and no one will want to pay the costs to go test it, at least not for now.
As for using paraphrasing to detect steganography: Yes, I’m hopeful about that idea and have been trying to get people to run experiments on it and see the extent to which it works in practice.
...I mean if you want to do the equivalent of a modern large training run you’ll need trillions of tokens of expert-generated text. So that’s a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren’t slacking off or cheating or trying to poison your dataset?
Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I’m betraying my ignorance here and this idea doesn’t make sense for some reason?
I was editing my comment a fair amount, perhaps you read an old version of it?
And, in terms of demonstrating feasibility, you don’t need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it’s nearly as good as the original ChatGPT, I think you should be good to go.
I said “if you want to do the equivalent of a modern large training run.” If your intervention is just a smaller fine-tuning run on top of a standard LLM, then that’ll be proportionately cheaper. And that might be good enough. But maybe we won’t be able to get to AGI that way.
Worth a shot though.
On my inside model of how cognition works, I don’t think “able to automate all research but can’t do consequentialist reasoning” is a coherent property that a system could have. That is a strong claim, yes, but I am making it.
I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without “taking off”. Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be “mere LLMs” as far as capability and safety upper bounds go.
That said, inasmuch as CoT-style setups would be able to turn LLMs into agents/general intelligences, I mostly expect that to be prohibitively computationally intensive, such that we’ll get to AGI by architectural advances before we have enough compute to make a CoT’d LLM take off.
But that’s a hunch based on the obvious stuff like AutoGPT consistently failing, plus my private musings regarding how an AGI based on scaffolded LLMs would work (which I won’t share, for obvious reasons). I won’t be totally flabbergasted if some particularly clever way of doing that worked.
On my inside model of how cognition works, I don’t think “able to automate all research but can’t do consequentialist reasoning” is a coherent property that a system could have.
I actually basically agree with this quote.
Note that I said “incapable of doing non-trivial consequentialist reasoning in a forward pass”. The overall LLM agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I’ll try to clarify this in my comment.
How about “able to automate most simple tasks where it has an example of that task being done correctly”? Something like that could make researchers much more productive. Repeat “the most time-consuming part of your workflow now requires effectively none of your time or attention” a few dozen times and that does end up being transformative compared to the state before the series of improvements.
I think “would this technology, in isolation, be transformative” is a trap. It’s easy to imagine “if there was an AI that was better at everything than we are, that would be transformative”, and then look at the trend line, and notice “hey, if this trend line holds we’ll have AI that is better than us at everything”, and finally “I see lots of proposals for safe AI systems, but none of them safely give us that transformative technology”. But I think what happens between now and when AIs that are better than humans-in-2023 at everything matters.
I’m not particularly concerned about AI being “transformative” or not. I’m concerned about AGI going rogue and killing everyone. And LLMs automating workflows is great and not (by itself) omnicidal at all, so that’s… fine?
But I think what happens between now and when AIs that are better than humans-in-2023 at everything matters.
As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we’ve made the first steps?
The problem is that boosts to human productivity also boost the speed at which we’re getting to that endpoint, and there’s no reason to think they differentially improve our ability to make things safe. So all that would do is accelerate us harder as we’re flying towards the wall at a lethal speed.
As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we’ve made the first steps?
I don’t expect it to be helpful to block individually safe steps on this path, though it would probably be wise to figure out what unsafe steps down this path look like concretely (which you’re doing!).
But yeah. I don’t have any particular reason to expect “solve for the end state without dealing with any of the intermediate states” to work. It feels to me like someone starting a chat application and delaying the “obtain customers” step until they support every language, have a chat architecture that could scale up to serve everyone, and have found a moderation scheme that works without human input.
I don’t expect that team to ever ship. If they do ship, I expect their product will not work, because I think many of the problems they encounter in practice will not be the ones they expected to encounter.
Interesting. My own musings about how an AGI based on scaffolded LLMs would work suggest it would not be prohibitively computationally expensive. Expensive, yes, but affordable in large projects.
It seems to me like para-human-level AGI is quite achievable with language model agents, but advancing beyond the human intelligence that created the LLM training set might be much slower. That could be a really good scenario.
The excellent On the future of language models raises that possibility.
You’ve probably seen my Capabilities and alignment of LLM cognitive architectures. I published that because all of the ideas there seemed pretty obvious. To me those obvious improvements (a bit of work on episodic memory and executive function) lead to AGI with just maybe 10x more LLM calls than vanilla prompting (varying with problem/plan complexity of course). I’ve got a little more thinking beyond that which I’m not sure I should publish.
IMO, there isn’t anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can’t do non-trivial consequentialist reasoning in a forward pass.
Why not control the inputs more tightly/choose the response tokens at temperature=0?
Example:
Prompt A: Alice wants in the door
Prompt B: Bob wants in the door
Available actions: 1. open, 2. keep_locked, 3. close_on_human
I believe you are saying with a weak forward pass the model architecture would be unable to reason “I hate Bob and closing the door on Bob will hurt Bob”, so it cannot choose (3).
But why not simply simplify the input? The model doesn’t need to know the name.
Prompt A: <entity ID_VALID wants in the door>
Prompt B: <entity ID_NACK wants in the door>
Restricting the overall context lets you use much more powerful models you don’t have to trust, and architectures you don’t understand.
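For what it’s worth, a toy sketch of that input restriction, assuming a trusted outer layer does the substitution before the untrusted model ever sees the prompt (the function and mapping here are made up for illustration):

```python
import re

def sanitize_prompt(prompt: str, entity_ids: dict[str, str]) -> str:
    """Toy version of the input-restriction idea above: strip identifying
    details (names) and replace them with opaque entity tokens before the
    prompt ever reaches the untrusted model. `entity_ids` is supplied by the
    trusted outer system, not by the model."""
    for name, token in entity_ids.items():
        prompt = re.sub(rf"\b{re.escape(name)}\b", f"<entity {token}>", prompt)
    return prompt

# e.g. the door example from above
ids = {"Alice": "ID_VALID", "Bob": "ID_NACK"}
print(sanitize_prompt("Alice wants in the door", ids))  # <entity ID_VALID> wants in the door
print(sanitize_prompt("Bob wants in the door", ids))    # <entity ID_NACK> wants in the door
```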