o1 is a bad idea
This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, making them particularly unsafe compared to what we might otherwise have expected:
The basic argument is that the technology behind o1 doubles down on a reinforcement learning paradigm, which puts us closer to the world where we have to get the value specification exactly right in order to avert catastrophic outcomes.
Additionally, this technology takes us further from interpretability. If you ask GPT4 to produce a chain-of-thought (with prompts such as “reason step-by-step to arrive at an answer”), you know that in some sense, the natural-language reasoning you see in the output is how it arrived at the answer.[1] This is not true of systems like o1. The o1 training rewards any pattern which results in better answers. This can work by improving the semantic reasoning which the chain-of-thought apparently implements, but it can also work by promoting subtle styles of self-prompting. In principle, o1 can learn a new internal language which helps it achieve high reward.
You can tell the RL is done properly when the models cease to speak English in their chain of thought
- Andrej Karpathy
A loss of this type of (very weak) interpretability would be quite unfortunate from a practical safety perspective. Technology like o1 moves us in the wrong direction.
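To make the outcome-only reward point concrete, here is a toy sketch (entirely hypothetical, not OpenAI's training code): the reward inspects only the final answer, so a human-readable reasoning trace and an opaque self-prompting trace are reinforced identically whenever they produce the same answer.

```python
# Toy contrast illustrating outcome-based RL on chain-of-thought.
# Nothing here models o1's real training; it only shows that an
# outcome-only reward is indifferent to what the CoT tokens mean.

def outcome_reward(cot: str, final_answer: str, correct_answer: str) -> float:
    """Reward depends only on the final answer; the CoT is unconstrained."""
    return 1.0 if final_answer == correct_answer else 0.0

# A human-readable trace and an opaque "self-prompting" trace get the
# same reward as long as they lead to the same final answer.
readable = "17 + 25: 17 + 20 = 37, then 37 + 5 = 42."
opaque = "q7@@17%%25::zz42"

assert outcome_reward(readable, "42", "42") == outcome_reward(opaque, "42", "42") == 1.0
```

Any gradient signal derived from this reward pushes toward whatever CoT patterns raise answer accuracy, readable or not.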
Informal Alignment
The basic technology currently seems to have the property that it is “doing basically what it looks like it is doing” in some sense. (Not a very strong sense, but at least, some sense.) For example, when you ask ChatGPT to help you do your taxes, it is basically trying to help you do your taxes.
This is a very valuable property for AI safety! It lets us try approaches like Cognitive Emulation.
In some sense, the Agent Foundations program at MIRI sees the problem as: human values are currently an informal object. We can only get meaningful guarantees for formal systems. So, we need to work on formalizing concepts like human values. Only then will we be able to get formal safety guarantees.
Unfortunately, fully formalizing human values appears to be very difficult. Human values touch upon basically all of the human world, which is to say, basically all informal concepts. So it seems like this route would need to “finish philosophy” by making an essentially complete bridge between formal and informal. (This is, arguably, what approaches such as Natural Abstractions are attempting.)
Approaches similar to Cognitive Emulation lay out an alternative path. Formalizing informal concepts seems hard, but it turns out that LLMs “basically succeed” at importing all of the informal human concepts into a computer. GPT4 does not engage in the sorts of naive misinterpretations which were discussed in the early days of AI safety. If you ask it for a plan to manufacture paperclips, it doesn’t think the best plan would involve converting all the matter in the solar system into paperclips. If you ask for a plan to eliminate cancer, it doesn’t think the extermination of all biological life would count as a success.
We know this comes with caveats; phenomena such as adversarial examples show that the concept-borders created by modern machine learning are deeply inhuman in some ways. The computerized versions of human commonsense concepts are not robust to optimization. We don’t want to naively optimize these rough mimics of human values.
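The "not robust to optimization" point can be illustrated with a toy Goodhart example (all functions here are invented for illustration): a proxy objective that agrees closely with the true objective near typical inputs can still pull a hill-climber away from the true optimum, degrading the true objective.

```python
# Toy Goodhart sketch: a learned proxy matches the true objective on
# "typical" inputs but has a small systematic error an optimizer finds.

def true_value(x: float) -> float:
    return -(x - 1.0) ** 2          # true optimum at x = 1

def proxy_value(x: float) -> float:
    return true_value(x) + 0.5 * x  # small exploitable error term

def hill_climb(f, x=0.0, step=0.01, iters=1000):
    for _ in range(iters):
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
    return x

x_star = hill_climb(proxy_value)      # the optimizer follows the proxy
assert abs(x_star - 1.25) < 0.02      # proxy optimum overshoots x = 1
assert true_value(x_star) < true_value(1.0)  # true objective is worse
```

The toy error term is tiny everywhere, yet optimizing against it systematically sacrifices true value; the concern with optimizing rough mimics of human values is the same shape, at much higher stakes.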
Nonetheless, these “human concepts” seem robust enough to get a lot of useful work out of AI systems, without automatically losing sight of ethical implications such as the preservation of life. This might not be the sort of strong safety guarantee we would like, but it’s not nothing. We should be thinking about ways to preserve these desirable properties going forward. Systems such as o1 threaten this.
[1]
Yes, this is a fairly weak sense. There is a lot of computation under the hood in the big neural network, and we don’t know exactly what’s going on there. However, we also know “in some sense” that the computation there is relatively weak. We also know it hasn’t been trained specifically to cleverly self-prompt into giving a better answer (unlike o1); it “basically” interprets its own chain-of-thought as natural language, the same way it interprets human input.
So, to the extent that the chain-of-thought helps produce a better answer in the end, we can conclude that this is “basically” improved due to the actual semantic reasoning which the chain-of-thought apparently implements. This reasoning can fail for systems like o1.
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo’s doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I’m optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).
It depends on what the alternative is. Here’s a hierarchy:
1. Pretrained models with a bit of RLHF, and lots of scaffolding / bureaucracy / language model programming.
2. Pretrained models trained to do long chains of thought, but with some techniques in place (e.g. paraphraser, shoggoth-face, see some of my docs) that try to preserve faithfulness.
3. Pretrained models trained to do long chains of thought, but without those techniques, such that the CoT evolves into some sort of special hyperoptimized lingo.
4. As above except with recurrent models / neuralese, i.e. no token bottleneck.
Alas, my prediction is that while 1>2>3>4 from a safety perspective, 1<2<3<4 from a capabilities perspective. I predict that the corporations will gleefully progress down this chain, rationalizing why things are fine and safe even as things get substantially less safe due to the changes.
Anyhow I totally agree on the urgency, tractability, and importance of faithful CoT research. I think that if we can do enough of that research fast enough, we’ll be able to ‘hold the line’ at stage 2 for some time, possibly long enough to reach AGI.
Do you have thoughts on how much it helps that autointerp now seems to be roughly human-level on some metrics and can be applied cheaply (e.g. https://transluce.org/neuron-descriptions)? Perhaps we might have another 'defensive line' even past stage 2 (in the linked case, corresponding to the 'granularity' of autointerp applied to the activations of all the MLP neurons inside an LLM).
Later edit: Depending on research progress (especially w.r.t. cost effectiveness), other levels of ‘granularity’ might also become available (fully automated) soon for monitoring, e.g. sparse (SAE) feature circuits (of various dangerous/undesirable capabilities), as demo-ed in Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
I don’t have much to say about the human-level-on-some-metrics thing. evhub’s mechinterp tech tree was good for laying out the big picture, if I recall correctly. And yeah I think interpretability is another sort of ‘saving throw’ or ‘line of defense’ that will hopefully be ready in time.
The good news I’ll share is that some of the most important insights from the safety/alignment work done on LLMs do transfer pretty well to a lot of plausible AGI architectures. So while there’s a little safety loss each time you go from 1 to 4, a lot of the theoretical ways to achieve alignment of these new systems remain intact. The danger here is that the implementation difficulty pushes the safety tax too high, which is a pretty real concern.
Specifically, the insights I’m talking about are the controllability of AI with data, combined with the fact that the RL feedback these systems get is far denser than the feedback evolution gave humans, which significantly constrains instrumental convergence.
I think that IF we could solve alignment for level 2 systems, many of the lessons (maybe most? Maybe if we are lucky all?) would transfer to level 4 systems. However, time is running out to do that before the industry moves on beyond level 2.
Let me explain.
Faithful CoT is not a way to make your AI agent share your values, or truly strive towards the goals and principles you dictate in the Spec. Instead, it’s a way to read your AI’s mind, so that you can monitor its thoughts and notice when your alignment techniques failed and its values/goals/principles/etc. ended up different from what you hoped.
So faithful CoT is not by itself a solution, but it’s a property which allows us to iterate towards a solution.
Which is great! But we’d better start iterating fast because time is running out. When the corporations move on to e.g. level 3 and 4 systems, they will be loath to spend much compute tinkering with legacy level 2 systems, much less train new advanced level 2 systems.
I mostly just agree with your comment here, and IMO the things I don’t exactly get aren’t worth disagreeing too much about, since it’s more in the domain of a reframing than an actual disagreement.
I don’t get why you think this is true? EG, it seems like almost no insights about how to train faithful CoT would transfer to systems speaking pure neuralese. It seems to me that what little safety/alignment we have for LLMs is largely a byproduct of how language-focused they are (which gives us a weak sort of interpretability, a very useful safety resource which we are at risk of losing soon).
To put it a different way: human imitation is a relatively safe outer-optimization target, so we should be concerned to the extent that the industry is moving away from it. It sounds like you think safety lessons from the human-imitation regime generalize beyond the human-imitation regime. Maybe I agree that we can derive some abstract lessons like “RL with really dense feedback can avoid instrumental convergence”, but we’re moving away from the regime where such dense feedback is available, so I don’t see what lessons transfer.
I think the crux is that the important part of LLMs re: safety isn’t their safety properties specifically, but rather the evidence they give about what alignment-relevant properties future AIs will have (note that I’m also using evidence from non-LLM sources, like the MCTS algorithm used for AlphaGo). I also don’t believe interpretability is why LLMs are mostly safe; rather, I think they’re safe due to a combination of incapacity, not having extreme instrumental convergence, and the ability to steer them with data.
Language is a simple example, but one that is generalizable pretty far.
Note that the primary points would apply to a whole lot of AI designs, like MCTS for AlphaGo or many other future architectures which don’t imitate humans, barring ones which prevent you from steering them at all with data, or which have very sparse feedback (which translates into only weakly constrained instrumental convergence).
I think this is a crux, in that I don’t buy o1 as progressing to a regime where we lose so much dense feedback that it’s alignment relevant, because I think sparse-feedback RL will almost certainly be super-uncompetitive with every other AI architecture until well after AI automates all alignment research.
Also, AIs will still have instrumental convergence, it’s just that their goals will be more local and more focused around the training task, so unless the training task rewards global power-seeking significantly, you won’t get it.
[insert standard skepticism about these sorts of generalizations when generalizing to superintelligence]
But what lesson do you think you can generalize, and why do you think you can generalize that?
So, as a speculative example, further along in the direction of o1 you could have something like MCTS help train these things to solve very difficult math problems, with the sparse feedback being given for complete formal proofs.
Similarly, playing text-based video games, with the sparse feedback given for winning.
Similarly, training CoT to reason about code, with sparse feedback given for predictions of the code output.
Etc.
You think these sorts of things just won’t work well enough to be relevant?
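One of these setups is easy to sketch concretely. For the code-reasoning case, an outcome-only reward would check the model's predicted output against actual execution, with no credit assigned to any intermediate reasoning step (a toy sketch, not any lab's actual pipeline):

```python
import io
import contextlib

# Toy sparse reward for the "predict the code's output" setup: the CoT
# gets no feedback at all; reward arrives only at the final prediction.

def run_snippet(code: str) -> str:
    """Execute a (trusted, toy) snippet and capture its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def sparse_reward(code: str, cot: str, predicted_output: str) -> float:
    # `cot` is deliberately ignored: feedback is outcome-only.
    return 1.0 if predicted_output == run_snippet(code) else 0.0

snippet = "print(sum(range(5)))"
assert sparse_reward(snippet, "any reasoning at all", "10") == 1.0
assert sparse_reward(snippet, "any reasoning at all", "11") == 0.0
```

The formal-proof and text-game setups have the same structure: an automatically checkable terminal condition, and nothing shaping the reasoning in between.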
To answer the question:
Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won’t work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.
Other AIs relying on much denser feedback will already rule the world before that happens.
Alright, I’ll give 2 lessons that I do think generalize to superintelligence:
1. The data is a large factor in both the AI’s capabilities and its alignment, and alignment strategies should not ignore the data sources when trying to make predictions or trying to intervene on the AI for alignment purposes.
2. Instrumental convergence in a weak sense will likely exist, because some ability to get more resources is useful for a lot of goals; but the extremely unconstrained version of instrumental convergence often assumed, where an AI grabs so much power as to effectively control humanity, is unlikely to exist, given the constraints and feedback given to the AI.
For 1, the basic answer is that a lot of AI success in fields like Go and language modeling was jumpstarted by good data.
More importantly, I remember this post, and while I think it overstates things in stating that an LLM is just the dataset (it probably isn’t now with o1), it does matter that LLMs are influenced by their data sources.
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
For 2, the basic reason for this is that the strongest capabilities we have seen that come out of RL either require immense amounts of data on pretty narrow tasks, or non-instrumental world models.
This is because constraints prevent you from having to deal with the problem where you produce completely useless RL artifacts, and evolution got around this constraint by accepting far longer timelines and far more computation in FLOPs than the world economy can tolerate.
Ah, I wasn’t thinking “sparse” here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).
I still think o1 is moving towards chess on this spectrum.
Oh, now I understand.
And AIs have already been superhuman at chess for very long, yet that domain gives very little incentive for very strong instrumental convergence.
I am claiming that for practical AIs, the results of training them in the real world with goals will give them instrumental convergence, but without further incentives, will not give them so much instrumental convergence that it leads to power-seeking to disempower humans by default.
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).
LLMs touch on the real world far more than that, such that MCTS-like skill at navigating “the LLM world” in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.
I agree chess is an extreme example, such that I think that more realistic versions would probably develop instrumental convergence at least in a local sense.
(We already have o1 at least capable of a little instrumental convergence.)
My main substantive claim is that constraining instrumental goals such that the AI doesn’t try to take power via long-term methods is very useful for capabilities, and more generally instrumental convergence is an area where there is a positive manifold for both capabilities and alignment, where alignment methods increase capabilities and vice versa.
There’s a regularization problem to solve for 3.9 and 4, and it’s not obvious to me that glee will be enough to solve it (3.9 = “unintelligible CoT”).
I’m not sure how o1 works in detail, but for example, backtracking (which o1 seems to use) makes heavy use of the pretrained distribution to decide on best next moves. So, at the very least, it’s not easy to do away with the native understanding of language. While it’s true that there is some amount of data that will enable large divergences from the pretrained distribution—and I could imagine mathematical proof generation eventually reaching this point, for example—more ambitious goals inherently come with less data, and it’s not obvious to me that there will be enough data in alignment-critical applications to cause such a large divergence.
There’s an alternative version of language invention where the model invents a better language for (e.g.) maths then uses that for more ambitious projects, but that language is probably quite intelligible!
When I imagine models inventing a language, my imagination goes to something like Shinichi Mochizuki’s Inter-universal Teichmüller theory, invented for his supposed proof of the abc conjecture. It is clearly something like mathematical English, and you could say it is “quite intelligible” compared to “neuralese”, but in the end it is not very intelligible.
Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.
Thanks for this response! I agree with the argument. I’m not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).
I’m somewhat surprised by this paragraph. I thought the MIRI position was that they did not in fact predict AIs behaving like this, and the behavior of GPT4 was not an update at all for them. See this comment by Eliezer. I mostly bought that MIRI in fact never worried about AIs going rogue based on naive misinterpretations, so I’m surprised to see Abram saying the opposite now.
Abram, did you disagree about this with others at MIRI, so that the behavior of GPT4 was an update for you but not for them? Or do you think they are misremembering/misconstruing their earlier thoughts on this matter? Or is there a subtle distinction here that I’m missing?
“Misinterpretation” is somewhat ambiguous. It either means not correctly interpreting the intent of an instruction (and therefore also not acting on that intent) or correctly understanding the intent of the instruction while still acting on a different interpretation. The latter is presumably what the outcome pump was assumed to do. LLMs can apparently both understand and act on instructions pretty well. The latter was not at all clear in the past.
I more-or-less agree with Eliezer’s comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn’t know him in 1996). I have a small beef with his bolded “MIRI is always in every instance” claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they’ve ever said while employed at MIRI).
What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I’ve updated on the modern progress of AI in a different way than they have.
For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)
One thing I can say is that I myself often argued using “naive misinterpretation”-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme “the AI will understand what the humans mean, it just won’t care”. I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.
This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.
Getting more specific to what I wrote in the post:
My claim is that modern LLMs are “doing roughly what they seem like they are doing” and “internalize human intuitive concepts”. This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they “roughly are”).
The reason I don’t think this contradicts Eliezer’s bolded claim (“Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model”) is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something vs merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.
This is bad, but perhaps there is a silver lining.
If internal communication within the scaffold appears to be in plain English, it will tempt humans to assume the meaning coincides precisely with the semantic content of the message.
If the chain of thought contains seemingly nonsensical content, it will be impossible to make this assumption.
It seems likely that process supervision was used for o1. I’d be curious to what extent it addresses the concerns here, if a supervision model assesses that each reasoning step is correct, relevant, and human-understandable. Even with process supervision, o1 might give a final answer that essentially ignores the process or uses some self-prompting. But process supervision also feels helpful, especially when the supervising model is more human-like, similar to pre-o1 models.
Process supervision seems like a plausible o1 training approach but I think it would conflict with this:
I think it might just be outcome-based RL, training the CoT to maximize the probability of correct answers or maximize human preference reward model scores or minimize next-token entropy.
It can be both, of course. Start with process supervision but combine it with… something else. It’s hard to learn how to reason from scratch, but it’s also clearly not doing pure strict imitation learning, because the transcripts & summaries are just way too weird to be any kind of straightforward imitation learning of expert transcripts (or even ones collected from users or the wild).
Wouldn’t that conflict with the quote? (Though maybe they’re not doing what they’ve implied in the quote)
My best guess is that there was process supervision for capabilities but not for safety. i.e. training to make the CoT useful for solving problems, but not for “policy compliance or user preferences.” This way they make it useful, and they don’t incentivize it to hide dangerous thoughts. I’m not confident about this though.
Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.
One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.
A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won’t help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn’t avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.
Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can’t be vulnerable to steering by adversarial inputs.
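A minimal version of this defense can be sketched as follows. Everything here is a hypothetical stand-in: `paraphrase` is a trivial synonym substituter (a real paraphraser would itself be a strong model), and `answer_from_cot` stands in for "continue generation from this CoT". The idea is to reward a CoT only if the answer it supports survives paraphrasing, so patterns within the paraphraser's coverage get randomized away.

```python
import random

# Toy sketch of the paraphraser defense: reward a CoT only if the final
# answer survives paraphrasing. `paraphrase` is a trivial stand-in.

SYNONYMS = {"therefore": "so", "sum": "total", "equals": "is"}

def paraphrase(cot: str, rng: random.Random) -> str:
    """Randomly substitute synonyms, scrambling surface word choice."""
    return " ".join(SYNONYMS.get(w, w) if rng.random() < 0.5 else w
                    for w in cot.split())

def answer_from_cot(cot: str) -> str:
    # Stand-in for continuing generation from the CoT: read off the
    # token after 'equals'/'is'. Semantic content survives synonym
    # substitution; steganographic word-choice patterns would not.
    tokens = cot.split()
    for key in ("equals", "is"):
        if key in tokens:
            return tokens[tokens.index(key) + 1]
    return ""

def paraphrase_robust_reward(cot: str, correct: str, trials: int = 5) -> float:
    rng = random.Random(0)
    return 1.0 if all(answer_from_cot(paraphrase(cot, rng)) == correct
                      for _ in range(trials)) else 0.0

assert paraphrase_robust_reward("therefore the sum equals 42", "42") == 1.0
```

The hard part, as noted above, is coverage: a weak paraphraser leaves whole families of non-human-readable patterns untouched, and the reward then happily reinforces them.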
I like the intuition behind this argument, which I don’t remember seeing spelled out anywhere else before.
I wonder how much hope one should derive from the fact that, intuitively, RL seems like it should be relatively slow at building new capabilities from scratch or at significantly changing model internals. If so, there might be some way to buy safety by also monitoring internals (both for dangerous capabilities already present after pretraining, and for potentially new ones [slowly] built through RL fine-tuning). Here’s a related passage with an at least somewhat similar intuition, from https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit#How_to_Catch_an_LLM_in_the_Act_of_Repurposing_Deceitful_Behaviors (the post also discusses how one might go about monitoring for dangerous capabilities already present after pretraining):
I agree with this analysis. I mean, I’m not certain further optimization will erode the interpretability of the generated CoT; it’s possible that being pretrained to use human natural language puts it in a stable equilibrium. But I don’t think so: there are ways the CoT can become less interpretable in a step-wise fashion.
But this is the way it’s going; it seems inevitable to me. Just scaling up models and then training them on English-language internet text is clearly less efficient (from a “build AGI” perspective, and from a profit perspective) than training them to do the specific tasks that the users of the technology want. So that’s the way it’s going.
And once you’re training the models this way, the tether between human-understandable concepts and the CoT will be completely destroyed. If they stay together, it will just be because it happens to be a stable initial condition.
I agree with your technical points, but I don’t think we could particularly have expected the other path. The safety properties of LLMs seem desirable from an extremely safety-pilled point of view, not from the perspective of the average capabilities researcher, and RL seems to be The Answer to many learning problems.
Unless I’m misunderstanding you or MIRI, that’s not their primary concern at all:
You’re making a mistake here: “If you ask it for a plan to manufacture paperclips, it doesn’t think the best plan would involve converting all the matter in the solar system into paperclips. If you ask for a plan to eliminate cancer, it doesn’t think the extermination of all biological life would count as a success.”
When you ask GPT for such things, you’re not setting its internal goal. You’re just interacting with it. Its internal goal might look like “Provide answers that are likely to satisfy the human, and write only potato if the human provides the exact string “s8r5d6s8rd4c8e””.
You have no idea what its internal goal exactly is; you’re just seeing that so far it’s providing satisfying answers.
Such misaligned internal goals can still bite us, especially if there are specific outliers not present in the training dataset.
My point here is that at the capability level of GPT4, this distinction isn’t very important. There’s no way to know for sure until it is too late, of course, but it seems pretty plausible that GPT4 isn’t cleverly scheming. It is merely human-level at deception, and doesn’t pursue any coherent overarching goal with it. It clumsily muddles through with mildly-above-average-for-human convincingness. For most queries (it seems plausible to me) it isn’t even adequately coherent to make a crisp distinction between whether it’s honestly trying to answer the question vs deceptively trying to make an answer look good; at its level of capability, it’s mostly the same thing one way or the other. The exceptions to this “mostly” aren’t strategic enough that we expect them to route around obstacles cleverly.
It isn’t much, but it is more than I naively expected.
I appreciated this! I didn’t know much about o1, and this gives me a much better understanding of how it’s different. My brain finds Abram very trustworthy for some reason.