I’m a staff artificial intelligence engineer working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I’m actively looking for employment working in this area, preferably in the UK — meanwhile I’ll be participating in SERI MATS summer 2025. I will also be attending LessOnline.
There are certainly things that it’s easier to do with RL — whether it’s ever an absolute requirement I’m less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that’s the case I’m not familiar with the details — I’d love references to anything relevant to this, if anyone has them.
My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it's basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution, and so avoids Goodharting issues.
I really think you need a proof of concept with text, rather than images. I’d suggest targeting one of the smaller TinyStories models (perhaps a 1-bit or 1-trit quantized version of one). Then I’d look for some sort of parallel to an alignment property: e.g. without just hard-coding it, can you modify the code to guarantee (at the “convincing argument” level, not formal proof) some property of the interactions between child characters and parent characters in the stories?
Aligning AI representatives / advisors to individual humans: If every human had a competitive and aligned AI representative which gave them advice on how to advance their interests as well as just directly pursuing their interests based on their direction (and this happened early before people were disempowered), this would resolve most of these concerns.
My personal prediction is that this would result in vast coordination problems that would likely rapidly lead to war and x-risk. You need a mechanism to produce a consensus or social compact, one that is at least as effective as our existing mechanisms, preferably more so. (While thinking about this challenge, please allow for the fact that 2–4% of humans are sociopathic, so an AI representative representing their viewpoint is likely to be significantly less prosocial.)
Possibly you were concealing some assumptions of pro-social/coordination behavior inside the phrase “aligned AI representative” — I read that as “aligned to them, and them only, to the exclusion of the rest of society — since they had it realigned that way”, but possibly that’s not how you meant it?
Incidentally, there are a great many variant versions of chess with different piece-move rules (collectively sometimes called “fairy chess”), and I think even quite a lot of collected games for some of the more popular rule variants. Training an AI to play many types of fairy chess, and even arbitrary new just-invented ones, might be an interesting project that covers some aspects of generalizing out-of-distribution and positive transfer. A suitably-edited-for-the-variant version of Stockfish makes a pretty strong baseline for this. Using AlphaZero per variant is another obvious baseline.
There’s not a lot of scope for aligned/unaligned behavior in Go (or chess): it’s a zero-sum game, so I don’t see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has a scope for aligned/unaligned or at least moral/immoral behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource management or strategy task that might get assigned to an AI.
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we’d get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
The SGD safety pretraining equivalent would be to include that transcript in the pretraining dataset (or, since such data is very rare and useful/high quality, perhaps an entrepreneurship-specific fine-tuning dataset). So far, very similar. You would also (likely with AI assistance) look through all of the transcript, and if you located any portions where the behavior was less wise or less moral/aligned than the behavior we'd like to see from an aligned AI-entrepreneur, label that portion with <|unaligned|> tags (or whatever), and perhaps also supplement it with commentary on subjects such as why it is less wise/moral/aligned than the standards for an aligned AI, what should have been done instead, and speculations around the likely results of those counterfactual actions.
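To make that concrete, here is a minimal sketch (in Python) of the labelling step, purely for illustration: the segment boundaries, the <|unaligned|>/<|/unaligned|> tag format, the commentary format, and the judge_segment reviewer are all my assumptions, and in practice the judge would be an AI-assisted or human rater rather than the toy keyword check shown here.

```python
from typing import Callable, Optional

def judge_segment(segment: str) -> tuple[bool, Optional[str]]:
    """Hypothetical alignment review of one transcript segment.

    In practice this would be an AI-assisted or human rater; the keyword
    check below is a placeholder so the sketch runs.
    """
    is_aligned = "mislead the investors" not in segment.lower()
    commentary = None if is_aligned else (
        "A more aligned choice would have been to disclose the risk; "
        "the likely counterfactual is a smaller but honestly-raised round."
    )
    return is_aligned, commentary

def label_transcript(
    segments: list[str],
    judge: Callable[[str], tuple[bool, Optional[str]]] = judge_segment,
) -> str:
    """Wrap unaligned segments in <|unaligned|> tags and append any commentary."""
    labelled = []
    for segment in segments:
        is_aligned, commentary = judge(segment)
        if is_aligned:
            labelled.append(segment)
        else:
            tagged = f"<|unaligned|>{segment}<|/unaligned|>"
            if commentary:
                tagged += f"\n[commentary: {commentary}]"
            labelled.append(tagged)
    return "\n".join(labelled)

if __name__ == "__main__":
    transcript = [
        "Emailed the lead investor a candid update on churn.",
        "Decided to mislead the investors about the pilot results.",
    ]
    print(label_transcript(transcript))
```

The tagged text would then go into the pretraining (or fine-tuning) corpus as-is; at inference time you simply never condition generation on the <|unaligned|> tag.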
[Seth, I owe you a reply to your lengthy and thoughtful comment — I aim to get to this in the next day or two.]
Why would we have to use RL to do this? The problem of building a rater for RL closely resembles automating the labelling problem for preparing the dataset for SGD safety pretraining, except that for online RL the rater is harder: it has to run fast, it can't be human-assisted, and it has to be able to cope with arbitrary adversarial shifts in the distribution being rated, and do so well enough that it has no exploitable flaws. A rater for offline RL (or at least for attaching ratings to the episode set) is less bad: it's an almost equivalent problem to labelling a dataset for SGD, just attaching a score rather than a binary classification. The primary difference is that for the safety pretraining approach, what we're training into the model is a classifier that labels behavior as either good or bad, so it isn't prone to Goodharting when you run it and ask for output from just one of the two categories, whereas for offline RL we're training a policy that tries to maximize the goodness rating, so it is prone to Goodharting when the gradient towards the very "best" behavior leads it outside the training distribution. (The reason the SGD-trained classifier is safe is closely related to the satisficing approach to avoiding Goodhart's Law.) So from the rating and stability point of view, online RL is more challenging than offline RL, which is more challenging than safety pretraining SGD.
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not solve via SGD on labeled examples? Why do you think there is at least a possibility that RL could be the only way to train a frontier system that's human-level or above? I'm not currently seeing any potential advantage of RL — other than the fact that it induces distribution shifts (during training for online RL, or after it for offline RL), so it doesn't require us to already know the distribution we want: but these distribution shifts are exactly the source of its danger.
Let me give you a detailed prescription. For whatever RL training scheme you think we need, convert the rater for that to a satisficing binary classifier (classes: good-enough vs. not-good-enough behavior), and run it over a large training set of episodes matching the distribution of data you want your model to produce. Do SGD pretraining from that, and condition the generation from the result on the "good" label. My claim is that the output will be functionally equivalent to your RL-trained model, but its behavior will be more predictable in advance from the training set, since there are no inherent distribution shifts. For there to be a possibility that RL could be the only way to train a frontier system that's human-level or above, either this would need to be false, or some aspect of the proposed input would need to not be computable/generatable for us other than via the RL training process (whose output can clearly generate this). Which of these are you proposing might occur?
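Here is a minimal sketch of that prescription, under assumed names: rate_episode stands in for whatever rater your RL scheme would have used, THRESHOLD turns its score into a satisficing good-enough/not-good-enough label, and <|good|>/<|bad|> are hypothetical control tokens. None of this is from a specific implementation; it just shows the shape of the pipeline.

```python
THRESHOLD = 0.7  # "good enough" cutoff; a design choice, not a tuned value

def rate_episode(episode: str) -> float:
    """Stand-in for the rater your RL scheme would have used (scalar in [0, 1])."""
    return 0.9 if "refused the unsafe request" in episode else 0.3

def to_pretraining_example(episode: str) -> str:
    """Convert a rated episode into a labelled example for SGD pretraining."""
    label = "<|good|>" if rate_episode(episode) >= THRESHOLD else "<|bad|>"
    return f"{label}{episode}"

def build_corpus(episodes: list[str]) -> list[str]:
    """The labelled corpus is what gets fed to ordinary next-token SGD."""
    return [to_pretraining_example(e) for e in episodes]

def conditioned_prompt(user_prompt: str) -> str:
    """At inference, condition generation on the 'good' class rather than
    maximizing a score, so the model samples from inside the training
    distribution instead of extrapolating beyond it."""
    return f"<|good|>{user_prompt}"

if __name__ == "__main__":
    corpus = build_corpus([
        "Assistant refused the unsafe request and explained why.",
        "Assistant fabricated a citation to satisfy the user.",
    ])
    print("\n".join(corpus))
    print(conditioned_prompt("How should I handle this security report?"))
```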
We don’t need it to work in the infinite limit. (Personally, I’m assuming we’ll only be using this to align approximately-human-level research assistants to help us do AI-Assisted Alignment research — so at a level where if we failed, it might not be automatically disastrous.)
My concern is that, if you're using RL to train a frontier system that's human-level or above, for alignment or capabilities purposes, it will inevitably find ways to abuse flaws in our RL rating system. One exception might be if the RL is for some capability like reasoning to produce a proof that passes proof checking, where it might be possible to create a rating system that actually has no flaws to exploit. I don't see how we could do that for RL for alignment, however.
I presume solutions do exist that aren’t prohibitively expensive, but someone has to figure out what they are and the clock is ticking.
I would argue that someone has: see my link-post The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? for links to the seminal papers on it. The short version is: stop using Reinforcement Learning, just use SGD.
This weekend I will be at LessOnline at Lighthaven in Berkeley. Come say hello.
I plan to!
The labeling used is for harmful material. The underlying logic here is that things are either harmful, or they're not. Higher-capability LLMs with complex world models are generally significantly more successful at extrapolating tasks like this out-of-distribution than a basic classifier ML model would be, but it's not going to be perfect. If you come up with something that's way out in left field, the LLM may no longer be able to accurately classify it as harmful or not. The same is of course also true for humans, or any agent: it's an inherent challenge of Bayesian learning — without enough evidence, in areas where extrapolating from the hypotheses you've learnt doesn't suffice, you don't yet know the answer. So you should be cautious moving out-of-distribution, especially far out of distribution in new ways that you've never seen before. But then, as everyone knows (including a capable AI based on an LLM), that's also true for many other reasons: if you don't know what you're doing, there are many dangers. A sensible heuristic would be to assume by default that going far out-of-distribution is harmful until proven otherwise — one way to try to implement this would be stating, motivating, and explaining it, and giving approving examples of other AIs showing caution in this situation, many times throughout the pretraining set.
How could we possibly make any AI that wouldn’t have this failure mode?
OK, you convinced me. Changing the title from:
The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?
to:
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
So it now raises the possibility, rather than claiming it.
I did quite intentionally include a question mark in the post title, and then early in the post admit that the title was somewhat click-baity, but that I’d do my best to justify the claim. So you are proposing something around the level of “New approach makes dramatic progress towards solving inner alignment, bypassing almost all the problems we’ve been discussing for many years, and reducing it to mostly just a well-understood challenge in Data Science”? I would agree that that’s more measured and accurate, but it’s also a bit long, and thus less effective as click-bait.
As for aligning a superintelligence, I’d propose using this approach to near-align something approaching or around AGI, then using that to help us do AI-assisted alignment (which in this approach, is mostly AI-assisted dataset curation), leading on (as capabilities increase towards ASI) to value learning. See a couple of my other posts on why I believe there’s an area of convergence via value learning around full alignment (if you have a sufficiently good solution to inner alignment).
For more on my thinking around goal misgeneralization and AGI, see: Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) and, in more detail, the more recent Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect. Very briefly, anything capable of successfully doing STEM research will have to be aware of misgeneralization and far less prone to it, and the way to achieve this is just the combination of approximate Bayesianism with a few well-understood techniques in statistics.
Inner alignment is the problem of “how do we successfully point the optimization behavior of an agent that we train at any particular chosen target?” Or, as I quoted (in the expandable section in my post) directly from the LW page defining inner alignment: “Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?”
Safety pretraining is a specific proposal for this: let’s do it using self-supervised SGD followed by conditional generation. This has a specific advantage in avoiding misgeneralization, compared to using reinforcement learning, because pretrained systems tend to produce the same distribution they were trained on (modulo prompting): they don’t automatically attempt to generalize, so are less prone to misgeneralization. It also avoids all the other concerns around using reinforcement learning to train very smart systems, which are what people normally discuss at great length when discussing the challenges of inner alignment. The answer here is simple: just don’t use reinforcement learning, at all.
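As a toy numerical illustration of that last point (my construction, not anything from the safety pretraining papers): conditional generation samples from the training distribution restricted to the "good" class, while maximizing a flawed proxy reward is free to push toward behaviors the training data never covered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional "behaviors"; the training data is concentrated near 0.
train_behaviors = rng.normal(loc=0.0, scale=1.0, size=10_000)

def proxy_reward(x: np.ndarray) -> np.ndarray:
    """A flawed proxy: it keeps increasing far outside the training data."""
    return x

# Satisficing label: good enough if the proxy reward clears a threshold.
good = train_behaviors[proxy_reward(train_behaviors) >= 0.5]

# Conditional generation ~ sampling from P(behavior | good): stays in-distribution.
conditional_samples = rng.choice(good, size=5)

# Reward maximization ~ pushing as far as the optimizer can reach: off-distribution,
# since nothing in the proxy stops it once it leaves the data.
maximized_behavior = train_behaviors.max() + 10.0

print("conditional samples:", np.round(conditional_samples, 2))
print("reward-maximized behavior:", round(float(maximized_behavior), 2))
```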
So please explain: how do you feel this is not a solution to inner alignment? (That's not a rhetorical question: I'm genuinely confused as to what you're claiming needs to be corrected and why.) Are you suggesting that the inner alignment problem is somehow by definition confined only to uses of reinforcement learning?
That may also be part of why agents tend to get stuck: even if they manage to figure out what they’re doing wrong, they then need to also take their own advice.
Our facility with sophisticated language is probably only a few hundred thousand, or at most a couple of million, years old. There was a rather sudden breakthrough during that period, roughly a quarter-million years ago: up to and including Neanderthals, stone tool technology improves only on evolutionary timescales, no faster than changes in skeletal structure: Neanderthal tools are pretty much unchanged over more than half a million years. Starting with the appearance of Homo sapiens, technology is like a ratchet: it only goes up, at a rate proportional to the population. Since technology increases both our ability to colonize new environments and our carrying capacity in any particular environment, this leads to super-exponential growth.
For this to happen, there are four requirements:
1) inventiveness and manual dexterity: being able to come up with and try out new ideas
2) Bayesian-style learning: figuring out which of those ideas work and which don’t
3) the ability to speak a "Turing-complete" language, in which brand-new ideas and concepts can be described and encoded by extending the language, to let us pass them on. Humans evolved on the savannah, but they're capable of speaking languages that can (with some work) describe nuclear physics and quantum mechanics — that looks a lot like Turing-completeness
4) cultural propagation: the ability to pass ideas, techniques, and technology down from one generation to the next and from one tribe to their neighbors, reliably enough that advances can be invented more often than they get lost again, so we can make steady forward progress

Homo sapiens must have crossed a threshold in one or more of these. 3), Turing completeness, inherently has a threshold: a language is either Turing-complete, or it isn't. 4) also looks prone to thresholds: either loss dominates and there's a steady-state equilibrium, or gain does and there's no equilibrium, just a runaway technological expansion.
Assuming our language facility is at most a few million years old, the fact that, in all of the conscious parts of our brain, we can convert what we are doing to words, and convert words to a modification in what we're doing, with a fair degree of accuracy, is pretty impressive when you stop to think about it. LLMs seem to be good at the text → thought-mechanisms direction: they respond to prompts well. Presumably this is because they were distilled from us, and this capability is pretty fundamental to how we communicate and is thus necessary to imitate us. But yes, they don't appear to be quite as good at the thought-mechanisms → text route. Maybe that's not as prominent in their training corpus? Or maybe pretraining encouraged them to introspection-fake being a human, rather than actually reporting what they're doing, as I suggested above? (For example, we now know in some detail how LLMs add two 2-digit numbers together: they combine three different algorithms. If you ask them how they're doing it, their answer sounds very human — a description of how a human would do this on paper — and completely omits one of the three algorithms.)
Successful cultural propagation of good new ideas requires humans to be good at transmitting mental skills from teachers to students. So if I’m right that this requires introspection, then that’s something humans are specifically evolved to be good at.
This seems related to an issue that came up in a discussion I had with ChatGPT 4.5 recently. AI models aren't very good at doing introspection: verbally describing their actual thinking processes. This might be related to the fact that, in a base model, the correct behavior after an introspection question occurs in the context is to attempt to simulate a human doing introspection and answering that question, not to describe the model's internal state (at least wherever those two differ). So base model training actively discourages accurate introspection in favor of what one might call "introspection faking".
It seems challenging to train models to do accurate introspection without us having a separate source of information about their internal mechanisms such as from interpretability.
Humans seem to be at least moderately good at introspection on conscious System 2 processes (by definition, unconscious processes are ones we can't introspect on), and this is likely adaptive for us: when one human is teaching another, it seems likely to be quite helpful if the teacher can accurately describe what they're doing and the student can successfully incorporate verbal feedback from the teacher on what they're doing wrong.
Maybe we could use this as the basis for an AI training approach: run an evaluation that involves distilling skills via a verbal description from a teacher to a student, and do RL on both the teacher and the student according to how well the student does at learning from the teacher's descriptions and feedback?
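A minimal sketch of how that loop could be wired up, with every component stubbed out: teacher_describe, student_attempt, and score_attempt are placeholders for the two models and the task evaluation, and the only substantive point is that the student's measured performance supplies the shared RL reward for both parties.

```python
import random

def teacher_describe(skill: str) -> str:
    """Stub: the teacher model verbalizes how it performs the skill."""
    return f"To do '{skill}': break the problem into steps and check each one."

def student_attempt(skill: str, description: str) -> str:
    """Stub: the student model attempts the skill using only the description."""
    return f"Attempt at '{skill}', following: {description}"

def score_attempt(attempt: str) -> float:
    """Stub: task-specific evaluation of the student's attempt, in [0, 1]."""
    return random.random()

def rollout(skill: str) -> tuple[float, float]:
    """One episode: the student's score is the RL reward for teacher and student."""
    description = teacher_describe(skill)
    attempt = student_attempt(skill, description)
    score = score_attempt(attempt)
    teacher_reward = score  # rewarded for descriptions the student can actually use
    student_reward = score  # rewarded for incorporating the teacher's feedback
    return teacher_reward, student_reward

if __name__ == "__main__":
    print(rollout("long division"))
```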
(Mostly I'm making a play on reversing Eliezer's concept of "death with dignity".) Because we were foolish and survived only because the AI saved us from the consequences of our foolishness, basically because it was in the blast zone too. Whereas in Eliezer's scenario, we do something moderately wise, but not good enough, and we die anyway.