You won’t solve alignment without agent foundations
The following is part of a response to a problem posed by Vivek Hebbar (his questions are in italics). I looked at why an alignment proposal doesn’t work at an AGI level, in connection with only two of Eliezer’s doom arguments. After that, I talked about why I’m generally pessimistic about every specific proposal for training an AGI that I’ve heard.
(Thanks to my wonderful friends for their help: Slava Meriton, who spent many hours making me pay attention to what I was saying, and Owen, who checked my English.)
Epistemic status: mostly a dump of intuitions; I’m at 70% that in the next three months I won’t change my mind about the important parts of my views expressed here.
I wrote the text to be read by Nate Soares and Vivek Hebbar. I think some claims, including the post title, are not fully justified by the text. I think this might still be valuable to publish.
● Pick an alignment proposal and specific task for the AI.
I’m picking the Externalized reasoning oversight proposal. The task for the AI is to come up with a solution to ELK that Paul Christiano won’t find issues with, as a proof-of-concept for solving alignment.
● First explain, in as much concrete detail as possible, what the training process looks like.
The training process looks like this[1]:
Train a language model to accurately predict the next token in a sequence. Then, fine-tune the model to solve problems with easily evaluable answers step-by-step at a superhuman level. The model is rewarded for assigning higher probabilities to tokens which end up at a correct answer, and slightly penalized for making more steps. Fine-tuning includes attempting to ensure that the problem-solving-directed reasoning is fully represented by the English descriptions of the reasoning steps[2] and an RLHF setup[3] with humans evaluating how much they like the reasoning. During the problem solving process, check that the reasoning doesn’t include plans for hacking or deception of anything or anyone in the real world.
After that, ask the model to reason through and output ELK solutions such that Paul Christiano wouldn’t find issues with them. The further plan would look something like “send the resulting solution to Paul, ???, and alignment is solved”.
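The reward shaping described above (credit for reaching a correct answer, a slight penalty per reasoning step, plus an RLHF-style score for how much humans like the reasoning) can be illustrated with a toy scoring function. This is only a sketch of the incentive structure, not the actual token-level RL objective; the function name and the penalty constant are illustrative assumptions.

```python
# Toy sketch of the incentive structure in the described setup.
# The real proposal rewards token probabilities in an RL fine-tuning
# loop; the names and constants here are assumptions for illustration.

STEP_PENALTY = 0.05  # slight per-step penalty (value assumed)

def episode_reward(answer_correct: bool, num_steps: int,
                   human_rating: float) -> float:
    """Combine the three pressures described in the text:
    correctness of the final answer, a small cost per reasoning step,
    and an RLHF-style score for how much humans like the reasoning."""
    correctness = 1.0 if answer_correct else 0.0
    return correctness - STEP_PENALTY * num_steps + human_rating

# A longer chain of thought must buy extra correctness or approval
# to pay for its additional steps:
short = episode_reward(True, num_steps=3, human_rating=0.2)
long = episode_reward(True, num_steps=10, human_rating=0.2)
assert short > long
```

Even in this toy form, the pressure the post worries about is visible: the score depends only on the outcome and on how the reasoning looks to evaluators, not on what process actually produced it.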
● Then go through Eliezer’s doom list. Pick 2 or 3 of those arguments which seem most important or interesting in the context of the proposal.
● For each of those arguments:
What do they concretely mean about the proposal?
Does the argument seem valid?
If so, spell out in as much detail as possible what will go wrong when the training process is carried out
What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?
Important or interesting arguments:
AGI Ruin #22 (There’s no simple core of alignment that’s easier to find than to find the generalization of capabilities; capabilities generalize further out-of-distribution than alignment, once they start to generalize at all)
What do they concretely mean about the proposal?
The proposal doesn’t pinpoint a cognitive architecture aligned on superhuman-level tasks with significant real-world consequences. It might find an AI that will demonstrate aligned behavior during training. But once it’s superhuman and works on the ELK problem, the alignment won’t generalize.
Does the argument seem valid?
Seems valid. The training setup doesn’t pinpoint anything that ensures the AI will still behave aligned on the task we want it to perform.
If so, spell out in as much detail as possible what will go wrong when the training process is carried out
The proposal design process didn’t include a sensible effort towards ensuring the generalization of alignment, and things break. Multiple loops incentivize more agentic and context-aware behavior, not actual alignment. Training to solve problems and to score well as judged by humans breaks the myopia of the next-token predictor: now, gradient descent favors systems that shape their thoughts with the aim of a well-scoring outcome. More agentic systems score better, and there’s nothing additionally steering them towards being aligned.
(Additionally, if:
- the language model reasons about other entities, imitating or simulating them,
- the outputs it produces are added to its future inputs, and
- some entities or parts of the entities are able to gain more influence over the further tokens by having some influence over the current tokens,
that might lead to the most context-aware and agentic entities gaining control over the system and steering it towards optimizing for their preferences. Even if, for some reason, an AI capable enough to come up with an ELK solution is not agentic enough to not use all its intelligence to predict what an agent trying to steal a diamond or a superhuman AGI protecting a diamond would do, the result of its thinking will be agentic enough to be dangerous.)
Some cognitive architecture for solving problems at a superhuman level gets selected. It won’t magically turn out to be the unusual kind of AGI that wants something like just writing, for any posed problem, thoughts in English and a solution (coming from these thoughts) that smarter versions of humans would want humans to see and use. While AGIs aligned like that exist, the proposed approach doesn’t come close to them. The proposal does not influence what goals the superhuman-level cognitive architecture will end up pursuing, and getting any kind of behavior from the AI system before the superhuman-level cognitive architecture is fully established and in control doesn’t help. In the end, some AGI appears, outputs a text that a human (or a computer) looks at, and shortly, everyone’s dead.
What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?
(I couldn’t come up with an idea for a setup that produces a system that avoids killing everyone for the above reason, while helping us prevent further AGIs from appearing until we solve the rest of the problem.)
● I have a shaky intuition that if you train a system on a distribution where the real problem it’s asked to solve would be the kind of example it was already trained on; or, more generally, train it in a way such that it doesn’t have to generalize aligned behavior far, it might be possible to find a useful enough system (if we find any domains to train on where capabilities don’t need to generalize far to become helpful in the real world). E.g., you might be able to train and use a theorem-prover for longer than the system proposed above before it kills you.
● With adversarial training, it might be possible to increase alignment more than decrease capabilities.
AGI Ruin #27 (optimizing against an interpreted thought optimizes against interpretability)
What do they concretely mean about the proposal?
When you explicitly optimize for the “thoughts” in English to be “aligned” on some benchmarks, you also optimize for these “thoughts” to represent how and why the model produces answers to a lesser extent.
Does the argument seem valid?
Yep.
If so, spell out in as much detail as possible what will go wrong when the training process is carried out
The setup optimizes the model to output “aligned” (as judged by the RLHF part[4]) thoughts that “lead” to correct answers in as few steps as possible. The AI stores some of its alien thoughts, including ones that would be deemed unaligned, somewhere inside the tokens that look like aligned English thoughts leading to the answers. When the setup rephrases the thoughts, the AI learns to use the meaning perceived by the paraphrasing models, instead of the individual tokens, as a working memory where it stores parts of its real thoughts. Humans look at some words, think they are the real aligned-seeming thoughts leading to correct answers, and die, because the AI doesn’t inherently have to use English the way we believe we use it when we think.
What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?
There might be low-hanging fruit in getting rid of unaligned thoughts to a much larger degree than producing undetectably unaligned thoughts. However, I’m guessing this will quickly run into diminishing returns. Unless the premises of the doom argument[5] are incorrect, this doesn’t help enough. Also, maybe you can somewhat increase the interpretability of the thoughts if the model isn’t pressured into changing how it uses English. Possibly, if you launch GPT-5 or PaLM 2 with “let’s think step by step” without any fine-tuning, the English words might be more related to the agentic thinking, even when the process running the thinking is alien.
Overall, how promising or doomed does the alignment proposal seem to you (where ‘promising’ includes proposals which fail as currently written, but seem possibly fixable)?
If not promising, what is the most fatal and unfixable issue? If there are multiple fatal issues, is there a deeper generator for all of them?
This kind of proposal (along the lines of research assistants, etc.) seems potentially helpful for speeding up alignment research, maybe even more helpful than a good keyboard. But we’re possibly dead before someone emails a new promising ELK proposal to Paul.
There are multiple fatal issues, and they all kill all the value in the lightcone:
- a hope to look inside the AI’s thoughts without solid reasons to believe the AI shows you its real thoughts;
- a hope it isn’t an alien mind;
- a hope to train an AI to a superhuman level while preserving the goals it tries to achieve;
- an expectation that a system that produces aligned-looking behavior before it’s capable enough to kill you won’t try to kill you when it has a chance;
- a hope that a system powerful enough to help you (e.g., reason about ELK proposals and come up with one Paul won’t find issues with) won’t be context-aware and powerful enough to circumvent the not-superhuman bad-thought-detectors;
- a hope that you won’t introduce any additional optimization loops once you plug the system’s output into its inputs, and that a system reasoning about smart and competing agents won’t inherit that agency;
- a hope that a language model will continue to be myopic once you start evaluating it on consecutive tokens it outputs.
Etc., etc.
If you imagine a space of all possible AGIs powerful enough to prevent other AGIs from appearing, there are some small regions that correspond to what we would be ok with calling “aligned” (i.e., they don’t kill everyone, don’t destroy all the value in the universe, and help us prevent unaligned AGIs from appearing). I think the deep problem is that this approach doesn’t attempt to directly contribute to the search process ending up at an aligned AGI.
Areas that destroy all the value are much larger and actively attract all kinds of search processes, including easily imaginable gradient descents over neural networks’ weights, while some aligned regions actively dispel lots of search processes (e.g., the parts with corrigible or non-consequentialist AGIs; what MIRI folk call unnatural). A lot of research approaches take iterative steps towards making the exact way a current training setup gets attracted towards things that kill everyone less obvious, and don’t attempt to make the search process end up in the regions that are not deadly. Coming up with some smart way to create characters imitated (or simulated) by LLMs that would help us with the research (and maybe using lots of ingenuity to insert things that should show us the internal processes the characters have) might sound great, but usually, people don’t even attempt to pinpoint how exactly that directs us to the small regions with aligned AGIs, instead of many things around—including things that are pretty attractive to all sorts of optimization pressures that exist once you continue the prompt for more than one token[6]. This leads to people generating ideas for alignment that don’t actually attack central problems, and only a few people doing research where they directly understand how it increases our chances.
There aren’t hundreds of groups looking for and pursuing different research directions that would make an attempt at finding some desiderata for a training process that would lead to the right region, if we’re lucky with inner alignment. Instead, there’s only Vanessa Kosoy’s IB/IBP with a couple of people and probably not enough time, even if it’s a direction that could succeed. Some people think about what different search processes look like, and some look into the confusions that we would want to have researched, if a miracle happens, which is helpful and dignified but not enough.
Optimizing for finding plausible-sounding stories of how to get something that looks kind of aligned, and attempting to get to regions that might look similar to the aligned ones in some ways, generates proposals that don’t attack the problem, if one is not careful to keep in mind the goal of increasing the chance of ending up at an actually aligned AGI and the associated difficulties. It might be easy to miss that alignment is much harder to achieve from the current point than many deadly things that sound kind-of-aligned until they become smart, reflect, and kill you.
What is the broadest class of alignment proposals which is completely ruled out by the issues you found?
Any proposal that lacks explicit reasons to expect it to arrive at exactly an aligned AGI (currently, all the specific proposals for training an AGI that I’m aware of).
- ^
I hope I don’t strawman multiple people too much. In particular, the original proposal assumes developing “trustworthy” tests, which are not included.
- ^
E.g., you protect against steganographic use of specific tokens by using a different model to paraphrase the reasoning the model outputs and checking how much the capabilities drop; test whether the reasoning is responsible for the model’s capabilities by editing the reasoning or completely removing it and checking whether the capabilities drop or the conclusions change; don’t let the system be capable without the explicit reasoning.
- ^
Separately fine-tuning a system to be good at evaluating how much humans like the reasoning, showing data where it’s uncertain to humans and adding the results of the evaluations to the dataset, and training the model to score well as judged by that system.
- ^
Putting aside the issue of human raters making systematic errors (see Eliezer’s argument 20 on the list).
- ^
E.g., that we probably won’t get close to a system that’s able to look at any architecture with any weights and activations and determine whether it’s thinking unaligned thoughts.
- ^
If being more context-aware or reflective or power-seeking lets things that have some influence over the next token gain even more influence over further tokens, those things gain influence, and they won’t necessarily be any of your characters, even if the characters themselves might be able to gain context-awareness.
I’m interested in hearing a concrete story here, in the style of A shot at the diamond alignment problem. I currently don’t understand what you mean by “multiple loops incentivize more agentic and context-aware behavior and not actual alignment.” One guess: “The AI gets smarter but not more aligned.” But what does that correspond to, in terms of internal AI cognition? I think I need more detail on that model to evaluate your claim here.
Why is that a problem? If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
Alternatively, it could be that the way SGD improves capabilities is, at least in some regimes of the training run, by expanding the agent’s current values. Then the claim would be wrong, at least in generality.
For example wrt the second bullet, what if you have an AI which scans a prompt like “Steve went to the store to get chocolate” and stores “chocolate” in its internal state because it’s the kind of feature which historically was prediction-relevant. So the AI has some way of embedding prediction-relevant features for the later blocks to attend to.
Then the prompt continues: ”, but a mugger approached him. What should he do?” Suppose we want the AI to help people solve their problems. We’d like the AI to propose a plan for Steve to escape unharmed. Perhaps the AI is presently making decisions on the basis of internal planning using a predictive world model, including a model of what muggers do. Perhaps the AI has internal shards of decision-making which bid against plan-completions which the world-model predicts would lead to a person dying.
But the AI outputs a plan like “Give the mugger his wallet”, and then this maybe gets positively rewarded, and so the AI’s shards of value and decision-making generalize further over the course of thousands of reward events. And then the AI’s planning process more efficiently aggregates the shard outputs such that the AI puts high logits on plans which cause the human to not die to a mugger, such that the model is better conforming to its reinforcement events and more efficiently storing information across time (eg decision-relevant features vary across “the mugger has a knife” or “the mugger has a gun” or “the mugger seems weak”, to take a bunch of absurdly mugger-centric scenarios; these features are relevant to the reward events and so eventually get stored more efficiently).
This increases both capabilities and alignment, as the existing (assumed to be somewhat-aligned) values get “expanded” and apply more precisely (eg save people from dying in the generated stories) and also across a broader range of contexts (eg in more situations the AI learns to generate plans which get reward, and a big source of reward is in fact the AI proposing plans which save people). So, in this regime of the training run, the alignment and capabilities properties are somewhat intertwined.
The point isn’t “training goes like that”, there are many ways it could very much not go like that. Point is, if “capabilities generalize but alignment doesn’t” is meant to apply to all reasonable training runs, then the situation I gave cannot be a reasonable training run, or the claim is (AFAICT?) wrong/too general as stated.
Selection in a direction does not imply realization of the selected-for property. So it seems like this argument, on its own, isn’t very strong. You need more details about gradient descent and inductive biases to make this argument, I think. EG:
Evolution selects for IGF, people don’t care about IGF.
Evolution selects for wolves with biological sniper rifles (it increases fitness), wolves don’t have biological sniper rifles.
I feel drawn towards a similar line of reasoning like “People seem to get social reward when people approve of them. Now, their learning process favors internal configurations which shape thoughts with the aim of a well-scoring (ie socially approved-of) outcome.” But many people still care about things beyond being approved of.
ETA I think this sub-bullet is less relevant/not what I want to express here, so I struck it.
(I.e., the reasoning in your post seems like it could also apply to the human learning process, and argue that the genome doesn’t pin down human values like “kindness” enough, and then the genome will fail to produce humans who care about kindness.)
I also am confused by your strong confidence in very strongly stated claims, which seem either wrong or underdefined or very overconfident to me. For example:
EDIT: I see you added
I agree that this post was valuable to publish, and am glad you did. I think it’s fine to make strong claims you don’t justify but still believe anyways, as long as they’re marked (as you did here).
Thanks for the comment! (And sorry for the delayed reply: I was at a CFAR workshop when I posted this, and took some days off after the workshop.)
The text’s target audience was two people who I’d expect to understand my intuitions, so I did not attempt to fully justify some claims. I’ve added a note about that to the post. Also, I apologize for the post title: it’s a claim that the text doesn’t justify, and one that represents my views less exactly. I stand by the other three claims you mention. I’m not sure where exactly the crux lies, though. I’d be interested in having a call for higher-bandwidth communication. Feel free to schedule (30 min | 60 min).
I’ll try to dump chunks of my model that seem maybe relevant and might clarify where the cruxes lie a bit.
Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?
My picture is, roughly:
We’re interested in where we end up when we get to a superhuman AI capable enough to prevent other AGIs from appearing until alignment is solved.
There’s a large class of cognitive architectures that kill everyone.
A lot of cognitive architectures, when implemented (in a neural network or a neural network + something external) and put into a training setup from a large class of training setups, would score well[1] on any of a broad class of loss functions, quickly become highly context-aware and go do their thing in the real world (e.g., kill everyone).
There’s a smaller class of cognitive architectures that don’t kill everyone and would allow humanity to solve alignment and launch an aligned AGI.
We need to think about what cognitive architecture we aim for and need concrete stories for why we succeed at getting to them. How a thing having bits of the final structure behaves until it gets to a superhuman level is only important insofar as it helps us get to a tiny target.
(Not as certain, but part of my current picture.) If we have a neural network implementing a cognitive architecture that a superhuman AI might have, a gradient descent on loss functions of a broad range of training setups won’t change that cognitive architecture or its preferences much.
We need a concrete story for why our training setup ends up at a cognitive architecture that has a highly specific set of goals such that it doesn’t kill everyone.
No behavior we see before a superhuman cognitive structure plays a major role in producing it gives much evidence about what goals that cognitive structure might have.
(Not certain, depends on how general grokking is.) Gradient might be immediately pointing at some highly capable cognitive structures!
You need to have really good reasons to expect the cognitive structure you’ll end up with to be something that doesn’t end humanity. Before there’s a superhuman-level cognitive structure, circuits noticeable in the neural network don’t tell you what goals that cognitive architecture will end up pursuing upon reflection. In my view, this is closely related to argument 22 and Sharp Left Turn. If you don’t have strong reasons to believe you successfully dealt with those, you die.
A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail. There are things in the setup clearly leading to failure. I ignored some of them (e.g., RLHF) and pointed towards others: there are pressures that don’t incentivize anything like a specific set of concrete goals whose maximization leaves survivors, and that do incentivize being more agentic and context-aware.
I specifically meant that generally, when there are highly agentic and context-aware regions of a nearby gradient space, the gradient descent will update the weights towards them, slowly moving towards installing a capable cognitive architecture. In the specific training setup I described, there’s a lot of pressure towards being more agentic: if you start rewarding an LLM for what the text ends up at, a variation of that LLM that’s more focused on getting to a result will be getting selected. If you didn’t come up with a way to point this process at a rare cognitive architecture that leaves survivors, it won’t. The capabilities will generalize to acting effectively to steer the lightcone’s future. There are reasons for unaligned goals to be closer to what AGI ends up with than aligned goals, and there’s no reason for aligned behaviour to generalize exactly the way you imagined.
I’m not aware of a way to select features leading to lots of diamonds when these features are present in a superhuman AGI. If you do RL, the story that I imagine is something like “For most loss functions/training processes you can realistically come up with, there are many goals such that pursuing them leads to the behavior you evaluate highly; a small fraction of these goals represent wanting to achieve lots of conventional diamond in the real universe; the agents you find maximize some random mixture of these goals (with goals that are less complex, or can more easily emerge from the initial heuristics used, or such that directly optimizing for them performs better on your loss, probably having more weight); you probably don’t have diamond-maximization-in-the-actual-universe as a significant part of these goals unless you do something really smart outside of what I think the field is on the way to achieve; and even if you do, it breaks when the sharp left turn happens.”
Human values are even more complicated than diamonds, though it might be easier to come up with a training process where you miss the difference between what’s actually simple and correlated and what you think is simple and correlated. I believe the iterative process the field might be doing here mostly searches for training setups such that we’re not able to find how they fail, and most of those fail. Because of that, I think we need to have a really good and probably formal understanding of what it is that we want to end up in, and that understanding should produce some strong constraints on what a training process for an aligned AGI might look like, which would then hopefully inform us/give us insights into how people should build the thing. We have almost none of that kind of research, with only infra-bayesianism currently directly attacking it AFAIK, and I’d really like to see more somewhat promising attempts at this.
Maybe this is somewhat coming at alignment stories from the opposite direction: I think the question of where we end up and how we get there is far more important to think about than things like “here’s a story of what path gradient descent takes and why”.
Not important, but for the sake of completeness: an AGI might instead, e.g., look around and hack whatever it’s running on without having to score well.
Thanks for your detailed and thoughtful response!
>5% under current uncertainty.
Are you saying that pre-superhuman behavior doesn’t tell you about its goals? Like, zero mutual information? Doesn’t this prove too much, without relying on more details of the training process? By observing a 5-year-old, you can definitely gather evidence about their adult goals, you just have to interpret it skillfully (which is harder for AIs, of course).
I understand this to mean: “If you understand an AI’s motivations before it’s superhuman, that tells you relatively little about its post-reflection values.” I strongly disagree. Isn’t the whole point of the AI improving itself, in order to better achieve its goals at the time of self-improvement?
I also disagree with this. I think that alignment thinking is plagued by nonspecific, nonconcrete abstract failure modes which may or may not correspond to reasonable chains of events. Often I worry that it’s just abstract reasoning all the way down—that an alignment researcher has never sketched out an actual detailed example of a situation which the abstract words describe.
For example, I think I have very little idea what the sharp left turn is supposed to be. If Nate wrote out a very detailed story, I think I would understand. I might disagree with e.g. how he thinks SGD dynamics work, but I could read the story and say “oh, because Nate thinks that time-bias allows faster circuits to gain more control over cognition, they can ‘betray’ the other motivational circuits and execute an internal coup, and we got here because [the rest of Nate’s story].”
(Importantly, these details would have to be concrete. Not “you train the AI and it stops doing what you want”, that’s not a specific concrete situation.)
But right now, there’s a strong focus on possibly inappropriate analogies with evolution. That doesn’t mean Nate is wrong. It means I don’t know what he’s talking about. I really wish I did, because he is a smart guy, and I’d like to know whether I agree or disagree with his models.
I was referring to a situation where the AI already is selecting plans on the basis of whether they lead to diamonds. This is, by assumption, a fact about its motivations. I perceive you to believe that e.g. the AI needs to keep “improving” its “diamond value” in order to, later in training, still select plans on the basis of diamonds.
If so—what does this mean? Why would that be true?