The proposal design process didn’t include a sensible effort towards ensuring the generalization of alignment, and things break. Multiple loops incentivize more agentic and context-aware behavior and not actual alignment.
I’m interested in hearing a concrete story here, in the style of A shot at the diamond alignment problem. I currently don’t understand what you mean by “multiple loops incentivize more agentic and context-aware behavior and not actual alignment.” One guess: “The AI gets smarter but not more aligned.” But what does that correspond to, in terms of internal AI cognition? I think I need more detail on that model to evaluate your claim here.
Why is that a problem? If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
Alternatively, it could be that the way SGD improves capabilities is, at least in some regimes of the training run, by expanding the agent’s current values. Then the claim would be wrong, at least in generality.
For example wrt the second bullet, what if you have an AI which scans a prompt like “Steve went to the store to get chocolate” and stores “chocolate” in its internal state because it’s the kind of feature which historically was prediction-relevant. So the AI has some way of embedding prediction-relevant features for the later blocks to attend to.
Then the prompt continues: ”, but a mugger approached him. What should he do?” Suppose we want the AI to help people solve their problems. We’d like the AI to propose a plan for Steve to escape unharmed. Perhaps the AI is presently making decisions on the basis of internal planning using a predictive world model, including a model of what muggers do. Perhaps the AI has internal shards of decision-making which bid against plan-completions which the world-model predicts would lead to a person dying.
But the AI outputs a plan like “Give the mugger his wallet”, and this maybe gets positively rewarded, so the AI’s shards of value and decision-making generalize further over the course of thousands of reward events. The AI’s planning process comes to aggregate the shard outputs more efficiently, so that the AI puts high logits on plans which cause the human to not die to a mugger: the model better conforms to its reinforcement events and more efficiently stores information across time (eg decision-relevant features vary across “the mugger has a knife”, “the mugger has a gun”, or “the mugger seems weak”, to take a bunch of absurdly mugger-centric scenarios; these features are relevant to the reward events and so eventually get stored more efficiently).
This increases both capabilities and alignment, as the existing (assumed to be somewhat-aligned) values get “expanded” and apply more precisely (eg save people from dying in the generated stories) and also across a broader range of contexts (eg in more situations the AI learns to generate plans which get reward, and a big source of reward is in fact the AI proposing plans which save people). So, in this regime of the training run, the alignment and capabilities properties are somewhat intertwined.
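The bidding-and-aggregation picture above can be sketched as a toy model. Everything here — the shard names, the candidate plans, the bid values — is a hypothetical illustration of the described mechanism, not a claim about any real model’s internals:

```python
import math

# Toy sketch of the "shards bid on plans" picture: shards score
# world-model-predicted outcomes, and a planner turns the aggregated
# bids (playing the role of logits) into a distribution over plans.

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Each shard maps a predicted outcome to a bid:
# positive = bid for the plan, negative = bid against it.
def survival_shard(outcome):
    # Bids strongly against plan-completions where a person dies.
    return -5.0 if outcome["person_dies"] else 1.0

def wallet_shard(outcome):
    # Weakly prefers outcomes where Steve keeps his wallet.
    return 0.5 if outcome["keeps_wallet"] else -0.5

SHARDS = [survival_shard, wallet_shard]

# World-model predictions for each candidate plan (assumed for illustration).
PLANS = {
    "give the mugger his wallet": {"person_dies": False, "keeps_wallet": False},
    "refuse and fight back":      {"person_dies": True,  "keeps_wallet": True},
}

def plan_distribution(plans, shards):
    """Sum shard bids per plan (the 'logits'), then softmax."""
    names = list(plans)
    logits = [sum(shard(plans[n]) for shard in shards) for n in names]
    return dict(zip(names, softmax(logits)))

probs = plan_distribution(PLANS, SHARDS)
# The survival shard's large negative bid on the fatal plan dominates,
# so nearly all probability mass lands on the plan where nobody dies.
best_plan = max(probs, key=probs.get)
```

In this sketch, reward events would correspond to updates that sharpen the shard bids and the aggregation step; only the forward bidding pass is shown.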
The point isn’t “training goes like that”; there are many ways it could very much not go like that. The point is: if “capabilities generalize but alignment doesn’t” is meant to apply to all reasonable training runs, then the situation I gave cannot be a reasonable training run, or else the claim is (AFAICT) wrong/too general as stated.
Training to solve problems and to score well as judged by humans breaks the myopia of the next-token-predictor: now, the gradient descent favors systems that shape the thoughts with the aim of a well-scoring outcome.
Selection in a direction does not imply realization of the selected-for property. So it seems like this argument, on its own, isn’t very strong. You need more details about gradient descent and inductive biases to make this argument, I think. EG:
Evolution selects for IGF, people don’t care about IGF.
Evolution selects for wolves with biological sniper rifles (it increases fitness), wolves don’t have biological sniper rifles.
I feel drawn towards a similar line of reasoning like “People seem to get social reward when people approve of them. Now, their learning process favors internal configurations which shape thoughts with the aim of a well-scoring (ie socially approved-of) outcome.” But many people still care about things beyond being approved of.
ETA I think this sub-bullet is less relevant/not what I want to express here, so I struck it. (IE the reasoning in your post seems like it could also apply to the human learning process and argue that the genome doesn’t pin down human values like “kindness” enough and then the genome will fail to produce humans which care about kindness.)
I also am confused by your strong confidence in very strongly stated claims, which seem either wrong or underdefined or very overconfident to me. For example:
You won’t solve alignment without agent foundations
More agentic systems score better, and there’s nothing additionally steering them towards being aligned.
The proposal does not influence what goals the superhuman-level cognitive architecture will end up pursuing
There are multiple fatal issues, and they all kill all the value in the lightcone
EDIT: I see you added
I wrote the text to be read by Nate Soares and Vivek Hebbar. I think some claims, including the post title, are not fully justified by the text. I think this might still be valuable to publish.
I agree that this post was valuable to publish, and am glad you did. I think it’s fine to make strong claims you don’t justify but still believe anyways, as long as they’re marked (as you did here).
Thanks for the comment! (And sorry for the delayed reply: I was at a CFAR workshop when I posted this, and after the workshop I took some days off.)
The text’s target audience was two people who I’d expect to understand my intuitions, so I did not attempt to justify some claims fully. I’ve added a note about that to the post. Also, I apologize for the post title: it makes a claim that the text doesn’t fully justify and that represents my views less precisely. I stand by the other three mentioned claims. I’m not sure where exactly the crux lies, though. I’d be interested in having a call for higher-bandwidth communication. Feel free to schedule (30 min | 60 min).
I’ll try to dump chunks of my model that seem maybe relevant and might clarify where the cruxes lie a bit.
Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?
My picture is, roughly:
We’re interested in where we end up when we get to a superhuman AI capable enough to prevent other AGIs from appearing until alignment is solved.
There’s a large class of cognitive architectures that kill everyone.
A lot of cognitive architectures, when implemented (in a neural network or a neural network + something external) and put into a training setup from a large class of training setups, would score well[1] on any of a broad class of loss functions, quickly become highly context-aware and go do their thing in the real world (e.g., kill everyone).
There’s a smaller class of cognitive architectures that don’t kill everyone and would allow humanity to solve alignment and launch an aligned AGI.
We need to think about what cognitive architectures we aim for, and we need concrete stories for why we succeed at getting to them. How a system that has only bits of the final structure behaves before it reaches a superhuman level is only important insofar as it helps us hit a tiny target.
(Not as certain, but part of my current picture.) If we have a neural network implementing a cognitive architecture that a superhuman AI might have, gradient descent on the loss functions of a broad range of training setups won’t change that cognitive architecture or its preferences much.
We need a concrete story for why our training setup ends up at a cognitive architecture that has a highly specific set of goals such that it doesn’t kill everyone.
No behavior we see before a superhuman cognitive structure plays a major role in producing it gives much evidence about what goals that cognitive structure might have.
(Not certain; depends on how general grokking is.) The gradient might immediately point at some highly capable cognitive structures!
You need to have really good reasons to expect the cognitive structure you’ll end up with to be something that doesn’t end humanity. Before there’s a superhuman-level cognitive structure, circuits noticeable in the neural network don’t tell you what goals that cognitive architecture will end up pursuing upon reflection. In my view, this is closely related to argument 22 and Sharp Left Turn. If you don’t have strong reasons to believe you successfully dealt with those, you die.
I’m interested in hearing a concrete story here
A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail. There are things in the setup clearly leading to failure. I ignored some of them (e.g., RLHF) and pointed towards others: there are dynamics that don’t incentivize anything like a specific set of concrete goals whose maximization leaves survivors, but that do incentivize being more agentic and context-aware.
I specifically meant that, generally, when there are highly agentic and context-aware regions of the nearby gradient space, gradient descent will update the weights towards them, slowly moving towards installing a capable cognitive architecture. In the specific training setup I described, there’s a lot of pressure towards being more agentic: if you start rewarding an LLM for where the text ends up, variations of that LLM that are more focused on getting to a result will be selected for. If you didn’t come up with a way to point this process at one of the rare cognitive architectures that leave survivors, it won’t end up at one. The capabilities will generalize to acting effectively to steer the lightcone’s future. There are reasons for unaligned goals to be closer to what the AGI ends up with than aligned goals, and there’s no reason for aligned behaviour to generalize exactly the way you imagined.
If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
I’m not aware of a way to select features leading to lots of diamonds when these features are present in a superhuman AGI. If you do RL, the story that I imagine is something like “For most loss functions/training processes you can realistically come up with, there are many goals such that pursuing them leads to the behavior you evaluate highly; a small fraction of these goals represent wanting to achieve lots of conventional diamond in the real universe; the agents you find maximize some random mixture of these goals (with goals that are less complex, or can more easily emerge from the initial heuristics used, or such that directly optimizing for them performs better on your loss, probably having more weight); you probably don’t have diamond-maximization-in-the-actual-universe as a significant part of these goals unless you do something really smart outside of what I think the field is on the way to achieve; and even if you do, it breaks when the sharp left turn happens.”
Human values are even more complicated than diamonds, though it might be easier to come up with a training process where you miss the difference between what’s actually simple-and-correlated and what you merely think is simple-and-correlated. I believe the iterative process the field might be doing here mostly searches for training setups whose failure modes we can’t find, and most of those still fail. Because of that, I think we need a really good, probably formal, understanding of what it is we want to end up with; that understanding should produce strong constraints on what a training process for an aligned AGI might look like, which would then hopefully give us insights into how people should build the thing. We have almost none of that kind of research, with only infra-bayesianism currently directly attacking it AFAIK, and I’d really like to see more somewhat promising attempts at this.
Maybe it’s somewhat coming at alignment stories from the opposite direction: I think the question of where we end up and how we get there is far more important to think about than things like “here’s a story of what path gradient descent takes and why”.
[1] Not important, but for the sake of completeness: an AGI might instead, e.g., look around and hack whatever it’s running on without having to score well.
Thanks for your detailed and thoughtful response!

Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?
>5% under current uncertainty.
No behavior we see before a superhuman cognitive structure plays a major role in producing it gives much evidence about what goals that cognitive structure might have.
Are you saying that pre-superhuman behavior doesn’t tell you about its goals? Like, zero mutual information? Doesn’t this prove too much, without relying on more details of the training process? By observing a 5-year-old, you can definitely gather evidence about their adult goals, you just have to interpret it skillfully (which is harder for AIs, of course).
Before there’s a superhuman-level cognitive structure, circuits noticeable in the neural network don’t tell you what goals that cognitive architecture will end up pursuing upon reflection.
I understand this to mean: “If you understand an AI’s motivations before it’s superhuman, that tells you relatively little about its post-reflection values.” I strongly disagree. Isn’t the whole point of the AI improving itself to better achieve its goals at the time of self-improvement?
A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail.
I also disagree with this. I think that alignment thinking is plagued by nonspecific, nonconcrete abstract failure modes which may or may not correspond to reasonable chains of events. Often I worry that it’s just abstract reasoning all the way down—that an alignment researcher has never sketched out an actual detailed example of a situation which the abstract words describe.
For example, I think I have very little idea what the sharp left turn is supposed to be. If Nate wrote out a very detailed story, I think I would understand. I might disagree with e.g. how he thinks SGD dynamics work, but I could read the story and say “oh, because Nate thinks that time-bias allows faster circuits to gain more control over cognition, they can ‘betray’ the other motivational circuits and execute an internal coup, and we got here because [the rest of Nate’s story].”
(Importantly, these details would have to be concrete. Not “you train the AI and it stops doing what you want”, that’s not a specific concrete situation.)
But right now, there’s a strong focus on possibly inappropriate analogies with evolution. That doesn’t mean Nate is wrong. It means I don’t know what he’s talking about. I really wish I did, because he is a smart guy, and I’d like to know whether I agree or disagree with his models.
If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
I’m not aware of a way to select features leading to lots of diamonds when these features are present in a superhuman AGI.
I was referring to a situation where the AI already is selecting plans on the basis of whether they lead to diamonds. This is, by assumption, a fact about its motivations. I perceive you to believe that e.g. the AI needs to keep “improving” its “diamond value” in order to, later in training, still select plans on the basis of diamonds.
If so—what does this mean? Why would that be true?