Thanks for the comment! (And sorry for the delayed reply: I was at a CFAR workshop when I posted this, and took some days off after the workshop.)
The text’s target audience was two people whom I’d expect to understand my intuitions, so I did not attempt to fully justify some claims; I’ve added a note about that to the post. I also apologize for the post title: it’s a claim the text doesn’t justify, and one that represents my views less well. I stand by the other three mentioned claims. I’m not sure where exactly the crux lies, though. I’d be interested in having a call for higher-bandwidth communication. Feel free to schedule (30 min | 60 min).
I’ll try to dump the chunks of my model that seem relevant and might clarify a bit where the cruxes lie.
Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?
My picture is, roughly:
We’re interested in where we end up when we get to a superhuman AI capable enough to prevent other AGIs from appearing until alignment is solved.
There’s a large class of cognitive architectures that kill everyone.
A lot of cognitive architectures, when implemented (in a neural network, or a neural network plus something external) and put into a training setup from a large class of training setups, would score well[1] on any of a broad class of loss functions, quickly become highly context-aware, and go do their thing in the real world (e.g., kill everyone).
There’s a smaller class of cognitive architectures that don’t kill everyone and would allow humanity to solve alignment and launch an aligned AGI.
We need to think about which cognitive architectures we’re aiming for, and we need concrete stories for why we succeed at getting to them. How a thing that has bits of the final structure behaves until it gets to a superhuman level is only important insofar as it helps us hit a tiny target.
(Not as certain, but part of my current picture.) If we have a neural network implementing a cognitive architecture that a superhuman AI might have, gradient descent on the loss functions of a broad range of training setups won’t change that cognitive architecture or its preferences much (a toy sketch of the mechanical intuition is below, after these points).
We need a concrete story for why our training setup ends up at a cognitive architecture that has a highly specific set of goals such that it doesn’t kill everyone.
No behavior we see, until a superhuman cognitive structure plays a major role in producing it, gives much evidence about what goals that cognitive structure might have.
(Not certain; depends on how general grokking is.) The gradient might be immediately pointing at some highly capable cognitive structures!
You need really good reasons to expect the cognitive structure you’ll end up with to be something that doesn’t end humanity. Before there’s a superhuman-level cognitive structure, circuits noticeable in the neural network don’t tell you what goals that cognitive architecture will end up pursuing upon reflection. In my view, this is closely related to argument 22 and the Sharp Left Turn. If you don’t have strong reasons to believe you successfully dealt with those, you die.
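To gesture at the mechanics behind the “gradient descent won’t change that cognitive architecture or its preferences much” point, here’s a minimal toy sketch. It’s my own illustration under a big simplifying assumption (a linear “policy” standing in for a whole cognitive architecture): once the current parameters already score nearly perfectly on the training loss, the gradients they receive are close to zero, so whatever internal structure produces that behavior is barely updated.

```python
# Toy illustration (hypothetical, not from the post): a policy that already
# achieves near-zero training loss receives near-zero gradients, so whatever
# structure (or "preferences") the parameters encode is barely moved by SGD.
import numpy as np

rng = np.random.default_rng(0)

# Tiny task: the loss rewards predicting whether x @ w_true > 0.
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(256, 2))
y = (X @ w_true > 0).astype(float)

def loss_and_grad(w):
    """Logistic loss of a linear policy and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

w_capable = 10.0 * w_true        # already produces the rewarded behavior
w_random = rng.normal(size=2)    # does not

for name, w in [("already-scores-well", w_capable), ("random-init", w_random)]:
    loss, grad = loss_and_grad(w)
    print(f"{name}: loss={loss:.4f}  |grad|={np.linalg.norm(grad):.6f}")
# The already-scoring-well parameters get a tiny gradient; further training has
# little reason to move them, regardless of how they'd behave off-distribution.
```

This is obviously a cartoon, not the full claim about preferences inside large networks, but it’s the mechanical sense in which “already scoring well” shields internals from further gradient pressure.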
I’m interested in hearing a concrete story here
A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail. There are things in the setup that clearly lead to failure. I ignored some of them (e.g., RLHF) and pointed at others: parts of the setup that don’t incentivize anything like a specific set of concrete goals whose maximization leaves survivors, but do incentivize being more agentic and context-aware.
I specifically meant that, generally, when there are highly agentic and context-aware regions of the nearby gradient space, gradient descent will update the weights towards them, slowly installing a capable cognitive architecture. In the specific training setup I described, there’s a lot of pressure towards being more agentic: if you start rewarding an LLM for where the text ends up, variations of that LLM that are more focused on getting to a result will be selected. If you haven’t come up with a way to point this process at a rare cognitive architecture that leaves survivors, it won’t end up at one. The capabilities will generalize to acting effectively to steer the lightcone’s future. There are reasons for unaligned goals to be closer to what the AGI ends up with than aligned goals, and there’s no reason for aligned behaviour to generalize exactly the way you imagined.
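To make the “rewarding an LLM for where the text ends up” pressure a bit more concrete, here’s a minimal, hypothetical REINFORCE-style sketch (the variants, numbers, and reward are mine, purely for illustration): because the reward only looks at the final outcome, whichever behavioral variant most reliably steers to that outcome gets upweighted, whatever else it does along the way.

```python
# Hypothetical toy: outcome-only reward upweights the variant that most
# reliably reaches the outcome, not the variant that most resembles the
# demonstrations step by step.
import numpy as np

rng = np.random.default_rng(0)

# Three behavioral "variants" a model could sample from, with made-up
# probabilities that the generated text ends at the rewarded outcome:
#   0: imitates demonstrations            -> 30%
#   1: somewhat outcome-directed          -> 60%
#   2: strongly outcome-directed/agentic  -> 95%
p_success = np.array([0.30, 0.60, 0.95])

logits = np.zeros(3)   # policy starts indifferent between variants
baseline = 0.0
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(5000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                    # sample a variant
    reward = float(rng.random() < p_success[a])   # reward depends ONLY on the outcome
    baseline += 0.01 * (reward - baseline)        # running-average baseline
    grad_logp = -probs
    grad_logp[a] += 1.0                           # gradient of log pi(a) w.r.t. logits
    logits += lr * (reward - baseline) * grad_logp

print("final variant probabilities:", np.round(softmax(logits), 3))
# Typically the most outcome-focused variant ends up dominating, even though
# nothing in the reward ever inspected *how* it got to the outcome.
```

The selection here is over which pre-existing variant gets sampled, which is a much weaker process than SGD sculpting new circuits, so it only illustrates the direction of the pressure, not its strength.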
If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
I’m not aware of a way to select features leading to lots of diamonds when these features are present in a superhuman AGI. If you do RL, the story I imagine is something like: “For most loss functions/training processes you can realistically come up with, there are many goals such that pursuing them leads to the behavior you evaluate highly; only a small fraction of these goals represent wanting to achieve lots of conventional diamond in the real universe; the agents you find maximize some random mixture of these goals (with goals that are less complex, or that can more easily emerge from the initial heuristics used, or that perform better on your loss when optimized for directly, probably having more weight); you probably don’t get diamond-maximization-in-the-actual-universe as a significant part of these goals unless you do something really smart, outside of what I think the field is on the way to achieving; and even if you do, it breaks when the sharp left turn happens.”
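A toy way to see the “many goals such that pursuing them leads to the behavior you evaluate highly” step (my illustration, with a made-up diamond/camera setup): several candidate objectives can be indistinguishable on the training distribution and still come apart off-distribution.

```python
# Hypothetical toy: several "goals" agree on every training episode but
# disagree out of distribution, so high training reward can't tell you which
# of them the learned agent actually internalized.
import numpy as np

rng = np.random.default_rng(0)

# Training "worlds": feature 0 = diamond actually present, feature 1 = vault
# camera light on. In training, the light is on exactly when a diamond is there.
diamond = rng.integers(0, 2, size=(1000, 1))
train_worlds = np.hstack([diamond, diamond])

def goal_real_diamonds(world):   # what we wanted
    return world[0]

def goal_camera_light(world):    # a proxy, perfectly correlated in training
    return world[1]

def goal_mixture(world):         # a blend a learned agent might end up encoding
    return 0.3 * world[0] + 0.7 * world[1]

goals = [goal_real_diamonds, goal_camera_light, goal_mixture]
for g in goals:
    agree = np.mean([np.isclose(g(w), goal_real_diamonds(w)) for w in train_worlds])
    print(f"{g.__name__}: matches the intended goal on {agree:.0%} of training worlds")

# Off-distribution world: an agent capable enough to turn the light on with no
# diamond present (sensor tampering).
tampered = np.array([0, 1])
for g in goals:
    print(f"{g.__name__}: values the tampered world at {g(tampered)}")
```

The training signal alone can’t distinguish which of these the policy internalized; that’s the sense in which “the agents you find maximize some random mixture of these goals”.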
Human values are even more complicated than diamonds, though with them it might be easier to come up with a training process where you miss the difference between what’s actually simple and correlated and what you think is simple and correlated. I believe the iterative process the field might be doing here mostly searches for training setups whose failures we’re unable to find, and most of those still fail. Because of that, I think we need a really good, probably formal, understanding of what it is we want to end up with, and that understanding should produce strong constraints on what a training process for an aligned AGI might look like, which would then hopefully inform us/give us insights into how people should build the thing. We have almost none of that kind of research, with only infra-Bayesianism currently attacking it directly, AFAIK, and I’d really like to see more somewhat promising attempts at this.
Maybe this is coming at alignment stories from the opposite direction: I think the questions of where we end up and how we get there are far more important to think about than things like “here’s a story of what path gradient descent takes and why”.
[1] Not important, but for the sake of completeness: an AGI might instead, e.g., look around and hack whatever it’s running on, without having to score well.
Thanks for your detailed and thoughtful response!
Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?
>5% under current uncertainty.
No behavior we see, until a superhuman cognitive structure plays a major role in producing it, gives much evidence about what goals that cognitive structure might have.
Are you saying that pre-superhuman behavior doesn’t tell you about its goals? Like, zero mutual information? Doesn’t this prove too much, without relying on more details of the training process? By observing a 5-year-old, you can definitely gather evidence about their adult goals, you just have to interpret it skillfully (which is harder for AIs, of course).
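Since mutual information comes up: here’s a toy calculation with made-up numbers, just to pin down the quantitative version of this point. Even a noisy relationship between early behavior and eventual goals gives strictly positive mutual information, i.e., some evidence; the live question is how much there is and how to interpret it, not whether it’s literally zero.

```python
# Toy calculation (made-up joint distribution): noisy early behavior still
# carries nonzero information about later goals.
import numpy as np

# Rows: early behavior (e.g., "shares toys", "hoards toys")
# Cols: later goal     (e.g., "values cooperation", "values dominance")
joint = np.array([
    [0.35, 0.15],
    [0.15, 0.35],
])

def mutual_information(p_xy):
    """I(X;Y) in bits for a discrete joint distribution."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2((p_xy / (p_x * p_y))[mask])))

print(f"I(early behavior; later goal) = {mutual_information(joint):.3f} bits")
# ~0.12 bits: far from determining the goal, but not zero.
```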
Before there’s a superhuman-level cognitive structure, circuits noticeable in the neural network don’t tell you what goals that cognitive architecture will end up pursuing upon reflection.
I understand this to mean: “If you understand an AI’s motivations before it’s superhuman, that tells you relatively little about its post-reflection values.” I strongly disagree. Isn’t the whole point of the AI improving itself to better achieve its goals at the time of self-improvement?
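One way to make the “improving itself to better achieve its goals at the time of self-improvement” point concrete (a hypothetical toy of my own, assuming the agent already evaluates successors with a coherent current utility function): an agent that scores candidate self-modifications by its current values will, other things equal, pick successors that keep pursuing those values.

```python
# Hypothetical toy: an agent that chooses among self-modifications using its
# *current* utility function picks the successor that best serves that
# function; this is the usual intuition for why values tend to persist through
# self-improvement.
from dataclasses import dataclass

@dataclass
class Successor:
    name: str
    capability: float      # how effectively it acts in the world
    value_overlap: float   # fraction of current goals it still pursues (0..1)

def expected_current_utility(s: Successor) -> float:
    # Crude model: value achieved = capability spent on goals the current
    # agent actually cares about.
    return s.capability * s.value_overlap

candidates = [
    Successor("smarter, same goals",         capability=10.0, value_overlap=1.0),
    Successor("much smarter, drifted goals", capability=30.0, value_overlap=0.2),
    Successor("no change",                   capability=1.0,  value_overlap=1.0),
]

chosen = max(candidates, key=expected_current_utility)
print("agent self-modifies into:", chosen.name)
# -> "smarter, same goals": 10.0 beats 30.0 * 0.2 = 6.0 and 1.0.
```

This is only a cartoon of the goal-stability argument; it assumes the agent already has something like stable current values when it self-modifies.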
A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail.
I also disagree with this. I think that alignment thinking is plagued by nonspecific, nonconcrete abstract failure modes which may or may not correspond to reasonable chains of events. Often I worry that it’s just abstract reasoning all the way down—that an alignment researcher has never sketched out an actual detailed example of a situation which the abstract words describe.
For example, I think I have very little idea what the sharp left turn is supposed to be. If Nate wrote out a very detailed story, I think I would understand. I might disagree with e.g. how he thinks SGD dynamics work, but I could read the story and say “oh, because Nate thinks that time-bias allows faster circuits to gain more control over cognition, they can ‘betray’ the other motivational circuits and execute an internal coup, and we got here because [the rest of Nate’s story].”
(Importantly, these details would have to be concrete. Not “you train the AI and it stops doing what you want”, that’s not a specific concrete situation.)
But right now, there’s a strong focus on possibly inappropriate analogies with evolution. That doesn’t mean Nate is wrong. It means I don’t know what he’s talking about. I really wish I did, because he is a smart guy, and I’d like to know whether I agree or disagree with his models.
If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
I’m not aware of a way to select features leading to lots of diamonds when these features are present in a superhuman AGI.
I was referring to a situation where the AI already is selecting plans on the basis of whether they lead to diamonds. This is, by assumption, a fact about its motivations. I perceive you to believe that e.g. the AI needs to keep “improving” its “diamond value” in order to, later in training, still select plans on the basis of diamonds.
If so—what does this mean? Why would that be true?