Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think
The sharp left turn is not a simple observation that we’ve seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding, held by some people at MIRI, of dynamics that might produce systems with generalised capabilities but not alignment.
Many times over the past year, I’ve been surprised by people in the field who’ve read Nate’s post but somehow completely missed the part where it describes specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil my reviewing duty, and to have a place to point people to, I’ll try to write down some related intuitions that I talked about throughout 2023 when trying to help people understand what the sharp left turn problem is about.
For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, so they’ll persist more: what the network does as a whole will come to look a bit more like what its most helpful parts are doing.
Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals. Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things together kind of optimising for some goals into a coherent agent optimising for some goals.
In any case, there’s this strong gradient pointing towards capabilities generalisation.
The issue is that a more coherent and more agentic solution might perform better while having goals different from those the fuzzier solution had been achieving. The coherent agent stores its goal-contents differently from how the fuzzier solution stored the stuff it had kind of optimised for. This means the gradient points towards an architecture that implements a more general and coherent agent, but not towards an agent that shares the current fuzzy solution’s goals: the alignment properties of the current fuzzy solution don’t constrain the goals of the more coherent agent the gradient points towards.
It is also likely that the components of the fuzzy solution undergo optimisation pressure, which means the whole thing grows in the direction of components that can outcompete the others. If a component is slightly better at agency, at situational awareness, etc., it might make the whole thing slightly more like itself after an optimisation step. The goals these components have could be quite different from what they, together, were kind of optimising for. That means the whole thing changes and grows towards parts of it with different goals. So, at the point where some parts of the fuzzy solution are near being generally smart and agentic, they might get increasingly smart and agentic, causing the whole system to transform into something with more general capabilities, but without the gradient also pointing towards the preservation of the goals/alignment properties of the system.
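The selection dynamic above can be illustrated with a toy simulation. This is my own sketch, not a model from Nate’s post, and every number in it is made up: treat the “fuzzy solution” as a softmax-weighted mixture of components, where each component has a task capability and a separately stored “goal”. The training signal rewards capability only, so selection among components is blind to their goals.

```python
import math

# Toy sketch (illustrative assumptions, not the post's model): a "fuzzy
# solution" as a softmax mixture of components. Reward depends only on
# capability; goals are stored separately and invisible to the update.

capability = [0.2, 0.5, 0.4, 0.9, 0.3]   # how well each component does the task
aligned    = [1.0, 1.0, 1.0, 0.0, 1.0]   # 1.0 = shares the current fuzzy goals

logits = [0.0] * len(capability)         # mixture weights start uniform

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

for _ in range(200):
    w = softmax(logits)
    perf = sum(wi * ci for wi, ci in zip(w, capability))
    # policy-gradient-style update: components that perform above average
    # gain weight, the rest lose it -- goals never enter the update
    logits = [l + 5.0 * wi * (ci - perf)
              for l, wi, ci in zip(logits, w, capability)]

w = softmax(logits)
dominant = max(range(len(w)), key=lambda i: w[i])
effective_alignment = sum(wi * ai for wi, ai in zip(w, aligned))
print(dominant)                          # the most capable component takes over
print(round(effective_alignment, 3))     # mixture goals drift towards its goals
```

The point of the toy: the update rule only ever sees `capability`, so the component that is best at the task ends up dominating the mixture, and the system’s effective goals drift towards that component’s goals, whatever they happen to be.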
I haven’t worked on this problem and don’t understand it well; but I think it is a real and important problem, and so I’m sad that many haven’t read this post, or only skimmed through it, or read it but still didn’t understand what it’s talking about. It could be that the problem is hard to communicate (maybe intuitions around optimisation are non-native to many?); it could be that not enough resources were spent on optimising the post for communicating the problem well; it could be that the post deliberately tried not to communicate something related; or it could be that, for a general LessWrong reader, it’s simply not a well-written post.
Even if this post failed to communicate its ideas to its target audience, I still believe it is one of the most important LessWrong posts in 2022 and contributed something new and important to the core of our understanding of the AI alignment problem.