Planned summary for the Alignment Newsletter:
This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:
1. Come up with some alignment algorithm that solves the issues identified so far
2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.
This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won’t happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.
This methodology could play out as follows:
Step 1: RL with a handcoded reward function.
Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).
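As a toy illustration of this failure mode (a hypothetical example of my own, not one taken from the linked list): the intended goal below is a clean room, but the hand-coded reward only penalizes visible mess and effort, so the optimizer picks the degenerate policy.

```python
# Toy specification-gaming example (illustrative assumption, not from the post):
# the proxy reward forgets to penalize hidden mess, so optimizing it hard
# selects "shove mess under the rug" over actually cleaning.

policies = {
    "actually clean the room":  {"mess_visible": 0,  "mess_remaining": 0,  "effort": 10},
    "shove mess under the rug": {"mess_visible": 0,  "mess_remaining": 10, "effort": 1},
    "do nothing":               {"mess_visible": 10, "mess_remaining": 10, "effort": 0},
}

def handcoded_reward(outcome):
    # Proxy objective: penalize visible mess and effort only.
    return -outcome["mess_visible"] - 0.1 * outcome["effort"]

def true_utility(outcome):
    # What we actually care about: how much mess is really left.
    return -outcome["mess_remaining"]

best = max(policies, key=lambda p: handcoded_reward(policies[p]))
print("policy chosen by the proxy reward:", best)                    # shove mess under the rug
print("true utility of that policy:", true_utility(policies[best]))  # -10
```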
Step 1: RL from human preferences over behavior, or other forms of human feedback.
Step 2: The system might still take actions that are bad in ways humans can’t recognize as bad. For example, it might write a well-researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that doing so would increase partisanship, leading to civil war.
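To make the human-feedback step concrete, here is a minimal, self-contained sketch of fitting a reward model from pairwise preferences with a Bradley-Terry / logistic loss. The toy trajectory features, the linear reward model, and the simulated human are all my own illustrative assumptions, not the setup discussed in the post.

```python
# Sketch: learn a reward model from pairwise "human" preferences (toy example).
import numpy as np

rng = np.random.default_rng(0)

# Each trajectory is summarized by a 2-d feature vector; the (hidden) human
# preference only cares about feature 0.
def simulated_human_preference(traj_a, traj_b):
    return 0 if traj_a[0] > traj_b[0] else 1  # index of the preferred trajectory

trajectories = rng.normal(size=(200, 2))
pairs = [(trajectories[i], trajectories[i + 1]) for i in range(0, 200, 2)]
labels = [simulated_human_preference(a, b) for a, b in pairs]

# Linear reward model r(traj) = w . traj, trained so that
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = np.zeros(2)
    for (a, b), label in zip(pairs, labels):
        p_a = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))  # model's P(a preferred)
        target = 1.0 if label == 0 else 0.0
        grad += (target - p_a) * (a - b)              # gradient of the log-likelihood
    w += lr * grad / len(pairs)

print("learned reward weights:", w)  # the weight on feature 0 should dominate
# This learned reward would then stand in for the hand-coded reward as the RL
# training signal.
```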
Step 1: Use iterated amplification to construct a feedback signal that is “smarter” than the AI system it is training.
Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.
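Here is a deliberately tiny, runnable sketch of the decomposition idea behind amplification, using list summation as a stand-in task; the task and the overseer’s decomposition strategy are illustrative assumptions of mine, not the actual construction from iterated amplification.

```python
# Sketch: an overseer that decomposes a question into subquestions the weak
# model can handle produces a "smarter" answer than the weak model alone.

def weak_model(a, b):
    """Stand-in for the trained model: can only add two numbers at a time."""
    return a + b

def amplified_overseer(numbers):
    """Answer "what is the sum of this list?" by recursively decomposing it
    into subquestions small enough for the weak model."""
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    left = amplified_overseer(numbers[:mid])   # delegate each half as a subquestion
    right = amplified_overseer(numbers[mid:])
    return weak_model(left, right)             # combine subanswers with one model call

print(amplified_overseer([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
# In iterated amplification, the model is then retrained to imitate this
# amplified overseer, and the process repeats.
```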
Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA ‘Learning the Prior’)@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.
Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about.
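A very schematic, self-contained sketch of the imitative generalization idea, under heavy simplifying assumptions of my own: the “descriptions” are short human-readable rules from a hard-coded list, the “human” can apply a given rule but never induces one from data, and the prior just favors shorter rules.

```python
# Sketch: pick the human-legible description that best explains labeled data
# (prior + likelihood), then have the "human" use it on the shifted distribution.
labeled_data = [(1, 3), (2, 5), (3, 7), (4, 9)]  # training distribution: y = 2x + 1
new_inputs = [50, 100]                           # shifted distribution we must label

# Candidate human-readable descriptions (a real system would search a much
# richer space than this hard-coded list).
descriptions = {
    "y equals two times x plus one": lambda x: 2 * x + 1,
    "y equals x plus two":           lambda x: x + 2,
    "y equals three times x":        lambda x: 3 * x,
}

def human_applies(rule_text, x):
    """Stand-in for a human who can follow an explicit rule but not induce one."""
    return descriptions[rule_text](x)

def score(rule_text):
    log_prior = -0.1 * len(rule_text)  # prefer shorter descriptions
    log_likelihood = -sum((human_applies(rule_text, x) - y) ** 2 for x, y in labeled_data)
    return log_prior + log_likelihood

best = max(descriptions, key=score)
print("chosen description:", best)
print("labels on the new distribution:", [human_applies(best, x) for x in new_inputs])
```

The failure mode described above corresponds to the chosen description being good enough to predict the labels while omitting the facts that connect those predictions to what we care about.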
The post also talks about various possible objections you might have, which I’m not going to summarize here.
Planned opinion:
I’m a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.
I’m less clear on how exactly you move between the two steps. From my perspective, there is a core reason for worry, which is something like “you can’t fully control what patterns of thought your algorithm learns, and how they’ll behave in new circumstances”, and it feels like you could always apply that as your step 2. The algorithms we design are instead meant to chip away at the problem by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn’t count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](https://www.alignmentforum.org/posts/EF5M6CmKRd6qZk27Z/my-research-methodology?commentId=8Hq4GJtnPzpoALNtk).
> from my perspective, there is a core reason for worry, which is something like “you can’t fully control what patterns of thought your algorithm learns, and how they’ll behave in new circumstances”, and it feels like you could always apply that as your step 2
That doesn’t seem like it has quite the type signature I’m looking for. I’m imagining a story as a description of how something bad happens, so I want the story to end with “and then something bad happens.”
In some sense you could start from the trivial story “Your algorithm didn’t work and then something bad happened.” Then the “search for stories” step is really just trying to figure out if the trivial story is plausible. I think that’s pretty similar to a story like: “You can’t control what your model thinks, so in some new situation it decides to kill you.”
I’m mostly doing that by making it more and more concrete—something is plausible iff there is a plausible way to fill in all the details. E.g. how is the model thinking, and why does that lead it to decide to kill you?
Sometimes after filling in a few details I’ll see that the current story isn’t actually plausible after all (i.e. now I see how to argue that the details-so-far are contradictory). In that case I backtrack.
Sometimes I fill in enough details that I’m fairly convinced the story is plausible, i.e. that there is some way to fill in the rest of the details that’s consistent with everything I know about the world. In that case I try to come up with a new algorithm or new assumption.
(Sometimes plausibility takes the form of an argument that there is a way to fill in some set of details, e.g. maybe there’s an argument that a big enough model could certainly compute X. Or sometimes I’m just pretty convinced for heuristic reasons.)
That’s not a fully-precise methodology. But it’s roughly what I’d do. (There are many places where the methodology in this post is not fully-precise and certainly not mechanical.)
If I was starting from the trivial story “and then your algorithm kills you,” my first move would usually be to try to say what kind of model was learned, which needs to behave well on the training set and plausibly kill you off distribution. Then I might try to shoot that story down by showing that some other model behaves even better on the training set or is even more readily learned (to try to contradict the part where the story needed to claim “And this was the model learned by SGD”), then gradually filling in more details as necessary to evaluate plausibility of the story.
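As a toy illustration (my own, not from this thread) of that first move: exhibit a candidate model that matches the intended behavior on every training input yet fails badly off distribution, so that the argument has to shift to which of the two models SGD would actually find.

```python
# Two models that are indistinguishable on the training set but diverge off
# distribution; "circumstance C" here is simply any unseen input.
train_inputs = [0, 1, 2, 3, 4]
train_labels = [2 * x + 1 for x in train_inputs]

def intended_model(x):
    return 2 * x + 1

def defecting_model(x):
    # Identical to the intended model on the training distribution,
    # catastrophically different outside it.
    return 2 * x + 1 if x in train_inputs else -10**6

for model in (intended_model, defecting_model):
    train_error = sum(abs(model(x) - y) for x, y in zip(train_inputs, train_labels))
    print(model.__name__, "| train error:", train_error, "| output at x=100:", model(100))

# Both achieve zero training error, so training loss alone cannot rule the story
# out; the "shoot the story down" step has to argue about which model is more
# readily learned.
```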
> In some sense you could start from the trivial story “Your algorithm didn’t work and then something bad happened.” Then the “search for stories” step is really just trying to figure out if the trivial story is plausible. I think that’s pretty similar to a story like: “You can’t control what your model thinks, so in some new situation it decides to kill you.”
To fill in the details more:
Assume that we’re finding an algorithm to train an agent with a sufficiently large action space (i.e. we don’t get safety via the agent having such a restricted action space that it can’t do anything unsafe).
It seems like in some sense the game is to constrain the agent’s cognition so that it is “safe” and “useful”. The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.
However, there are always going to be some plausible circumstances that we didn’t consider (even if we’re talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won’t have been tested in these unconsidered plausible circumstances. It is always possible that one of them misfires in a way that makes the agent do something unsafe.
(This wouldn’t be true if we had some sort of proof against misfiring, that doesn’t assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I’m pretty sure you agree with that.)
More generally, this story is going to be something like:
1. Suppose you trained your model M to do X using algorithm A.
2. Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
3. As a result, the model learned heuristic H, which works in all the circumstances you did consider, but fails in circumstance C.
4. Circumstance C then happens in the real world, leading to an actual failure.
Obviously, I can’t usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I’m not arguing that any of this is probable. However, it seems to meet your bar of “plausible”:
> there is some way to fill in the rest of the details that’s consistent with everything I know about the world.
EDIT: Or maybe more accurately, I’m not sure how exactly the stories you tell are different / more concrete than the ones above.
----
When I say you have “a better defined sense of what does and doesn’t count as a valid step 2”, I mean that there’s something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don’t know what that something is; and that’s why I would have a hard time applying your methodology myself.
----
Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits); sometimes human cognition isn’t up to the task (like when being teleported to a universe with different physics and immediately dying), or relearns in a way that doesn’t agree with other humans (like how some humans would push a button that automatically wireheads everyone for all time, while others would find that abhorrent).
> As a result, the model learned heuristic H, which works in all the circumstances you did consider, but fails in circumstance C.
That’s basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?
I agree this involves discretion, and indeed moving beyond the trivial story “The algorithm fails and then it turns out you die” requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I’m doing some in-between thing, which is roughly like: I’m allowed to push on the story to make it more concrete along any axis, but I recognize that I won’t have time to pin down every axis, so I’m basically only going to do this a bounded number of times before I have to admit that it seems plausible enough (so I can’t fill in a billion parameters of my model one by one this way; what’s worse, filling in those parameters would take even more than a billion units of time, and so this may become intractable even before you get to a billion).
> I agree this involves discretion [...] So instead I’m doing some in-between thing
Yeah, I feel like that’s the part where I don’t think I could replicate your intuitions (yet).
I don’t think we disagree; I’m just noting that this methodology requires a fair amount of intuition / discretion, and I don’t feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.
(Probably I could have been clearer about this in the original opinion.)