Planned summary for the Alignment Newsletter:

We often split AI alignment into two parts: outer alignment, or “finding a good reward function”, and inner alignment, or “robustly optimizing that reward function”. However, these are not very precise terms, and they don’t form clean subproblems. In particular, for outer alignment, how good does the reward function have to be? Does it need to incentivize good behavior in all possible situations? How do you handle the no free lunch theorem? Perhaps you only need to handle the inputs in the training set? But then what specifies the behavior of the agent on new inputs?
This post proposes an operationalization of outer alignment that admits a clean subproblem: _low stakes alignment_. Specifically, we assume that we don’t care much about any small number of decisions that the AI makes—only a large number of decisions, in aggregate, can have a large impact on the world. This rules out things like the AI quickly seizing control of resources before we have a chance to react. We do not expect this assumption to be true in practice: the point here is to solve an easy subproblem, in the hope that the solution will be useful for solving the hard version of the problem.
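One rough way to formalize this (my gloss on the idea above, not necessarily the exact definition in the post; U, a_t, and ε are my notation):

```latex
% Sketch only: U is the overall utility, a_t the AI's decision at step t,
% and \varepsilon a small per-decision influence bound (my notation).
\[
  \bigl|\, \mathbb{E}[\,U \mid a_{1:t-1},\, a_t = a\,] - \mathbb{E}[\,U \mid a_{1:t-1},\, a_t = a'\,] \,\bigr| \;\le\; \varepsilon
  \qquad \text{for all } t,\ a,\ a'.
\]
```

Under a bound like this, any k decisions together shift expected utility by at most about kε, so only a large number of decisions can have a large aggregate impact, which is what gives us room to notice problems and retrain.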
The main power of this assumption is that we no longer have to worry about distributional shift. We can simply keep collecting new data online and training the model on the new data. Any decisions it makes in the interim period could be bad, but by the low-stakes assumption, they won’t be catastrophic. Thus, the primary challenge is obtaining a good reward function that incentivizes the right behavior after the model is trained. We might also worry about whether gradient descent will successfully find a model that optimizes the reward even on the training distribution—after all, gradient descent has no guarantees for non-convex problems—but it seems like, to the extent that gradient descent doesn’t do this, it will probably affect aligned and unaligned models equally.
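As a minimal sketch of that online-training picture (the names and signatures here are hypothetical illustrations of the idea, not an algorithm from the post):

```python
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Episode = TypeVar("Episode")


def train_online(
    policy: Policy,
    collect_episode: Callable[[Policy], Episode],
    reward_fn: Callable[[Episode], float],
    update: Callable[[Policy, List[Tuple[Episode, float]]], Policy],
    num_rounds: int,
    batch_size: int,
) -> Policy:
    """Act, score fresh behavior with the reward function, update, repeat.

    Under the low-stakes assumption the decisions made between updates can be
    bad but not catastrophic, so there is no separate distributional-shift
    problem: each update trains on data from the distribution the policy
    currently faces.
    """
    for _ in range(num_rounds):
        # Interim decisions happen here; by assumption they are low stakes.
        episodes = [collect_episode(policy) for _ in range(batch_size)]
        # The main remaining challenge: reward_fn has to incentivize the
        # right behavior on exactly this kind of data.
        labeled = [(ep, reward_fn(ep)) for ep in episodes]
        policy = update(policy, labeled)
    return policy
```

The point is just that the reward function is only ever asked to judge data the current policy actually produced.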
Note that this subproblem is still non-trivial, and existential catastrophes still seem possible if we fail to solve it. For example, one way that the low-stakes assumption could be made true would be if we had a lot of bureaucracy and safeguards that the AI system had to go through before making any big changes to the world. It still seems possible for the AI system to cause lots of trouble if none of the bureaucracy or safeguards can understand what the AI system is doing.
Planned opinion:
I like the low-stakes assumption as a way of saying “let’s ignore distributional shift for now”. However, I think that’s more because it agrees with my intuitions about how to carve up the problem of alignment than because it feels like an especially clean subproblem. It seems like there are other ways to get similarly clean subproblems, like “assume that the AI system is trying to optimize the true reward function”.
That being said, one way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). Plausibly that type of cleanliness is correlated with good decompositions of the alignment problem, and so it is not a coincidence that my intuitions tend to line up with “clean subproblems”.
> It seems like there are other ways to get similarly clean subproblems, like “assume that the AI system is trying to optimize the true reward function”.
My problem with this formulation is that it’s unclear what “assume that the AI system is trying to optimize the true reward function” means—e.g. what happens when there are multiple reasonable generalizations of the reward function from the training distribution to a novel input?
I guess the natural definition is that we actually give the algorithm designer a separate channel to specify a reward function directly, rather than by providing examples. It’s not easy to do this (note that a reward function depends on the environment, so it’s not something we can specify precisely by giving code, and the details of how the reward is embedded in the environment are critical; also note that “actually trying” then depends on the beliefs of the AI in a way that is similarly hard to define), and I have some concerns with various plausible ways of doing that.
I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.
It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer’s preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.
I agree that “actually trying” is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is “actually trying” to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).
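One hedged way to write down that “any optimal policy is beneficial” criterion (my notation, not anything from the post):

```latex
% Sketch only: R is the specified reward, \Pi the policy class, and
% J_R(\pi) the expected return of policy \pi under R (my notation).
\[
  \text{Find } R \text{ such that } \quad
  \forall\, \pi^{*} \in \arg\max_{\pi \in \Pi} J_R(\pi): \;\; \pi^{*} \text{ is beneficial / aligned},
  \qquad \text{where } J_R(\pi) = \mathbb{E}_{\pi}\!\Bigl[\, \textstyle\sum_t R(s_t, a_t) \Bigr].
\]
```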
Just to reiterate, I agree that the low-stakes formulation is better; I just think that my reasons for believing that are different from “it’s a clean subproblem”. My reason for liking it is that it doesn’t require you to specify a perfect reward function upfront, only a reward function that is “good enough”, i.e. it incentivizes the right behavior on the examples on which the agent is actually trained. (There might be other reasons too that I’m failing to think of now.)
> I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.
This certainly raises a lot of questions though—what form do these states take? How do I specify a reward function that takes as input a state of the world?
> I agree that “actually trying” is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is “actually trying” to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).
I’m also quite scared of assuming optimality. For example, doing so would assume away sample complexity and would open up whole strategies (like arbitrarily big debate trees or debates against random opponents who happen to sometimes give good rebuttals) that I think should be off limits for algorithmic reasons regardless of the environment (and some of which are dead ends with respect to the full problem).
> It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer’s preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.
I feel like low-stakes makes a plausible empirical assumption under which it turns out to be possible to ignore many of the problems associated with embeddedness (because in fact the reward function is protected from tampering). But I’m much more scared about the other consequences of assuming a Cartesian boundary (where e.g. I don’t even know the type signatures of the objects involved any more).
A way you could imagine this going wrong, that feels scary in the same way as the alternative problem statements, is if “are the decisions low stakes?” is a function of your training setup, so that you could unfairly exploit the magical “reward functions can’t be tampered with” assumption to do something unrealistic.
But part of why I like the low stakes assumption is that it’s about the problem you face. We’re not assuming that every reward function can’t be tampered with, just that there is some real problem in the world that has low stakes. If your algorithm introduces high stakes internally then that’s your problem and it’s not magically assumed away.
This isn’t totally fair because the utility function U in the low-stakes definition depends on your training procedure, so you could still be cheating. But I feel much, much better about it.
I think this is basically what you were saying with:
> That being said, one way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem).
That seems like it might capture the core of why I like the low-stakes assumption. It’s what makes it so that you can’t exploit the assumption in an unfair way, and so that your solutions aren’t going to systematically push up against unrealistic parts of the assumption.
Yeah, all of that seems right to me (and I feel like I have a better understanding of why assumptions on inputs are better than assumptions on outputs, which was more like a vague intuition before). I’ve changed the opinion to:
> I like the low-stakes assumption as a way of saying “let’s ignore distributional shift for now”. Probably the most salient alternative is something along the lines of “assume that the AI system is trying to optimize the true reward function”. The main way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). This seems to be a lot nicer because it is harder to “unfairly” exploit a not-too-strong assumption on an input than one on an output. See [this comment thread](https://www.alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment?commentId=askebCP36Ce96ZiJa) for more discussion.
Sounds good/accurate.