I disagree with this definition of outer alignment and I think it is quite distinct from my definition.
My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.
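Somewhat more formally, as a rough sketch (writing $\mathcal{M}$ for the space of models and $\mathcal{D}$ for the data distribution; this notation is just illustrative):

$$r \text{ is outer aligned} \;\iff\; \text{every } M \in \operatorname*{argmax}_{M' \in \mathcal{M}} \mathbb{E}_{x \sim \mathcal{D}}\big[r(M', x)\big] \text{ is aligned}.$$

Taking the expectation over $\mathcal{D}$ itself rather than over a finite sample is what unlimited data gets you, and quantifying over the whole argmax set is what perfect optimization gets you.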
My understanding of your definition is that you want to say that a training process is outer aligned if all models optimal under that training process in the limit of perfect optimization are aligned.
The major differences between our definitions appear to be that mine only talks about the objective function while yours talks about the whole training process, and that mine takes the infinite data limit whereas yours does not. As a result, I think our definitions have quite different implications—under my definition, for example, deception is always an inner alignment failure, as you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).
you should never get deception in the limit of infinite data (since a deceptive model has to defect on some data point).
I think a model can be deceptively aligned even if formally it maps every possible input to the correct (safe) output. For example, suppose that on input X the inference execution hacks the computer on which the inference is being executed, in order to do arbitrary consequentialist stuff (while the inference logic, as a mathematical object, formally yields the correct output for X).
Sure—we’re just trying to define things in the abstract here, though, so there’s no harm in just defining the model’s output to include stuff like that as well.
So, if a distribution shift causes a problem but there’s no inner optimizer, would that count as an inner alignment problem?
Also, infinite data does not at all imply that the training distribution is the same as the deployment distribution. Even if an inner optimizer defects on some data point, that point can be rare in the training distribution but more common in deployment, resulting in a high reward in training but bad behavior in deployment.
You could instead define outer alignment as “all models optimal under r in the limit of perfect optimization and evaluated on the deployment distribution are aligned”, which seems more like what you intend.
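Spelled out, that would be something like (using $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{deploy}}$ as informal labels for the training and deployment distributions, and $\mathcal{M}$ for the model space): every $M \in \operatorname*{argmax}_{M' \in \mathcal{M}} \mathbb{E}_{x \sim \mathcal{D}_{\text{train}}}\big[r(M', x)\big]$ must behave in an aligned way on inputs drawn from $\mathcal{D}_{\text{deploy}}$. The optimization is still with respect to the training distribution; only the check for alignment moves to the deployment distribution.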
This all still seems to leave us with a notion of “inner alignment” which does not necessarily involve any inner optimizers.
So, if a distribution shift causes a problem but there’s no inner optimizer, would that count as an inner alignment problem?
Note that I’m only defining outer alignment here, not inner alignment. I think inner alignment is actually more difficult to define than outer alignment. In “Risks from Learned Optimization,” we define it as whether the mesa-optimizer’s mesa-objective is aligned with the base objective, but that definition only makes sense if you actually have a mesa-optimizer. If you want a more general definition, you could just define inner alignment as whether the model actually produced by the training process has good behavior according to the base objective.
Also, infinite data does not at all imply that the training distribution is the same as the deployment distribution. Even if an inner optimizer defects on some data point, that point can be rare in the training distribution but more common in deployment, resulting in a high reward in training but bad behavior in deployment.
I mean “infinite data” to include both all training data and all deployment data. As for which distribution: at least if we’re doing supervised learning (e.g. imitative amplification), it shouldn’t matter, since I’m asking for perfectly optimal performance, which means the model needs to be optimal on every data point. It gets a bit more complicated for RL, but there you can just say that it needs to be optimal relative to the deployment MDP.
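To sketch that slightly more concretely (with the caveat that this is informal, and assumes the model class contains a model that is optimal everywhere): in the supervised case, being optimal in expectation over a distribution whose support covers every input forces pointwise optimality, i.e. $r(M, x) = \max_{M' \in \mathcal{M}} r(M', x)$ for each individual $x$, which leaves no training/deployment gap for a deceptive model to exploit. In the RL case, the analogous condition is that the policy satisfies $\pi \in \operatorname*{argmax}_{\pi'} \mathbb{E}\big[\sum_t \gamma^t\, r(s_t, a_t)\big]$ with trajectories generated by the deployment MDP (and $\gamma$ whatever discounting, if any, the setup uses).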
Ok, I buy this as a valid definition. Can you give some intuition for why it’s useful? Once we divorce the notion of “inner alignment” from inner optimizers, I do not see the point of having a name for it—the language overload is just going to lead to an unfounded intuition that inner optimizers are the main problem.
It’s worth noting that “inner optimizer” isn’t a term we used in “Risks from Learned Optimization,” nor is it a term I use myself or encourage others to use—rather, we used mesa-optimizer, which doesn’t have the same parallel with inner alignment. In hindsight, we probably should have separated things out and used inner alignment exclusively for the general definition and “mesa-alignment” or some other similar term for the more specific one (perhaps I should try to write a post attempting to fix this).
I agree with John that we should define inner alignment only in contexts where mesa-optimisers exist, as you do in Risks from Learned Optimization (and as I do in AGI safety from first principles); that is, inner alignment = mesa-alignment. Under the “more general” definition you propose here, inner alignment includes any capabilities problem and any robustness problem, so it’s not really about “alignment”.
I think it’s possible to formulate definitions of inner alignment that don’t rely on mesa-optimizers but exclude capabilities (e.g. by using concepts like 2-D robustness), though I agree that it gets a lot trickier when you try to do that, which was a big part of why we just went with the very restrictive mechanistic definition that we ended up using in Risks.
Ok, I just updated the end of the OP to account for these definitions. Let me know if the reworded argument makes sense.

UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they’re thinking of “inner alignment” in a way which does not necessarily involve any inner optimizer at all. They’re thinking of generalization error as “inner alignment failure” essentially by definition, regardless of whether there’s any inner optimizer involved. Conversely, they think of “outer alignment” in a way which ignores generalization errors.
I don’t think this is true. I never said that inner alignment didn’t involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.
Do you agree that (a) you’re thinking of “outer alignment” in a way which excludes generalization error by definition, and (b) generalization error can occur regardless of whether any inner optimizer is present?
Yes—I agree with both (a) and (b). I just don’t think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.
Oh excellent, glad to see a fresh post on it.