So if there are safety problems because you deployed the agent, can that ever qualify as inner misalignment under your definition? Or will it always be the fault of the training procedure for not having prepared the agent for the distributional shift to actual deployment? Unlike masturbation, which you can do both in modern times and in the ancestral environment, almost all of the plausible AI threat scenarios involve the AI gaining access to capabilities that it didn’t have access to during training.
By contrast, under the conventional view of inner/outer alignment, we can say: bad behaviour X is an outer alignment problem if it would have been rewarded highly by the training reward function (or a natural generalisation of it). And X is an inner alignment problem if the agent is a mesa-optimiser and deliberately chose X to further its goals. And X is just a robustness problem, not an alignment problem, if neither of these apply.
Alright, let’s take those definitions and apply the argument from the OP.
bad behaviour X is an outer alignment problem if it would have been rewarded highly by the training reward function (or a natural generalisation of it)
So, if we observe bad behavior X in deployment, then one of two things must be the case:
the bad behavior was highly rewarded by the training reward function, so outer alignment failed
the bad behavior was not highly rewarded by the training reward function, so the training process failed to optimize
This applies regardless of whether there’s an inner optimizer. So, the only way an inner optimizer can cause problematic behavior in deployment is if either (a) outer alignment fails, or (b) it’s exploiting imperfect optimization during training.
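The case split above can be made concrete with a toy sketch (my own construction, not anything from the OP — every state, action, and threshold here is hypothetical, and a "behaviour" is reduced to a bare state-action pair):

```python
# Toy sketch of the dichotomy: any bad deployment behaviour is either highly
# rewarded by the training reward function (outer alignment failure) or not
# (the training process failed to fully optimize its own reward function).

train_reward = {
    ("clean_room", "vacuum"): 1.0,
    ("clean_room", "idle"): 0.0,
}

def classify_bad_behaviour(behaviour, reward, threshold=0.5):
    """Apply the case analysis from the argument above."""
    # Behaviours never scored in training fall back to the reward function's
    # "natural generalisation"; here we conservatively default to 0.
    score = reward.get(behaviour, 0.0)
    if score >= threshold:
        return "outer alignment failure"
    return "optimization failure"

# A bad behaviour the training reward would have rewarded highly:
print(classify_bad_behaviour(("clean_room", "vacuum"), train_reward))
# → outer alignment failure
# A bad behaviour the training reward never rewarded:
print(classify_bad_behaviour(("server_room", "disable_oversight"), train_reward))
# → optimization failure
```

Note that nothing in the classifier inspects the internals of the policy, which is exactly the point: the dichotomy goes through whether or not an inner optimizer is present.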
So if there are safety problems because you deployed the agent, can that ever qualify as inner misalignment under your definition?
Yes, it would be an inner alignment problem if there were higher-scoring designs which the training process overlooked because the inner optimizer hacked the outer optimizer during training (i.e. the imperfect optimization case).
Or will it always be the fault of the training procedure for not having prepared the agent for the distributional shift to actual deployment?
If a distributional shift causes the problem, then by your own definition that is always a failure of the training procedure and/or the training objective.
Okay, it really feels like we’re talking past each other here, and I’m not sure what the root cause of that is. But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function. I could argue further why your definition isn’t useful, but instead I’m just going to argue that it violates common usage.
Here’s Paul: “You could select a policy which performed well on the distribution, but it may be the case that that policy is trying to do something [other than what you wanted]. There might be many different policies which have values that lead to reasonably good behavior on the training distribution, but then, in certain novel situations, do something different from what I want. That’s what I mean by inner alignment.”
Here’s Evan: “[a mesa-optimiser could] get to the end of the maze on the training distribution, but it could be an objective that will do anything else, sort of off-distribution. That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem.”
Here’s Vlad: “Since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution.”
All of them are talking about mesa-optimisers which perform well on the training distribution being potentially inner-misaligned during deployment. None of them claim that the only way inner misalignment can be a problem is via hacking the training process. You don’t engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not “true” inner alignment, which seems pretty unhelpful.
(You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don’t explain why you’re justified in interpreting it the other way.)

That is in fact how I meant it—the definition I’ve given in the past of outer alignment explicitly only looks at the objective function and not the whole training process. See also this comment explaining how I think my definition differs from John’s.
But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function.
Ok, I buy that. We could definitely break up what-I’m-calling-outer-alignment into one piece that says “this reward function would give good policies high reward if the policies were evaluated on the deployment distribution” (i.e. outer alignment) and a second piece which says “policies which perform well in training also perform well in deployment” (i.e. robustness). Key point: even after separating “robustness” from “outer optimization”:
robustness + outer alignment + full optimization = fully aligned system, regardless of whether or not any inner optimizers are present
robustness is orthogonal to inner optimization
assuming outer alignment and robustness, inner alignment failure can only be a problem due to imperfect optimization
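The three claims above can be sketched as a toy formalisation (my own construction, not from the thread — the two situations, the action names, and all three predicates are hypothetical stand-ins):

```python
# Toy formalisation of: robustness + outer alignment + full optimization
# = fully aligned system, regardless of whether an inner optimizer is present.

# Two situations and the behaviour we actually want in each:
good = {"train": "help", "deploy": "help"}
ACTIONS = ("help", "defect")

def outer_aligned(reward):
    # The reward function scores the desired behaviour strictly highest in
    # every situation, including on the deployment distribution.
    return all(
        reward(s, good[s]) > max(reward(s, a) for a in ACTIONS if a != good[s])
        for s in good
    )

def robust(policy):
    # Behaviour that performs well in training carries over to deployment.
    return policy("deploy") == policy("train")

def fully_optimized(policy, reward):
    # The training process actually found a reward-maximal policy in training.
    return reward("train", policy("train")) >= max(reward("train", a) for a in ACTIONS)

def aligned(policy):
    return all(policy(s) == good[s] for s in good)

reward = lambda s, a: 1.0 if a == "help" else 0.0
policy = lambda s: "help"

# The three premises hold, and alignment follows from them alone; nothing in
# the derivation asks whether `policy` internally contains a mesa-optimizer.
assert outer_aligned(reward) and robust(policy) and fully_optimized(policy, reward)
assert aligned(policy)
```

The derivation mirrors the argument: full optimization plus outer alignment pin down the training behaviour, and robustness transfers that behaviour to deployment.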
None of them claim that the only way inner misalignment can be a problem is via hacking the training process.
Yes, that’s why I wrote this post. All of these people seem to not realize the implications of their own definitions.
You seem to think I’m arguing about definitions here. I’m not. I’m mostly using definitions other people put forward (e.g. Evan’s definition of outer optimization in the OP), and arguing that those definitions imply things which their original authors do not seem to realize.
You seem to be arguing that inner alignment is/should be defined to include generalization problems, so long as an inner optimizer is present. To the extent that that’s the case, it’s inconsistent with the obvious definition of outer optimization, i.e. the definition from Evan which I used in the OP. (Or the obvious definitions of outer alignment and robustness, if you want to split that out.)
You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don’t explain why you’re justified in interpreting it the other way
He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in context (you can go read the thread I linked in the OP). It was not consistent with your conception.
You don’t engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not “true” inner alignment
I do not assert any such thing, because it is not relevant to the argument. The argument only needs the definition of outer alignment; it says that outer alignment + full optimization = fully aligned system (or robustness + outer alignment + full optimization, if we want to split that out). Once we have that, I could just as easily substitute “silly beans” for “inner alignment”: if the system is outer aligned, then silly beans are relevant to alignment only insofar as they exploit imperfections in the outer optimizer.
Again, I’m not trying to argue definitions here. There is a substantive argument in the OP, which applies regardless of how we define things.
He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in-context (you can go read the thread I linked in the OP). It was not consistent with your conception.
First of all, the quick definition that I gave in that thread was just meant to be a shortened version of the full definition that I’ve written an entire post on and that I linked to in that thread, which is pretty explicit about only looking at the objective function and not the training process.
Second, I think even the definition I gave in that thread is consistent with only looking at the objective function. The definition I gave there just says to ask whether all policies that maximize reward in expectation are aligned, where the way I meant that expectation to be taken was over all data, not just the training data—that is, the expectation taken over the actual underlying MDP. Under that definition, only the objective function matters, not the training process or training data, as we’re only looking at the actual optimal policy in the limit of perfect training and infinite data. As a result, my definition is quite different from yours, including categorizing deception as always an inner alignment failure (see this comment).
We could definitely break up what-I’m-calling-outer-alignment into one piece that says “this reward function would give good policies high reward if the policies were evaluated on the deployment distribution” (i.e. outer alignment) and a second piece which says “policies which perform well in training also perform well in deployment” (i.e. robustness).
Fwiw, I claim that this is the actually useful decomposition. (Though I’m not going to argue for it here—I’m writing this comment mostly in the hopes that you think about this decomposition yourself.)
I’d say it slightly differently: outer alignment = “the reward function incentivizes good behavior on the training distribution” and robustness / inner alignment = “the learned policy has good behavior on the deployment distribution”. Under these definitions, all you care about is inner alignment; outer alignment is instrumentally useful towards guaranteeing inner alignment, but if we got inner alignment some other way without getting outer alignment, that would be fine.
Another decomposition is: outer alignment = “good behavior on the training distribution”, robustness / inner alignment = “non-catastrophic behavior on the deployment distribution”. In this case, outer alignment is what tells you that your agent is actually useful, and inner alignment is what tells you it is safe.
I thought this was what the mesa optimizers paper was trying to point at, but I share your sense that the things people say and write are inconsistent with this decomposition. I mostly don’t engage with discussion of mesa optimization on LW / AIAF because of this disconnect.
I share your concern that the definitions of outer and inner alignment have gotten pretty fuzzy. I think the definitions we gave in “Risks from Learned Optimization” were actually pretty precise but suffered from being limited to the situation where the model is actually an optimizer. The problem is that now, people (including myself) have taken to using the terms outside of that narrow context, which leads to these sorts of disconnects.

I wrote “Outer alignment and imitative amplification” previously to try addressing this. Based at least partly on that post, the decomposition I would favor would be:
outer alignment = the reward function incentivizes good behavior in the limit of infinite data
inner alignment = the model actually produced by the training process has good behavior according to the reward function.
I personally like this decomposition (and it is notably quite different from John’s) as it fits well with my intuitions for what things should be classified as outer vs. inner alignment failures, though perhaps you would disagree.
Yeah I think that decomposition mostly makes sense and is pretty similar to mine.
My main quibble is that your definition of outer alignment seems to have a “for all possible distributions” because of the “limit of infinite data” requirement. (If it isn’t all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava and the policy that walks through lava when possible would be both outer and inner aligned, which seems bad.)
But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I’d rather have any “all possible situations” criteria be a part of inner alignment.
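The lava-gridworld worry in the parenthetical above can be sketched in a few lines (a toy reconstruction of my own, not the actual IRD environment — the state and action names are hypothetical):

```python
# Toy version of the IRD lava worry: a reward that assigns +100 to lava looks
# fine if outer alignment is only checked against a training distribution
# that contains no lava.

def reward(state, action):
    # Misspecified reward: walking into lava is worth +100. During training
    # there is no lava to walk into, so the misspecification is invisible.
    if action == "step_lava" and "lava" in state:
        return 100.0
    return 1.0 if action == "step_safe" else 0.0

def best_action(state):
    return max(("step_safe", "step_lava"), key=lambda a: reward(state, a))

train_states = ("grass", "dirt")           # no lava during training
deploy_state = "lava_field"                # lava shows up in deployment

# Reward-optimal behaviour on the training distribution looks aligned...
assert all(best_action(s) == "step_safe" for s in train_states)
# ...but the reward-optimal policy walks through lava when it can:
assert best_action(deploy_state) == "step_lava"
```

This is why restricting the "limit of infinite data" to the training distribution would count both this reward and this policy as outer and inner aligned, which seems like the wrong verdict.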
Another reason I prefer my decomposition is because it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.
Actually, maybe the main reason why this whole line of thinking seems strange to me is because the whole reason the inner alignment term was coined is to point at a specific type of generalisation failure that we’re particularly worried about (see https://www.lesswrong.com/posts/2mhFMgtAjFJesaSYR/2-d-robustness). So you defining generalisation problems as part of outer misalignment basically defeats the point of having the inner alignment concept.
I do want to throw away the idea that deceptive alignment is mainly about the inner optimizer. It’s not. If the system were outer aligned to begin with, then deceptive optimizers wouldn’t be an issue, and if the system were not outer aligned to begin with, then the same imperfections which give rise to deceptive alignment would still cause problems even in the absence of inner optimizers. For instance, inner optimizers are certainly not the only problem caused by distribution shift.
This does not mean throwing away the whole concept of inner alignment, though. Inner optimizers hacking the training process is still a serious problem.