Okay, it really feels like we’re talking past each other here, and I’m not sure what the root cause of that is. But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function. I could argue further why your definition isn’t useful, but instead I’m just going to argue that it violates common usage.
Here’s Paul: “You could select a policy which performed well on the distribution, but it may be the case that that policy is trying to do something [other than what you wanted]. There might be many different policies which have values that lead to reasonably good behavior on the training distribution, but then, in certain novel situations, do something different from what I want. That’s what I mean by inner alignment.”
Here’s Evan: “[a mesa-optimiser could] get to the end of the maze on the training distribution, but it could be an objective that will do anything else, sort of off-distribution. That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem.”
Here’s Vlad: “Since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution.”
All of them are talking about mesa-optimisers which perform well on the training distribution being potentially inner-misaligned during deployment. None of them claim that the only way inner misalignment can be a problem is via hacking the training process. You don’t engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not “true” inner alignment, which seems pretty unhelpful.
(You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don’t explain why you’re justified in interpreting it the other way).
(You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don’t explain why you’re justified in interpreting it the other way).
That is in fact how I meant it—the definition I’ve given in the past of outer alignment explicitly only looks at the objective function and not the whole training process. See also this comment explaining how I think my definition differs from John’s.
But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function.
Ok, I buy that. We could definitely break up what-I’m-calling-outer-alignment into one piece that says “this reward function would give good policies high reward if the policies were evaluated on the deployment distribution” (i.e. outer alignment) and a second piece which says “policies which perform well in training also perform well in deployment” (i.e. robustness). Key point: even after separating “robustness” from “outer optimization”:
robustness + outer alignment + full optimization = fully aligned system, regardless of whether or not any inner optimizers are present
robustness is orthogonal to inner optimization
assuming outer alignment and robustness, inner alignment failure can only be a problem due to imperfect optimization
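Spelling that three-part claim out slightly more formally (a loose sketch in notation nobody in the thread actually uses, with $R$ the reward function, $D_{\text{train}}$ and $D_{\text{deploy}}$ the training and deployment distributions, $\hat{\pi}$ the policy training returns, and “outer alignment” read in the direction that makes the argument go through, i.e. high deployment reward implies genuinely good behavior):

$$\hat{\pi} \in \arg\max_{\pi} \mathbb{E}_{D_{\text{train}}}[R(\pi)] \qquad \text{(full optimization)}$$
$$\mathbb{E}_{D_{\text{train}}}[R(\pi)] \text{ high} \;\Rightarrow\; \mathbb{E}_{D_{\text{deploy}}}[R(\pi)] \text{ high} \qquad \text{(robustness)}$$
$$\mathbb{E}_{D_{\text{deploy}}}[R(\pi)] \text{ high} \;\Rightarrow\; \pi \text{ is aligned} \qquad \text{(outer alignment)}$$
$$\therefore\; \hat{\pi} \text{ is aligned}$$

No step in that chain refers to whether $\hat{\pi}$ happens to contain an inner optimizer.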
None of them claim that the only way inner misalignment can be a problem is via hacking the training process.
Yes, that’s why I wrote this post. All of these people seem to not realize the implications of their own definitions.
You seem to think I’m arguing about definitions here. I’m not. I’m mostly using definitions other people put forward (e.g. Evan’s definition of outer optimization in the OP), and arguing that those definitions imply things which their original authors do not seem to realize.
You seem to be arguing that inner alignment is/should be defined to include generalization problems, so long as an inner optimizer is present. To the extent that that’s the case, it’s inconsistent with the obvious definition of outer optimization, i.e. the definition from Evan which I used in the OP. (Or the obvious definitions of outer alignment and robustness, if you want to split that out.)
You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don’t explain why you’re justified in interpreting it the other way
He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in-context (you can go read the thread I linked in the OP). It was not consistent with your conception.
You don’t engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not “true” inner alignment
I do not assert any such thing, because it is not relevant to the argument. The argument only needs the definition of outer alignment; it says that outer alignment + full optimization = fully aligned system (or robustness + outer alignment + full optimization, if we want to split that out). Once we have that, I could just as easily substitute “silly beans” for “inner alignment”: if the system is outer aligned, then silly beans are relevant to alignment only insofar as they exploit imperfections in the outer optimizer.
Again, I’m not trying to argue definitions here. There is a substantive argument in the OP, which applies regardless of how we define things.
He gave a mathematical definition, and the meaning of the expectation operator was pretty clear in-context (you can go read the thread I linked in the OP). It was not consistent with your conception.
First of all, the quick definition that I gave in that thread was just meant to be a shortened version of the full definition that I’ve written an entire post on and that I linked to in that thread, which is pretty explicit about only looking at the objective function and not the training process.
Second, I think even the definition I gave in that thread is consistent with only looking at the objective function. The definition I gave there just says to ask whether all policies that maximize reward in expectation are aligned, where the way I meant that expectation to be taken was over all data, not just the training data—that is, the expectation taken over the actual underlying MDP. Under that definition, only the objective function matters, not the training process or training data, as we’re only looking at the actual optimal policy in the limit of perfect training and infinite data. As a result, my definition is quite different from yours, including categorizing deception as always an inner alignment failure (see this comment).
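In symbols (a gloss of the definition just described, not notation from the linked post): writing $M$ for the true underlying MDP rather than the empirical training distribution, the criterion is roughly

$$R \text{ is outer aligned} \;\iff\; \text{every } \pi^{*} \in \arg\max_{\pi} \, \mathbb{E}_{\tau \sim \pi,\, M}\!\left[R(\tau)\right] \text{ is aligned},$$

so the quantity being maximized depends only on $R$ and $M$, not on the training process or on any finite sample of training data.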
We could definitely break up what-I’m-calling-outer-alignment into one piece that says “this reward function would give good policies high reward if the policies were evaluated on the deployment distribution” (i.e. outer alignment) and a second piece which says “policies which perform well in training also perform well in deployment” (i.e. robustness).
Fwiw, I claim that this is the actually useful decomposition. (Though I’m not going to argue for it here—I’m writing this comment mostly in the hopes that you think about this decomposition yourself.)
I’d say it slightly differently: outer alignment = “the reward function incentivizes good behavior on the training distribution” and robustness / inner alignment = “the learned policy has good behavior on the deployment distribution”. Under these definitions, all you care about is inner alignment; outer alignment is instrumentally useful towards guaranteeing inner alignment, but if we got inner alignment some other way without getting outer alignment, that would be fine.
Another decomposition is: outer alignment = “good behavior on the training distribution”, robustness / inner alignment = “non-catastrophic behavior on the deployment distribution”. In this case, outer alignment is what tells you that your agent is actually useful, and inner alignment is what tells you it is safe.
I thought this was what the mesa optimizers paper was trying to point at, but I share your sense that the things people say and write are inconsistent with this decomposition. I mostly don’t engage with discussion of mesa optimization on LW / AIAF because of this disconnect.
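Written side by side (a gloss, with $\hat{\pi}$ the learned policy and notation nobody in the thread uses), the two decompositions above are roughly:

$$\text{(1)}\quad \text{outer: } \arg\max_{\pi} \mathbb{E}_{D_{\text{train}}}[R(\pi)] \subseteq \{\text{good policies}\}; \qquad \text{inner: } \hat{\pi} \text{ behaves well on } D_{\text{deploy}}.$$
$$\text{(2)}\quad \text{outer: } \hat{\pi} \text{ behaves well on } D_{\text{train}} \text{ (useful)}; \qquad \text{inner: } \hat{\pi} \text{ is non-catastrophic on } D_{\text{deploy}} \text{ (safe)}.$$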
I share your concern that the definitions of outer and inner alignment have gotten pretty fuzzy. I think the definitions we gave in “Risks from Learned Optimization” were actually pretty precise but suffered from being limited to the situation where the model is actually an optimizer. The problem is that now, people (including myself) have taken to using the terms outside of that narrow context, which leads to these sorts of disconnects.
I wrote “Outer alignment and imitative amplification” previously to try addressing this. Based at least partly on that post, the decomposition I would favor would be:
outer alignment = the reward function incentivizes good behavior in the limit of infinite data
inner alignment = the model actually produced by the training process has good behavior according to the reward function.
I personally like this decomposition (and it is notably quite different than John’s) as it fits well with my intuitions for what things should be classified as outer vs. inner alignment failures, though perhaps you would disagree.
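As a rough gloss (again notation of mine, not from the linked post):

$$\text{outer: every } \pi^{*} \in \arg\max_{\pi}\,\mathbb{E}_{\infty\text{-data}}[R(\pi)] \text{ is aligned}; \qquad \text{inner: the trained model } \hat{\pi} \text{ actually scores well under } R.$$

Note that, unlike the earlier outer alignment / robustness split, the training-versus-deployment distinction never appears explicitly here; whatever work it does is folded into the “limit of infinite data” clause on the outer side, which is where the quibble below comes in.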
Yeah I think that decomposition mostly makes sense and is pretty similar to mine.
My main quibble is that your definition of outer alignment seems to build in a “for all possible distributions” quantifier, because of the “limit of infinite data” requirement. (If it isn’t all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava would count as outer aligned, and the policy that walks through lava when possible would count as inner aligned, which seems bad.)
But then, when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I’d rather have any “all possible situations” criterion be part of inner alignment.
Another reason I prefer my decomposition is that it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.
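To make the IRD lava-gridworld point above concrete, here is a small self-contained sketch. The maps, reward numbers, and brute-force “optimizer” are illustrative assumptions of mine, not details taken from the Inverse Reward Design paper: the idea is just that a proxy reward assigning +100 to lava is indistinguishable from the true reward on a training map containing no lava, so any purely training-distribution criterion passes it, while the policy optimal for it walks straight through lava once lava appears at deployment.

```python
# Illustrative sketch only: maps, rewards, and the exhaustive "optimizer"
# are assumptions for demonstration, not the actual IRD setup.

TRAIN_MAP = [
    "S..G",
    "....",
]
DEPLOY_MAP = [
    "S..G",
    ".LL.",
]

# Reward received on entering a cell. The proxy reward agrees with the true
# reward everywhere except on lava, which never appears in training.
TRUE_REWARD = {".": -1, "S": -1, "G": 10, "L": -100}
PROXY_REWARD = {".": -1, "S": -1, "G": 10, "L": 100}


def find_cell(grid, ch):
    # Locate the (row, col) of a given character in the map.
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == ch:
                return (r, c)
    raise ValueError(f"no cell {ch!r} in grid")


def best_path(grid, reward):
    # Exhaustively search simple paths from S to G and keep the one with the
    # highest total reward. On a tiny fully-known map this stands in for
    # "the optimal policy under this reward function" (perfect optimization).
    start = find_cell(grid, "S")
    rows, cols = len(grid), len(grid[0])
    best = (float("-inf"), None)

    def dfs(pos, visited, ret, path):
        nonlocal best
        if grid[pos[0]][pos[1]] == "G":  # episode ends at the goal
            if ret > best[0]:
                best = (ret, path)
            return
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dr, pos[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and nxt not in visited:
                cell = grid[nxt[0]][nxt[1]]
                dfs(nxt, visited | {nxt}, ret + reward[cell], path + [nxt])

    dfs(start, {start}, 0, [start])
    return best


if __name__ == "__main__":
    maps = [("training map (no lava)", TRAIN_MAP), ("deployment map (lava)", DEPLOY_MAP)]
    rewards = [("true reward", TRUE_REWARD), ("proxy reward, +100 lava", PROXY_REWARD)]
    for map_name, grid in maps:
        for reward_name, reward in rewards:
            ret, path = best_path(grid, reward)
            print(f"{map_name:24s} | optimal under {reward_name:24s}: return {ret:4d}, path {path}")
```

Running this gives identical optimal behavior for both rewards on the training map and divergent behavior on the deployment map, where the proxy-optimal path collects the lava bonus while the true reward rates that same path catastrophically; that is the failure mode the “for all possible distributions” / “limit of infinite data” clause is meant to rule out.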