Planned summary:
So far, most <@uses of impact formalizations@> don’t help with inner alignment, because we simply add impact to the (outer) loss function. This post suggests that impact formalizations could also be adapted to verify whether an optimization algorithm is _value-neutral_ -- that is, no matter what objective you apply it towards, it provides approximately the same benefit. In particular, <@AUP@> measures the _expectation_ of the distribution of changes in attainable utilities for a given action. You could get a measure of the value-neutrality of an action by instead computing the _standard deviation_ of this distribution, since that measures how different the changes in utility are. (Evan would use policies instead of actions, but conceptually that’s a minor difference.) Verifying value-neutrality could be used to ensure that the <@strategy-stealing assumption@> is true.
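To make the expectation-versus-standard-deviation contrast concrete, here is a minimal sketch (not from the post; the Q-function interface and names like `q_values` and `baseline_action` are hypothetical, and the actual AUP penalty uses absolute values and rescaling that are simplified away here):

```python
import numpy as np

def attainable_utility_changes(q_values, state, action, baseline_action):
    """One number per auxiliary utility function: how much taking `action`
    (rather than the baseline, e.g. a no-op) changes that utility's
    attainable value.  `q_values` is assumed to be a list of Q-functions,
    one per auxiliary utility; how they are obtained is left open."""
    return np.array([q(state, action) - q(state, baseline_action)
                     for q in q_values])

def aup_style_impact(changes):
    # Expectation of the distribution of attainable-utility changes
    # (the real AUP penalty also takes absolute values and rescales,
    # which is omitted in this sketch).
    return np.mean(changes)

def value_neutrality(changes):
    # Proposed alternative statistic: the standard deviation of the same
    # distribution.  A low value means the action changes every auxiliary
    # utility by roughly the same amount, i.e. it is close to value-neutral.
    return np.std(changes)
```

(Evan’s version would evaluate policies rather than single actions, but the expectation-versus-standard-deviation contrast is the same.)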
Planned opinion:
I continue to be confused about the purpose of the strategy-stealing assumption, so I don’t have a strong opinion about the importance of value-neutrality verification. I do think that the distribution of changes to attainable utilities is a powerful mathematical object, and it makes sense that there are other properties of interest that involve analyzing it.
I think this comment by Wei Dai does a good job of clarifying what’s going on with the strategy-stealing assumption. I know Wei Dai was also confused about the purpose of the strategy-stealing assumption for a while until he wrote that comment.
I understand the point made in that comment; the part I’m confused about is why the two subpoints in that comment are true:
> If “strategy-stealing assumption” is true, we can get most of what we “really” want by doing strategy-stealing. (Example of how this can be false: (Logical) Time is of the essence)
> It’s not too hard to make “strategy-stealing assumption” true.
Like… why? If we have unaligned AI but not aligned AI, then we have failed to make the strategy-stealing assumption true. If we do succeed in building aligned AI, why are we worried about unaligned AI, since we presumably won’t deploy it (and so strategy-stealing is irrelevant)? I could imagine that some people mistakenly think that unaligned AI is actually aligned and so build it, or that some malicious actors build AI aligned with them, and the strategy-stealing assumption means that this is basically fine as long as they don’t start out with too many resources, but this doesn’t seem like the mainline scenario to worry about: it seems much more relevant whether we can align AI or not.
> I could imagine that some people mistakenly think that unaligned AI is actually aligned and so build it, or that some malicious actors build AI aligned with them, and the strategy-stealing assumption means that this is basically fine as long as they don’t start out with too many resources, but this doesn’t seem like the mainline scenario to worry about: it seems much more relevant whether we can align AI or not.
That’s not the scenario I’m thinking about when I think about strategy-stealing. I mentioned this a bit in this comment, but when I think about strategy-stealing I’m generally thinking about it as an alignment property that may or may not hold for a single AI: namely, the property that the AI is equally good at optimizing all of the different things we might want it to optimize. If this property doesn’t hold, then you get something like Paul’s going out with a whimper where our easy-to-specify values win out over our other values.
Furthermore, I agree with you that I generally expect basically all early AGIs to have similar alignment properties, though I think you sweep a lot under the rug when you say they’ll all either be “aligned” or “unaligned.” In particular, I generally imagine producing an AGI that is corrigible in that it’s trying to do what you want, but isn’t necessarily fully aligned in the sense of figuring out what you want for you. In such a case, it’s very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.
Also, value-neutrality verification isn’t just about strategy-stealing: it’s also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.
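As an illustrative aside (not from either comment; every name below is hypothetical), one way to picture this separation is to treat the system as an optimization procedure that takes its objective as an explicit argument, so that a value-neutrality check can be run on the procedure itself across a reference set of objectives:

```python
import numpy as np

def optimizer_value_neutrality(optimize, objectives, initial_state, budget):
    """Hypothetical check on an optimization procedure, separated from any
    particular objective.  `optimize(objective, state, budget)` is assumed
    to return a new state; each `objective(state)` returns a scalar utility.
    Comparing raw benefits assumes the objectives have been normalized to a
    comparable scale."""
    benefits = []
    for objective in objectives:
        final_state = optimize(objective, initial_state, budget)
        benefits.append(objective(final_state) - objective(initial_state))
    benefits = np.array(benefits)
    # Low standard deviation: the procedure is roughly equally good at
    # optimizing every objective in the reference set (value-neutral, and
    # compatible with strategy-stealing in that sense).  High standard
    # deviation: it systematically favors some objectives over others.
    return np.std(benefits)
```

Whether the optimization/objective split can actually be recovered from a learned model is exactly the crux raised in the reply below.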
Hmm, I somehow never saw this reply, sorry about that.
> you get something like Paul’s going out with a whimper where our easy-to-specify values win out over our other values [...] it’s very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.
Why can’t we tell it not to overoptimize the aspects that it understands until it figures out the other aspects?
> value-neutrality verification isn’t just about strategy-stealing: it’s also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.
As you (now) know, my main crux is that I don’t expect to be able to cleanly separate optimization and objectives, though I also am unclear whether value-neutral optimization is even a sensible concept taken separately from the environment in which the agent is acting (see this comment).