I really like this view. An additional frame of interest: signed neutrality (just remove the absolute value) as a measure of opportunity-cost propensity. That is, highly non-neutral policies lead to polarizing opportunity costs. For example, consider a maze in which half of your possible destinations lie through the one-way door on the left, and half through the one-way door on the right. Any policy that actually goes anywhere is highly “polarizing” / non-neutral.
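As a minimal sketch of this frame (the notation $U_v$, the no-op baseline, and the distribution $Y$ over values are assumptions here, not necessarily the post's exact definitions), write the per-value gain of a policy $\pi$ relative to doing nothing as

$$g_v(\pi) \;=\; U_v(\pi) - U_v(\text{no-op}), \qquad v \sim Y.$$

The absolute-value question is whether you aggregate $|g_v(\pi)|$ or the signed $g_v(\pi)$: in the maze, any policy that commits to a door has $g_v \approx +1$ for the values whose destinations lie behind that door and $g_v \approx -1$ for the rest (relative to waiting at the fork), so the magnitudes look nearly identical across values while the signed gains split into two opposed camps; that split is the polarized opportunity cost.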
I agree that this moment of neutrality is also a facet of the “power/impact” phenomenon. However, I’m not sure I follow this part:
We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption.
Avoiding deactivation is good for almost all goals, so there isn’t much stdev under almost any Y? Or maybe you’re using “objective impact” in a slightly different sense here? In any case, I think I get what you’re pointing at.
You’re right, I think the absolute value might actually be a problem—you want the policy to help or hurt all values equally relative to no-op, not hurt some and help others. I just edited the post to reflect that.
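A toy calculation of the problem (assuming, per the stdev framing above, that the measure is roughly the spread of the per-value gains $g_v(\pi) = U_v(\pi) - U_v(\text{no-op})$ over $v \sim Y$; the post's exact definition may differ):

$$g = (+10, -10): \quad \operatorname{std}\!\big(|g|\big) = 0 \ \text{(looks neutral)}, \qquad \operatorname{std}(g) = 10 \ \text{(flagged as non-neutral)}$$

$$g = (+10, +10): \quad \operatorname{std}\!\big(|g|\big) = \operatorname{std}(g) = 0 \ \text{(neutral either way, e.g. avoiding deactivation)}$$

With the absolute value inside, a policy that helps some values exactly as much as it hurts others scores the same as one that advantages every value equally, which is the distinction the edit is meant to recover.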
As for the connection between neutrality and objective impact, I think this is related to a confusion that Wei Dai pointed out, which is that I was sort of waffling between two different notions of strategy-stealing, those being:
1. strategy-stealing relative to all the agents present in the world (i.e. is it possible for your AI to steal the strategies of other agents in the world), and
2. strategy-stealing relative to a single AI (i.e. if that AI were copied many times and put in service of many different values, would it advantage some over others).
If you believe that most early AGIs will be quite similar in their alignment properties (as I generally do, since I believe that copy-and-paste is quite powerful and will generally be preferred over designing something new), then these two notions of strategy-stealing match up, which was why I was waffling between them. However, conceptually they are quite distinct.
In terms of the connection between neutrality and objective impact: I think that in that passage I was thinking about strategy-stealing in the sense of notion 1, whereas in most of the rest of the post I was thinking about it in the sense of notion 2. Under notion 1, objective impact is about changing the distribution of resources among all the agents in the world.