This point may seem obvious, but cardinality inequality is insufficient in general. The set copy relation is required for our results.
Could you give a toy example of this being insufficient (I’m assuming the “set copy relation” is the “B contains n of A” requirement)?
How does the “B contains n of A” requirement bear on the existential risk arguments? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large numbers of outcomes).
Could you give a toy example of this being insufficient (I’m assuming the “set copy relation” is the “B contains n of A” requirement)?
A := {(1, 0, 0)}, B := {(0, 0.3, 0.7), (0, 0.7, 0.3)}
Less opaquely, see the technical explanation for this counterexample, where going right leads to two trajectories and going up leads to a single one.
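Concretely, here’s a quick sanity check on that example (with simplifying assumptions: an option’s value is the dot product of its visit distribution with the reward vector, rewards are sampled iid uniform over states, and the B_copy set below is just an illustrative comparison that really does contain two copies of A under state permutations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Option sets as rows of state-visit distributions over 3 states.
A = np.array([[1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.3, 0.7],
              [0.0, 0.7, 0.3]])          # |B| = 2 > 1 = |A|, but B holds no copy of A
B_copy = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])     # genuinely contains 2 copies of A

rewards = rng.uniform(size=(1_000_000, 3))  # iid uniform reward over the 3 states

def frac_B_at_least_A(B_opts, A_opts, rewards):
    """Fraction of sampled rewards where the best option in B matches or beats
    the best option in A, valuing an option by (visit distribution) . (reward)."""
    v_B = (rewards @ B_opts.T).max(axis=1)
    v_A = (rewards @ A_opts.T).max(axis=1)
    return (v_B >= v_A).mean()

print(frac_B_at_least_A(B, A, rewards))       # ~0.57
print(frac_B_at_least_A(B_copy, A, rewards))  # ~0.67
```

So for this example, |B| > |A| on its own only gets B preferred about 57% of the time, rather than the 2/3 share you see when B really does contain two copies of A.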
How does the “B contains n of A” requirement bear on the existential risk arguments? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large numbers of outcomes).
For this, I think we need to zoom out to a causal DAG (w/ choice nodes) picture of the world, over some reasonable abstractions. It’s just too unnatural to pick out deception subgraphs in an MDP, as far as I can tell, but maybe there’s another version of the argument.
If the AI cares about things-in-the-world, then if it were a singleton it could set many nodes to desired values independently. For example, the nodes might represent variable settings for different parts of the universe—what’s going on in the asteroid belt, in Alpha Centauri, etc.
But if it has to work with other agents (or, heaven forbid, be subjugated by them), it has fewer degrees of freedom in what-happens-in-the-universe. I think you can map copies of the “low control” configurations into the “high control” configurations several times over; that is, the high-control option set contains multiple copies of the low-control one, in the sense of the set copy relation. (I think it should be possible to make “control” precise in a way that maps fairly neatly back onto POWER-as-average-optimal-value.)
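Here’s a quick toy sketch of what I mean (everything here is simplified and of my own construction: outcomes are just settings of a few binary nodes plus a label for which future obtains, and reward is drawn independently per outcome):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

NODES = 4       # binary "things-in-the-world" nodes (asteroid belt, Alpha Centauri, ...)
KEPT = 2        # nodes the low-control agent still gets to set itself
FIXED = (0, 1)  # values the other agents impose on the surrendered nodes

# Mutually exclusive futures: an outcome is (which future, node settings).
high = [("singleton", s) for s in itertools.product((0, 1), repeat=NODES)]
low = [("subjugated", FIXED + s) for s in itertools.product((0, 1), repeat=KEPT)]

# The high-control outcome set holds one copy of the low-control set for every
# possible setting of the surrendered nodes.
copies = len(high) // len(low)
print(copies)  # 4

# Draw reward independently per outcome and estimate how often the best
# high-control outcome matches or beats the best low-control outcome.
trials = 200_000
wins = 0
for _ in range(trials):
    best_high = rng.uniform(size=len(high)).max()
    best_low = rng.uniform(size=len(low)).max()
    wins += best_high >= best_low
print(wins / trials)  # ~0.8, i.e. roughly copies / (copies + 1)
```

The ~0.8 here is just copies/(copies + 1) for four copies; surrendering more nodes means more copies, and a larger share of reward functions favoring the high-control branch.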
So this implies a push for “control.” One way to get control is manipulation or deception or other trickery, and so deception is one possible way this instrumental convergence “prophecy” could be fulfilled.