If the agent flips the first pixel, it’s locked into a single trajectory. None of its actions matter anymore.

But if the agent flips the second pixel – this may be suboptimal for a given utility function, but the agent still has lots of choices remaining. In fact, it can still induce (n×n)^(T−1) observation histories. If n=100 and T=50, then that’s (100×100)^49 = 10^196 observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.
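As a quick sanity check on that arithmetic (a minimal sketch; the values of n and T and the assumption of n×n inducible observations per remaining timestep are taken from the example above):

```python
import math

# Assumed setup from the example: at each of the remaining T-1 timesteps,
# the agent can induce any of n*n distinct observations, so the number of
# reachable observation histories is (n*n)**(T-1).
n, T = 100, 50
num_histories = (n * n) ** (T - 1)

# (100*100)**49 == (10**4)**49 == 10**196
print(math.log10(num_histories))  # -> 196.0
```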
And indeed, we can apply the scaling law for instrumental convergence to conclude that for every u-OH, at least 10^196/(10^196 + 1) of its permuted variants (weakly) prefer flipping the second pixel at t=1 over flipping the first pixel at t=1.
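Put differently, the bound says that at most 1/(10^196 + 1) of the permuted variants can prefer shutdown. A small check with exact rationals (a sketch of the arithmetic only, not of the scaling law itself):

```python
from fractions import Fraction

# The bound from the text: at least this fraction of permuted variants
# (weakly) prefers flipping the second pixel at t=1.
lower_bound = Fraction(10**196, 10**196 + 1)

# Complement: the fraction that can prefer "dying" is at most 1/(10^196 + 1).
shutdown_fraction = 1 - lower_bound
assert shutdown_fraction == Fraction(1, 10**196 + 1)
print(float(shutdown_fraction))  # on the order of 1e-196
```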
10^196/(10^196 + 1).
Choose any atom in the universe. Uniformly randomly select another atom in the universe. It’s about 10^117 times more likely that these atoms are the same, than that a utility function incentivizes “dying” instead of flipping pixel 2 at t=1.
(For objectives over the agent’s full observation history, instrumental convergence strength scales exponentially with the complexity of the underlying environment—the environment in question was extremely simple in this case! For different objective classes, the scaling will be linear, but that’s still going to get you far more than 100:1 difficulty, and I don’t think we should privilege such small numbers.)
That part does seem wrong to me, but because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:
> (For objectives over the agent’s full observation history, instrumental convergence strength scales exponentially with the complexity of the underlying environment—the environment in question was extremely simple in this case! For different objective classes, the scaling will be linear, but that’s still going to get you far more than 100:1 difficulty, and I don’t think we should privilege such small numbers.)