I think the problem is not quite so binary as “good/bad”. It seems to be more a matter of effective vs. ineffective and beneficial vs. harmful.
The problem is that effective plans are more likely to be harmful. We as a species have already done a lot of optimization along a lot of dimensions that are important to us, and the most highly effective plans almost certainly have greater side effects that make things worse along dimensions we aren’t explicitly telling the optimizer to care about.
It’s not so much that there’s a direct link between sparsity of effective plans and likelihood of bad outcomes, as that more complex problems (especially dealing with the real world) seem more likely to have “spurious” solutions that technically meet all the stated requirements, but aren’t what we actually want. The beneficial effective plans become sparse faster than the harmful effective plans, simply because in a more complex space there are more ways to be unexpectedly harmful than good.
Yes, I think I understand that more powerful optimizers can find more spurious solutions. But the OP seemed to be hypothesizing that you had some way to pick out the spurious from the good solutions, but saying it won’t scale because you have 10^50, not 100, bad solutions for each good one. That’s the part that seems wrong to me.
That part does seem wrong to me. It seems wrong because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:

If the agent flips the first bit, it’s locked into a single trajectory. None of its actions matter anymore.
But if the agent flips the second bit – this may be suboptimal for a utility function, but the agent still has lots of choices remaining. In fact, it still can induce (n×n)^(T−1) observation histories. If n=100 and T=50, then that’s (100×100)^49 = 10^196 observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.
And indeed, we can apply the scaling law for instrumental convergence to conclude that for every u-OH, at least 10^196 / (10^196 + 1) of its permuted variants (weakly) prefer flipping the second pixel at t=1, over flipping the first pixel at t=1.
10^196 / (10^196 + 1).
Choose any atom in the universe. Uniformly randomly select another atom in the universe. It’s about 10^117 times more likely that these atoms are the same, than that a utility function incentivizes “dying” instead of flipping pixel 2 at t=1.
(For objectives over the agent’s full observation history, instrumental convergence strength scales exponentially with the complexity of the underlying environment—the environment in question was extremely simple in this case! For different objective classes, the scaling will be linear, but that’s still going to get you far more than 100:1 difficulty, and I don’t think we should privilege such small numbers.)
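To make the counting concrete, here is a minimal sketch of the arithmetic in the quoted example, assuming (as the quote does) an n×n observation grid with n=100 and T=50 timesteps; the variable names are illustrative, not from the post:

```python
# Minimal sketch of the counting in the quoted example.
# Assumed parameters from the quote: an n-by-n observation grid, T timesteps.
n, T = 100, 50

# Observation histories still reachable after flipping the second pixel at t = 1.
histories = (n * n) ** (T - 1)
assert histories == 10 ** 196          # (100 * 100)^49 = 10^196

# Scaling-law lower bound: at least histories / (histories + 1) of the permuted
# utility-function variants (weakly) prefer flipping pixel 2 over pixel 1.
# Equivalently, at most 1 / (histories + 1) of them prefer "dying".
prefers_dying_at_most = 1 / (histories + 1)
print(f"{prefers_dying_at_most:.1e}")  # ~1.0e-196
```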
Your “harmfulness” criteria will always have some false negative rate.
If you incorrectly classify a harmful plan as beneficial one time in a million, then in the former case (10^50 harmful plans per beneficial one) you’ll get 10^44 plans that look good but are really harmful for every one that really is good. In the latter case (100 harmful plans per beneficial one) you get 10000 plans that are actually good for each one that is harmful.
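Spelling that arithmetic out (the 10^50 and 100 ratios are the two cases from the earlier comment; the names below are just for illustration):

```python
from fractions import Fraction

# How a fixed false-negative rate interacts with the harmful-to-beneficial ratio.
false_negative_rate = Fraction(1, 10 ** 6)   # harmful plan passes review 1 time in a million

# Former case: 10^50 harmful plans per beneficial one.
harmful_that_pass = 10 ** 50 * false_negative_rate
print(harmful_that_pass)                     # 10^44 harmful-but-passing plans per good plan

# Latter case: 100 harmful plans per beneficial one.
harmful_that_pass = 100 * false_negative_rate
print(1 / harmful_that_pass)                 # 10^4 good plans per harmful plan that slips through
```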
This would imply a fixed upper bound on the number of bits you can produce (for instance, a false negative rate of 1 in 128 implies at most 7 bits). But in practice you can produce many more than 7 bits, by double checking your answer, combining multiple sources of information, etc.
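A toy reading of that bound, assuming each check acts as an independent filter (which the reply below disputes):

```python
from math import log2

# Bits of selection available from a single screening step with a given error rate.
error_rate = 1 / 128
print(log2(1 / error_rate))        # 7.0 bits from one check

# If k checks were fully independent, their error rates would multiply,
# so the bits would simply add: k * 7 bits from k checks.
k = 10
combined_error = error_rate ** k
print(log2(1 / combined_error))    # 70.0 bits in the optimistic, independent case
```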
Combining multiple sources of information, double checking, etc. are ways to decrease error probability, certainly. The problem is that they’re not independent. For highly complex spaces, not only does the number of additional checks you need increase super-linearly, but the number of types of checks you need likely also increases super-linearly.
That’s my intuition, at least.
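As a toy illustration of the non-independence point (the shared blind-spot probability below is an invented parameter, purely for illustration):

```python
# Toy model: every check shares a common blind spot that passes harmful plans
# with probability `shared_blind_spot`, so stacking checks cannot push the
# combined error below that floor.
per_check_error = 1 / 128      # error unique to each individual check
shared_blind_spot = 1e-6       # failure mode common to all checks

for k in (1, 5, 20):
    independent_part = per_check_error ** k
    combined_error = shared_blind_spot + (1 - shared_blind_spot) * independent_part
    print(k, combined_error)   # flattens out near 1e-6 as k grows
```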
beneficial effective plans become sparse faster than the harmful effective plans
The constants matter more than the trend here for the question of whether a good plan for a pivotal act, one that sorts out AI risk in the medium term, can be found. Discrimination of good plans only has to improve enough to clear the threshold needed to search over plans effective enough to solve that problem.