Upon further thought, I have another hypothesis about why there seems to be a gap here. You claim here that the distribution is bimodal, but your previous claim (“I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1”) suggests you don’t actually think there’s significant probability on the lower mode; you essentially think it’s unimodal on the “iterative design fails” worlds.
I personally disagree with both the “significant probability on both modes, but not in between” hypothesis, and the “unimodal on iterative design fails” hypothesis, but I think that it’s important to be clear about which you’re defending—e.g. because if you were defending the former, then I’d want to dig into what you thought the first mode would actually look like and whether we could extend it to harder cases, whereas I wouldn’t if you were defending the latter.
Yeah, that’s fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions.
What I actually think is that:
nearly all the probability mass is on worlds where the iterative design loop fails to align AGI, but...
conditional on that being wrong, nearly all the probability mass is on the bits of optimization from the iterative design that ordinary economic/engineering activity does anyway being sufficient to align AGI, i.e. it is very unlikely that adding a few extra bits of qualitatively-similar optimization pressure will make the difference. (“We are unlikely to hit/miss by a little bit” is the more general slogan.)
The second claim would be cruxy if I changed my mind on the first, and requires fewer assumptions, and therefore fewer inductive steps from readers’ pre-existing models.
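To make the “unlikely to hit/miss by a little bit” slogan concrete, here is a toy Monte Carlo sketch. The lognormal distributions, their parameters, and the “few extra bits” margin are all made up for illustration, not claims; the point is only the qualitative shape: when both the bits required and the bits supplied by default are uncertain across a wide range, the band where a few extra bits would flip the outcome carries little probability mass.

```python
# Toy sketch of the "we are unlikely to hit/miss by a little bit" claim.
# Every distribution and number here is an arbitrary stand-in for illustration.
import random

random.seed(0)
N = 100_000

def sample_world():
    # Bits of optimization needed to align AGI: broad uncertainty (arbitrary lognormal).
    needed = random.lognormvariate(5.0, 1.0)
    # Bits supplied by the iterative design that ordinary economic/engineering
    # activity does anyway: also broadly uncertain (arbitrary lognormal).
    supplied = random.lognormvariate(4.5, 1.0)
    return needed, supplied

EXTRA_BITS = 5.0  # the "few extra bits" a marginal intervention might add (arbitrary)

hit = 0
miss_by_little = 0
for _ in range(N):
    needed, supplied = sample_world()
    if supplied >= needed:
        hit += 1                # default iterative design suffices on its own
    elif supplied + EXTRA_BITS >= needed:
        miss_by_little += 1     # falls short, but only by a few bits

print(f"P(default iterative design suffices)   ~ {hit / N:.3f}")
print(f"P(falls short by <= {EXTRA_BITS} bits) ~ {miss_by_little / N:.3f}")
```

With numbers like these, the “falls short by only a few bits” band comes out as a thin slice of probability mass compared to either the clear-success or clear-failure regions, which is the shape the second claim is pointing at.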
In general I think it’s better to reason in terms of continuous variables like “how helpful is the iterative design loop?” rather than binary ones like “does it work or does it fail?”
My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that, conditional on the first claim being wrong, the second is not very action-guiding. E.g. conditional on the first claim being wrong, the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it’s 5% of worlds rather than 50% of worlds.
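To spell out the arithmetic behind that last point (a toy expected-value comparison; the symbols are made up for illustration, not a real model):

$$
\underbrace{P(\text{hit/miss by a little})}_{0.5\ \text{or}\ 0.05} \times \Delta_{\text{margin}}
\quad\text{vs.}\quad
P(\text{other failure modes}) \times \Delta_{\text{other}},
$$

where $\Delta_{\text{margin}}$ and $\Delta_{\text{other}}$ stand for how much a unit of effort improves the outcome in each kind of world. Dropping the probability from 50% to 5% shrinks the left-hand side by a factor of ten, which flips the verdict only if the left-hand side wasn’t already more than ten times the right-hand side to begin with.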
(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment; this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let’s say I were somehow approximately-100% convinced that it’s basically possible for iterative design to produce an aligned AGI. Then I’d expect AI is probably not an X-risk, but I’d still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I’d expect the risk to be dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool; otherwise they’d already be handled by the default iterative design processes.
So I guess at that point I’d be looking at quantitative usefulness of iterative design, rather than binary.
General point: it’s just really hard to get a situation where “do marginally more of the thing we already do lots of by default” is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which handles those problems better.