In general I think it’s better to reason in terms of continuous variables like “how helpful is the iterative design loop” rather than “does it work or does it fail”?
My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that conditional on the first being wrong, then the second is not very action-guiding. E.g. conditional on the first, then the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it’s 5% of worlds rather than 50% of worlds.
(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment—this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let’s say I were somehow approximately-100% convinced that it’s basically possible for iterative design to produce an AI. Then I’d expect AI is probably not an X-risk, but I still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I’d expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool—otherwise they’d already be handled by the default iterative design processes.
So I guess at that point I’d be looking at quantitative usefulness of iterative design, rather than binary.
General point: it’s just really hard to get a situation where “do marginally more of the thing we already do lots of by default” is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles the kind of problems which are relatively poorly handled by default.
In general I think it’s better to reason in terms of continuous variables like “how helpful is the iterative design loop” rather than “does it work or does it fail”?
My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that conditional on the first being wrong, then the second is not very action-guiding. E.g. conditional on the first, then the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it’s 5% of worlds rather than 50% of worlds.
(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment—this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let’s say I were somehow approximately-100% convinced that it’s basically possible for iterative design to produce an AI. Then I’d expect AI is probably not an X-risk, but I still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I’d expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool—otherwise they’d already be handled by the default iterative design processes.
So I guess at that point I’d be looking at quantitative usefulness of iterative design, rather than binary.
General point: it’s just really hard to get a situation where “do marginally more of the thing we already do lots of by default” is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles the kind of problems which are relatively poorly handled by default.