In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn’t RLHF be one of them?
Wrong question. The point is not that RLHF can’t be part of a solution, in such worlds. The point is that working on RLHF does not provide any counterfactual improvement to chances of survival, in such worlds.
Iterative design is something which happens automagically, for free, without any alignment researcher having to work on it. Customers see problems in their AI products, and companies are incentivized to fix them; that’s iterative design from human feedback baked into everyday economic incentives. Engineers notice problems in the things they’re building, open bugs in whatever tracking software they’re using, and eventually fix them; that’s iterative design baked into everyday engineering workflows. Companies hire people to test out their products, see what problems come up, then fix them; that’s iterative design baked into everyday processes. And to a large extent, the fixes will occur by collecting problem-cases and then training them away, because ML engineers already have that affordance; it’s one of the few easy ways of fixing apparent problems in ML systems. That will all happen regardless of whether any alignment researchers work on RLHF.
When I say that “in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF”, that’s what I’m talking about. Problems which RLHF can solve (i.e. problems which are easy for humans to notice and then train away) will already be solved by default, without any alignment researchers working on them. So, there is no counterfactual value in working on RLHF, even in worlds where it basically works.
I think you’re just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it’s not valuable to advance the techniques involved. But there’s a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we’ve done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn’t work, but finding specific failure cases earlier allowed us to develop better techniques.
Yeah, ok, so I am making a substantive claim that the distribution is bimodal. (Or, more accurately, the distribution is wide and work on RLHF only counterfactually matters if we happen to land in a very specific tiny slice somewhere in the middle.) Those “middle worlds” are rare enough to be negligible; it would take a really weird accident for the world to end up such that the iteration cycles provided by ordinary economic/engineering activity would not produce aligned AI, but the extra iteration cycles provided by research into RLHF would produce aligned AI.
Upon further thought, I have another hypothesis about why there seems like a gap here. You claim here that the distribution is bimodal, but your previous claim (“I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1”) suggests you don’t actually think there’s significant probability on the lower mode; rather, you essentially think the distribution is unimodal on the “iterative design fails” worlds.
I personally disagree with both the “significant probability on both modes, but not in between” hypothesis, and the “unimodal on iterative design fails” hypothesis, but I think that it’s important to be clear about which you’re defending—e.g. because if you were defending the former, then I’d want to dig into what you thought the first mode would actually look like and whether we could extend it to harder cases, whereas I wouldn’t if you were defending the latter.
Yeah, that’s fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions.
What I actually think is that:
nearly all the probability mass is on worlds where the iterative design loop fails to align AGI, but...
conditional on that being wrong, nearly all the probability mass is on the bits of optimization supplied by iterative design in the course of ordinary economic/engineering activity being sufficient to align AGI, i.e. it is very unlikely that adding a few extra bits of qualitatively-similar optimization pressure will make the difference. (“We are unlikely to hit/miss by a little bit” is the more general slogan.)
The second claim is the one that would be cruxy if I changed my mind on the first; it also requires fewer assumptions, and therefore fewer inductive steps from readers’ pre-existing models.
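To illustrate the shape of that second claim, here is a toy numerical sketch. The specific numbers (a uniform spread of “bits needed”, 300 default bits, 5 extra bits from RLHF research) are made up purely for illustration; they are not estimates of anything.

```python
# Toy sketch of "we are unlikely to hit/miss by a little bit".
# All numbers are made up for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_worlds = 1_000_000

# Wide uncertainty over how many bits of optimization alignment requires.
bits_needed = rng.uniform(0, 1000, n_worlds)

default_bits = 300  # bits supplied anyway by ordinary economic/engineering iteration
extra_bits = 5      # extra, qualitatively-similar bits from dedicated RLHF research

default_suffices = bits_needed <= default_bits
extra_is_decisive = (bits_needed > default_bits) & (bits_needed <= default_bits + extra_bits)

print(f"P(default iteration suffices)   ~ {default_suffices.mean():.3f}")   # ~0.300
print(f"P(extra RLHF bits are decisive) ~ {extra_is_decisive.mean():.4f}")  # ~0.005
```

The particular numbers don’t matter; the point is that when uncertainty over the required bits is spread over a wide range, the slice of worlds in which a few extra qualitatively-similar bits are exactly decisive is thin no matter where the default level sits.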
In general I think it’s better to reason in terms of continuous variables like “how helpful is the iterative design loop” rather than “does it work or does it fail”?
My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that, conditional on the first claim being wrong, the second is not very action-guiding. E.g. conditional on the first being wrong, the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if those are 5% of worlds rather than 50%.
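To spell out why the 5% number wouldn’t change the conclusion, here is a toy counterfactual-impact comparison; the probabilities and payoffs are made up purely for illustration, and it takes your own premise that the default iteration happens regardless of what alignment researchers do.

```python
# Toy counterfactual-impact comparison, conditional on "iterative design can work".
# All numbers are made up for illustration only.
p_fine_by_default = 0.95  # worlds where default iteration succeeds without us
p_margin          = 0.05  # worlds decided by a small margin of extra effort

# "Do marginally more of the default thing": the default happens anyway,
# so our counterfactual contribution is ~0 in every world.
impact_more_default = (p_fine_by_default + p_margin) * 0.0

# "Aim at the margin worlds": only changes the outcome in the margin worlds,
# where it is assumed (for illustration) to flip failure to success.
impact_target_margin = p_margin * 1.0

print(impact_more_default, impact_target_margin)  # 0.0 vs 0.05
```

Even if the margin worlds are rare, they are the only worlds in which anything we choose to do changes the outcome, so they can dominate the comparison.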
(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment—this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let’s say I were somehow approximately-100% convinced that it’s basically possible for iterative design to produce aligned AI. Then I’d expect AI is probably not an X-risk, but I’d still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I’d expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool—otherwise they’d already be handled by the default iterative design processes.
So I guess at that point I’d be looking at quantitative usefulness of iterative design, rather than binary.
General point: it’s just really hard to get a situation where “do marginally more of the thing we already do lots of by default” is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles those poorly-handled problems.