For instance, for debate, one could believe: 1) Debate will work for long enough for us to use it to help find an alignment solution. 2) Debate is a plausible basis for an alignment solution.
I generally don’t think about things in terms of this dichotomy. To me, an “alignment solution” is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don’t think you can separate these two things.
(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training models using debate without adding any more techniques; but I don’t think that really matters. We need to get to the moon, not to Andromeda.)
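(For readers unfamiliar with what “training models using debate” refers to: below is a minimal illustrative sketch of the standard setup from “AI safety via debate” (Irving et al., 2018), where two copies of a model argue opposing sides, a judge picks a winner, and the debaters are trained toward winning. All class and function names here are hypothetical placeholders, not anything specific to this thread, and nothing in the sketch bears on the scalability question being discussed.)

```python
# Illustrative sketch only: one round of debate-based training in the spirit
# of "AI safety via debate" (Irving et al., 2018). The Debater class, judge,
# and reward scheme are stand-ins, not a particular implementation.

import random


class Debater:
    """Stands in for a trained model; argue() and update() are stubs."""

    def __init__(self, name):
        self.name = name

    def argue(self, question, stance, transcript):
        # A real debater would condition on the transcript so far and
        # produce an argument for its assigned stance.
        return f"{self.name} argues '{stance}' on: {question}"

    def update(self, reward):
        # Placeholder for an RL update that reinforces winning arguments.
        pass


def judge(transcript):
    """Stand-in for a (human or human-modeled) judge picking a winner."""
    return random.choice([0, 1])  # index of the winning debater


def debate_episode(question, debaters, num_rounds=3):
    transcript = []
    stances = ["yes", "no"]
    for _ in range(num_rounds):
        for debater, stance in zip(debaters, stances):
            transcript.append(debater.argue(question, stance, transcript))
    winner = judge(transcript)
    # Zero-sum reward: the winner is reinforced, the loser is not.
    for i, debater in enumerate(debaters):
        debater.update(reward=1.0 if i == winner else 0.0)
    return transcript, winner


if __name__ == "__main__":
    debaters = [Debater("A"), Debater("B")]
    transcript, winner = debate_episode("Is this plan safe to execute?", debaters)
    print(f"Winner: debater {winner}")
```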
Oh, to be clear, with “to help find” I only mean that we expect to make significant progress using debate. If we knew we’d safely make enough progress to get to a solution, then you’re quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication)
That’s the distinction I mean to make between (1) and (2): we need to get to the moon safely.
With (1) we have no idea when our rocket will explode.
Similarly, we have no idea whether getting to the moon will be enough for us to know when our next rocket will explode. (not that I’m knocking robustly getting to the moon safely)
If we had some principled argument telling us how far we could push debate before things became dangerous, that’d be great. I’m claiming that we have no such argument, and that all work on debate (that I’m aware of) stands a near-zero chance of finding one.
Of course I’m all for work “on debate” that aims at finding that kind of argument—however, I would expect that such work leaves the specifics of debate behind pretty quickly.