Basically, it sounds like you’re saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we’re providing, then we’re going to need to run an extremely large number of debates to ever get a good answer (i.e., an exponential number of debates for a question where the explanation for the answer is exponentially sized).
I’m not sure why you’re saying this, but in the post, I restricted my claim to NP-like problems. Take traveling salesman, for example: the computation to find good routes may be very difficult, but the explanation for the answer remains short (e.g., an explicit path). So, yes, I’m saying that I don’t see the same sort of argument working for exponentially sized explanations. (Although Rohin’s comment gave me pause, and I still need to think it over more.)
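To make the NP-like point concrete, here’s a toy sketch in Python (the distance matrix and numbers are invented purely for illustration): checking a claimed traveling-salesman tour against a claimed length takes time linear in the number of cities, even though finding a good tour is hard, so a dishonest claim of this shape always has a short, explicit defeater.

```python
# Toy illustration: finding a short tour is hard, but checking a *claimed*
# tour against a claimed length bound is cheap. All data here is made up.

def verify_tour(distances, tour, claimed_length):
    """Check the claim "this tour visits every city once and has length at
    most claimed_length". Return None if it holds, else a short defeater."""
    n = len(distances)
    if sorted(tour) != list(range(n)):
        return "tour does not visit every city exactly once"
    total = sum(distances[tour[i]][tour[(i + 1) % n]] for i in range(n))
    if total > claimed_length:
        return f"actual length {total} exceeds claimed length {claimed_length}"
    return None  # the claim survives: there is no one-step defeater

distances = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(verify_tour(distances, [0, 1, 3, 2], claimed_length=18))  # honest: None
print(verify_tour(distances, [0, 1, 3, 2], claimed_length=15))  # dishonest: defeater
```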
But aside from that, I’m also not sure what you mean by the “run an extremely large number of debates” point. Debate isn’t like search, where we run more/longer to get better answers. Do you mean that my proposal seems to require longer training time to get anywhere? If so, why is that? Or, what do you mean?
It sounds like you’re saying that we cannot require that the judge assume one player is honest/trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can’t assume this, that presumably means that some reasonable fraction of all claims being made are dishonest.
I’m not asserting that the judge should distrust, either. Like the normal debate argument, I want to end up in an honest equilibrium. So I’m not saying we need some kind of equilibrium where the judge is justified in distrust.
My concern involves the tricky relationship between the equilibrium we’re after and what the judge has to actually do during training (when we might not be anywhere near equilibrium). I don’t want the judge to have to pretend answers are honest at times when they’re statistically not. I didn’t end up going through that whole argument in the post (unfortunately), but in my notes for the post, an important criterion was that the judge be able to judge according to their honest opinion at all times during training.
(because if there were only a few dishonest claims, then they’d have honest defeaters and we’d have a clear training signal away from dishonesty, so after training for a bit we’d be able to trust the lower claims).
I agree that that’s what we’re after. But I think maybe the difference in our positions can be captured if we split “honest” into two different notions...
a-honesty: the statement lacks an immediate (a-honest) counterargument. I.e., if I think a statement is a-honest, then I don’t think there’s a next statement you could (a-honestly) tell me that would make me disbelieve it.
b-honesty: the statement cannot be struck down by multi-step (b-honest) debate. I.e., if I think a statement is b-honest, I think that as the debate proceeds, I’ll still believe it.
Both definitions are recursive: each requires the rest of the debate to be honest in the appropriate sense. However, my intuition is that a-honesty can more easily be established incrementally, starting from a slight pressure toward honesty (since it’s supposedly easier in the first place): the opening statements converge to honesty quickly (because honest defeaters commonly show up in the first responses), then the first responses converge, and so on. On the other hand, b-honesty seems relatively difficult to establish by induction; it seems to me that in order to argue that a particular level of the debate is b-honest, you need the whole remainder of the debate to be probably b-honest.
Now, critically, if the judge thinks debaters are a-honest but not b-honest, then the judge will believe NP-type arguments (a TSP path can be struck down by pointing out a single error), but will not trust claimed outputs of exponential-tree computations.
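For contrast, here’s a toy example of the exponential-tree case (the leaf rule and combining rule below are arbitrary placeholders; only the shape of the computation matters):

```python
# Toy illustration of a claimed output of an exponential-tree computation.
# Unlike the TSP check above, there is no short certificate the judge can
# verify directly.

def leaf_value(node):
    return node % 7  # arbitrary placeholder leaf rule

def combine(left, right):
    return (left * 31 + right) % 1_000_003  # arbitrary placeholder rule

def tree_value(node, depth):
    """Value defined over a binary tree with 2**depth leaves; infeasible to
    recompute directly once depth is large (say, depth = 60)."""
    if depth == 0:
        return leaf_value(node)
    return combine(tree_value(2 * node, depth - 1),
                   tree_value(2 * node + 1, depth - 1))

# Suppose a debater claims tree_value(1, 60) == v. The only way to contest
# that claim is to dispute one of the two sub-claims (the values at nodes 2
# and 3), then one of *their* sub-claims, and so on, 60 levels down. Whether
# to believe the root is a b-honesty question; no single a-honest
# counterstatement can settle it.
```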
So my intuition is that, when trying to train for b-honesty, you get debaters making subtle arguments that push the inconsistencies ever further out, because you don’t have the benefit of an inductive assumption that the rest of the debate is probably b-honest; you have no reason to inductively assume that debaters will follow a strategy of recursively descending the tree to zero in on errors. They have no reason to do this if they’re not already in that equilibrium.
This, in turn, means that judges of the debate have little reason to expect b-honesty, and so shouldn’t (realistically) assume that at least one of the debaters is honest; but that exacerbates the problem further, since it means there is little training signal (for debates which really do rest on questions about exponential trees, that is). Hence the need to tell the judge to assume at least one debater is honest.
On the other hand, when training for a-honesty, individual a-dishonest claims can be defeated relatively easily (i.e., in one step). This gives the judge a lot more reason to probabilistically conclude that the next step in the debate would have been a-honest, and thus that all statements seen were probably a-honest (unless the judge sees an explicit defeater, of course).
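As a toy illustration of that probabilistic reasoning (all numbers invented purely for illustration): if a-dishonest claims would usually be met with a one-step defeater, then the absence of a defeater is strong evidence of a-honesty.

```python
# Toy Bayes update for the judge in the a-honesty case. Numbers are made up.
p_dishonest = 0.2               # prior that a given claim is a-dishonest
p_defeated_if_dishonest = 0.9   # one-step defeaters are usually found
p_defeated_if_honest = 0.0      # an a-honest claim has no a-honest defeater

p_no_defeater = (p_dishonest * (1 - p_defeated_if_dishonest)
                 + (1 - p_dishonest) * (1 - p_defeated_if_honest))
p_honest_given_no_defeater = (1 - p_dishonest) / p_no_defeater
print(round(p_honest_given_no_defeater, 3))  # ≈ 0.976
```

The real numbers depend on how reliably one-step defeaters actually get found during training; the point is only the direction of the update.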
Granted, I don’t claim to have a training procedure which results in a-honesty, so I’m not claiming it’s that easy.
At this point, debate isn’t really competitive, because it gives us dud answers almost all the time, and we’re going to have to run an exponential number of debates before we happen on a correct one.
Again, I don’t really get the idea of running more debates. If the debaters are trained well, so they’re following an approximately optimal strategy, we should get the best answer right away.
Are you suggesting we use debate more as a check on our AI systems, to help us discover that they’re bad, rather than as a safe alternative? I.e., debate never produces good answers; it just lets you see that bad answers are bad?
My suggestion is certainly going in that direction, but as with regular debate, I am proposing that the incentives produced by debate could produce actually-good answers, not just helpful refutations of bad answers.
But also, the ‘amplified judge consulting sub-debates’ sounds like it’s just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree.
You’re right, it introduces similar problems. We certainly can’t amplify the judge in that way at the stage where we don’t even trust the debaters to be a-honest.
But consider:
Let’s say we train “to convergence” with a non-amplified judge (or at least to the point where we’re quite confident in a-honesty). Then we can freeze that version of the debaters and start using it as a helper to amplify the judge.
Now, we’ve already got a-honesty, but we’re training for a*-honesty: a-honesty with a judge who can personally verify more statements (and thus recognize more sophisticated defeaters, and thus, trust a wider range of statements on the grounds that they could be defeated if false). We might have to shake up the debater strategies to get them to try to take advantage of the added power, so they may not even be a-honest for a while. But eventually they converge to a*-honesty, and can be trusted to answer a broader range of questions.
Again, we freeze these debater strategies and use them to amplify the judge, and repeat the whole process.
So here, we have an inductive story, where we build up reason to trust each level. This should eventually scale up to the kind of large computation trees that b-honesty was meant to handle.
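A minimal sketch of that loop, with every helper (train_debaters, confident_in_a_honesty, freeze, amplify) a hypothetical stand-in rather than a procedure I’m claiming to have:

```python
# Sketch of the iterated scheme: train to (a-honest) convergence, freeze,
# amplify the judge with the frozen debaters, and repeat. All helpers are
# hypothetical stand-ins.

def iterated_debate_training(base_judge, num_levels):
    judge = base_judge  # start with the plain, non-amplified judge
    debaters = None
    for level in range(num_levels):
        # Train debaters against the current judge until we're quite
        # confident in a-honesty relative to that judge (a*-honesty, etc.).
        debaters = train_debaters(judge, stop_when=confident_in_a_honesty)

        # Freeze this generation of debaters and use them as helpers, so the
        # judge can personally verify a wider range of statements next round.
        judge = amplify(base_judge, helpers=freeze(debaters))
    return debaters, judge
```

Whether helpers from earlier levels should also remain available at later levels is left open here; the sketch just hands the judge the latest frozen generation.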