My non-answer to (2) would be that debate could be used in all of these ways, and the central problem it’s trying to solve is sort of orthogonal to how exactly it’s being used. (Also, the best way to use it might depend on the context.)
What debate is trying to do is let you evaluate plans/actions/outputs that an unassisted human couldn’t evaluate correctly (in any reasonable amount of time). You might want to use that to train a reward model (replacing humans in RLHF) and then train a policy; this would most likely be necessary if you want low cost at inference time. But it also seems plausible that you’d use it at runtime if inference costs aren’t a huge bottleneck and you’d rather get some performance or safety boost from avoiding distillation steps.
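To make the "replace humans in RLHF" use concrete, here is a minimal Python sketch of debate standing in as the preference source when collecting reward-model training data. Everything here is an illustrative stub of my own (the names `run_debate`, `judge_verdict`, `debate_preference`, `collect_reward_model_data` are assumptions, not anything from the debate literature or a real library); it only shows where the debate protocol would slot into an otherwise standard preference-data pipeline.

```python
# Hypothetical sketch: a debate protocol supplying the preference labels
# that human raters would provide in standard RLHF. All functions are stubs.
import random


def run_debate(question: str, answer: str) -> list[str]:
    """Stub: two debater models argue for and against `answer`; returns a transcript."""
    return [f"PRO: {answer} is correct because ...",
            f"CON: {answer} overlooks ..."]


def judge_verdict(question: str, transcript: list[str]) -> float:
    """Stub: a judge (human or model) reads the transcript and scores the answer.
    This is where an output the judge couldn't evaluate unaided becomes evaluable."""
    return random.random()  # placeholder score in [0, 1]


def debate_preference(question: str, answer_a: str, answer_b: str) -> int:
    """Return 0 if the judge prefers answer_a after debate, 1 otherwise.
    Plays the role human raters play when collecting RLHF comparisons."""
    score_a = judge_verdict(question, run_debate(question, answer_a))
    score_b = judge_verdict(question, run_debate(question, answer_b))
    return 0 if score_a >= score_b else 1


def collect_reward_model_data(prompts, policy_sample):
    """Build (prompt, chosen, rejected) triples for reward-model training,
    with debate supplying the preference instead of a human rater."""
    data = []
    for prompt in prompts:
        a, b = policy_sample(prompt), policy_sample(prompt)
        winner = debate_preference(prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        data.append((prompt, chosen, rejected))
    return data


if __name__ == "__main__":
    dummy_policy = lambda p: f"candidate answer to '{p}' #{random.randint(0, 9)}"
    print(collect_reward_model_data(["Is this plan safe?"], dummy_policy)[0])
```

The resulting triples would then feed whatever reward-model and RL machinery you already have; the runtime-use alternative is just calling `debate_preference` (or `judge_verdict`) directly on deployed outputs instead of distilling it away.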
I think the problem of “How can we evaluate outputs that a single human can’t feasibly evaluate?” is pretty reasonable to study independently, agnostic to how you’ll use this evaluation procedure. The main variable is how efficient the evaluation procedure needs to be, and I could imagine advantages to directly looking for a highly efficient procedure. But right now, it makes sense to me to basically split up the problem into “find any tractable procedure at all” (e.g., debate) and “if necessary, distill it into a more efficient model safely.”
It’s worth noting that in cases where you care about average-case performance, you can always distill the behavior back into the model. So, in my view, average-case usage can always be made equivalent to generating training or reward data.
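A correspondingly minimal sketch of that distillation point, under the same caveat that the function names (`run_debate_evaluation`, `build_distillation_set`) are hypothetical stubs of mine: outputs that pass the expensive debate-based evaluation simply become ordinary supervised training pairs, so the deployed model no longer needs debate at inference time.

```python
# Hypothetical sketch: distilling debate-approved behavior back into the model
# by keeping only debate-vetted outputs as supervised fine-tuning data.

def run_debate_evaluation(prompt: str, output: str) -> bool:
    """Stub for the expensive evaluation step: returns True if the output
    survives scrutiny under the debate protocol."""
    return len(output) > 0  # placeholder acceptance rule


def build_distillation_set(prompts, policy_sample, attempts_per_prompt=4):
    """Collect (prompt, output) pairs where the output was approved by debate;
    these can be used directly for supervised fine-tuning of the policy."""
    dataset = []
    for prompt in prompts:
        for _ in range(attempts_per_prompt):
            output = policy_sample(prompt)
            if run_debate_evaluation(prompt, output):
                dataset.append((prompt, output))
                break
    return dataset
```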
I do agree that debate could be used in all of these ways. But at the same time, I think this generality often leads to ambiguity and to papers not describing any such application in detail. That in turn makes it difficult to critique debate-based approaches (both because it is unclear what one is critiquing and because it makes it too easy to accidentally dismiss the critiques via the motte-and-bailey fallacy).