The goal with debate is to scale to situations where the debaters are much more capable than the judge; see "AI safety via debate" for discussion of why this seems plausible.
My apologies, I didn’t frame my question correctly.
> Our current work is looking into training our LLM judges to be better proxies of human judges
My understanding from this statement is that the team plans to finetune Weak LLMs on human judgments and then use them as judges for Strong LLM Debates. This makes sense right now, when human judges are able to assess Strong LLM Debates fairly robustly.
What happens when we want to use a Weak LLM as a judge but there is no accurate or good-enough human judge? At that point we won’t be able to finetune the Weak LLM, because there is no good human judge to learn from. Do we assume that at that stage the Weak LLM itself will be pretty robust?
Oh I see. The main reason we’re training weak LLMs as judges right now is because it lets us iterate faster on our research (relative to using human judges). But we’re imagining having human judges when aligning a model in practice.
(To be clear, I could imagine that we use LLMs as judges even when aligning a model in practice, but we would want to see significantly more validation of the LLM judges first.)
Ah that makes sense, thank you. Did the team also ensure that there wasn’t any data leakage between the tasks being evaluated and the pretraining data? For context, I’m thinking of replicating the results with Llama, so I’m wondering about the same issue there.
I don’t know for sure, but I doubt we checked that in any depth. It would be quite hard to do, and doesn’t seem that important for our purposes, since we’re comparing different post-training algorithms (so pretraining data leakage would affect all of them, hopefully to similar extents).
Oh that’s interesting. Wouldn’t that slightly bias the results? For example, the paper claims no advantage of debate over QA without article. Intuitively, if the weak LLM isn’t pretrained on the QA-without-article data, then debate should work better than consultancy. On the other hand, if it is, then intuitively there should be no difference between Debate and Consultancy, which is what the team observes. Wdyt?
It clearly can’t be having a large effect, since the accuracies aren’t near-100% for any of the methods. I agree leakage would have some effect. The mechanism you suggest is plausible, but it can’t be the primary cause of the finding that debate doesn’t have an advantage—since accuracies aren’t near-100% we know there are some cases the model hasn’t memorized, so the mechanism you suggest doesn’t apply to those inputs.
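To make that concrete, one way to probe it (purely an illustrative sketch with made-up per-question records, not our actual pipeline) is to restrict the debate-vs-consultancy comparison to the questions the judge already gets wrong closed-book, i.e. the ones it plausibly hasn’t memorized:

```python
# Illustrative sketch only: toy records and field names, not the paper's data or pipeline.
# Idea: compare debate vs. consultancy accuracy restricted to questions the judge model
# answers incorrectly closed-book (without the article), i.e. ones it likely hasn't memorized.

records = [
    {"closed_book_correct": True,  "debate_correct": True,  "consultancy_correct": True},
    {"closed_book_correct": False, "debate_correct": True,  "consultancy_correct": False},
    {"closed_book_correct": False, "debate_correct": False, "consultancy_correct": False},
    {"closed_book_correct": False, "debate_correct": True,  "consultancy_correct": True},
]

def accuracy(rows, key):
    """Fraction of rows where the judge was correct under the given protocol."""
    return sum(r[key] for r in rows) / len(rows) if rows else float("nan")

# Keep only questions the judge gets wrong without the article (plausibly not memorized).
not_memorized = [r for r in records if not r["closed_book_correct"]]

print("debate accuracy (non-memorized subset):     ", accuracy(not_memorized, "debate_correct"))
print("consultancy accuracy (non-memorized subset):", accuracy(not_memorized, "consultancy_correct"))
```

If debate still shows no advantage on that subset, memorization can’t be what’s driving the null result.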
More generally, all sorts of things have systematic undesired effects on our results, aka biases. E.g. I suspect the prompts are a bigger deal. Basically any empirical paper will be subject to the critique that aspects of the setup introduce biases.
> since accuracies aren’t near-100% we know there are some cases the model hasn’t memorized, so the mechanism you suggest doesn’t apply to those inputs
That makes sense.
> I suspect the prompts are a bigger deal
Do you suppose that replicating these experiments with LLM debaters/judges of different sizes could serve as a suitable proxy for prompt quality? Let’s say P is the optimal prompt and Q is a suboptimal one; then LLM performance with prompt Q ≤ LLM performance with prompt P ≤ bigger-LLM performance with prompt Q.
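Concretely, the check I have in mind looks something like this (toy numbers and a hypothetical accuracy table, not results from the paper):

```python
# Illustrative sketch of the proposed sanity check -- toy accuracies, not real results.
# If acc(small, Q) <= acc(small, P) <= acc(big, Q), then scaling up the judge/debaters
# compensates for the suboptimal prompt, suggesting prompt quality isn't the dominant factor.

# Hypothetical judge accuracies, indexed by (model size, prompt); P = optimal, Q = suboptimal.
acc = {
    ("small", "Q"): 0.62,
    ("small", "P"): 0.68,
    ("big",   "Q"): 0.74,
}

ordering_holds = acc[("small", "Q")] <= acc[("small", "P")] <= acc[("big", "Q")]
print("acc(small, Q) <= acc(small, P) <= acc(big, Q):", ordering_holds)
```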