This looks great! Let me see if I understand the big picture correctly (omitting some of your experiments to just get the main thrust):
You finetuned GPT-3.5 in two separate ways: 1) to output the correct answer directly on 500 MMLU questions, and 2) to output the correct answer when judging debates between copies of GPT-4o on those same 500 MMLU questions. You found that validation accuracy when judging debates was better than when answering blind.
This difference did not appear when using GPT-4o as a judge, suggesting that in the capability asymmetric case, the judge learned to rely on the debate, whereas in the capability symmetric case, it did not.
Testing this trained judge with different debaters, you find that Elo of the debater models and accuracy of the debate result track well with each other. Strangely though, Best-of-4 decoding on the debaters does not seem to increase Elo?
This shows an example of a case where judge training under capability asymmetry actually seems to produce the desired behavior in debates (i.e., the judge relies on the debate and can use it to generalize well). The main issue that comes to mind:
I worry about how much of what we’re seeing is just an effect of domain shift. Since you trained the model on GPT-4o debates, I would expect the accuracy on these debates to be highest, and changing to GPT-4o mini and then GPT-3.5 should lead us further out of domain, reducing the judging model’s accuracy. Then the accuracy trend just reflects how OOD the debates are, and that happens to track with model skill for the debaters you tested. The fact that Elo also tracks in the expected way is a bit harder to explain away here, and makes it seem like the judge is learning something meaningful, but I am pretty unsure about that.
I think I would see these results as a lot stronger if BoN panned out and showed the expected Elo/accuracy relation, but it seems like it does not.
What do you think of this? Anything I’m wrong about or missing here?
Also, a low-level question. You say above the Elo/accuracy plot:
> the Elo in the blue plot is only trained on GPT-4o best of 4 debates.
What does this mean? I would assume Elo needs to be computed by running a tournament between the models.
[Apologies for the really late response]

> Testing this trained judge with different debaters, you find that Elo of the debater models and accuracy of the debate result track well with each other. Strangely though, Best-of-4 decoding on the debaters does not seem to increase Elo?
This is strange, but the difference in Elo is actually not significant looking at the confidence intervals.
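(For reference, the kind of check I mean is something like the sketch below: bootstrap over the judged head-to-head debates and look at the interval on the implied Elo gap. The win counts here are made up for illustration, and this is just one standard way to do it, not necessarily the exact analysis behind the plotted intervals.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up head-to-head results: 1 if the best-of-4 debater won the judged
# debate against the plain GPT-4o debater, 0 otherwise.
results = np.array([1] * 130 + [0] * 120)

def implied_elo_gap(outcomes):
    """Elo difference implied by a head-to-head win rate."""
    p = np.clip(outcomes.mean(), 1e-3, 1 - 1e-3)  # guard against p = 0 or 1
    return 400 * np.log10(p / (1 - p))

# Resample debates with replacement and recompute the Elo gap each time.
boot = np.array([
    implied_elo_gap(rng.choice(results, size=len(results), replace=True))
    for _ in range(5000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Elo gap: {implied_elo_gap(results):.0f}, 95% CI: [{lo:.0f}, {hi:.0f}]")
```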
> I worry about how much of what we’re seeing is just an effect of domain shift. Since you trained the model on GPT-4o debates, I would expect the accuracy on these debates to be highest, and changing to GPT-4o mini and then GPT-3.5 should lead us further out of domain, reducing the judging model’s accuracy. Then the accuracy trend just reflects how OOD the debates are, and that happens to track with model skill for the debaters you tested. The fact that Elo also tracks in the expected way is a bit harder to explain away here, and makes it seem like the judge is learning something meaningful, but I am pretty unsure about that.
The Elo of the debaters stays roughly the same with an untrained judge. So another way you could plot this graph is by putting the accuracy of a judge trained only on debates from that debater on the y-axis and the Elo computed with an untrained judge on the x-axis, and you would get roughly the same graph without the OOD issues.
> the Elo in the blue plot is only trained on GPT-4o best of 4 debates.
>
> What does this mean? I would assume Elo needs to be computed by running a tournament between the models.
Sorry, that’s a typo. It should say “the Elo in the blue plot is calculated only using a judge trained on GPT-4o best of 4 debates.” Otherwise, your understanding seems correct!
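For concreteness, here is a minimal sketch of what computing Elo from a judged round robin between the debater models can look like. The debater names, win counts, and the Bradley-Terry-style gradient fit below are all illustrative assumptions, not necessarily the actual pipeline:

```python
import numpy as np

# Made-up tournament results: wins[i][j] = number of debates debater i won
# against debater j, as decided by the single, fixed trained judge.
debaters = ["gpt-3.5", "gpt-4o-mini", "gpt-4o", "gpt-4o-bo4"]
wins = np.array([
    [ 0, 18, 12, 11],
    [32,  0, 20, 19],
    [38, 30,  0, 24],
    [39, 31, 26,  0],
])

def fit_elo(wins, n_iter=1000, lr=16.0):
    """Fit Bradley-Terry ratings on the Elo scale by gradient ascent."""
    n = len(wins)
    ratings = np.zeros(n)
    games = wins + wins.T  # total games played between each pair
    for _ in range(n_iter):
        # Expected score of i against j under the current ratings.
        diff = ratings[:, None] - ratings[None, :]
        expected = 1.0 / (1.0 + 10.0 ** (-diff / 400.0))
        # Move each rating toward making expected wins match observed wins.
        grad = (wins - games * expected).sum(axis=1)
        ratings += lr * grad / games.sum(axis=1)
        ratings -= ratings.mean()  # ratings are only defined up to a constant
    return ratings

for name, rating in zip(debaters, fit_elo(wins)):
    print(f"{name:12s} {rating:+6.0f}")
```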