I think that one of the key difficulties for debate research is having good tasks that call for more sophisticated protocols. I think this dataset seems great for that purpose, and having established a negative result for 1-turn debate seems like a good foundation for follow-up work exploring more sophisticated protocols. (It seems like a shame that people don’t normally publish early-stage and negative results.)
In comparison with other datasets (e.g. in the negative results described by Beth), it seems like QuALITY is identifying pretty crisp failures and is within striking distance for modern ML. I haven’t looked at the dataset beyond the samples in the paper but tentatively I’m pretty excited about more people working on it (and excited to see future work from your group!)
I do strongly suspect that multi-turn debates could handle these questions, and if not it would be a pretty significant update about debate / the nature of human reasoning / etc. I think it’s possible those debates would have to get pretty complicated, and it’s also quite plausible that it will be easier to get something else to work. In any case, I feel like the problem is a close enough match for what we care about that doing “whatever it takes” will probably be pretty interesting.