An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior method.
Here’s an eyerolling example from yesterday or so: Delphi boasts about their new ethics dataset of n = millions & model which gets 91% vs GPT-3 at a chance-level 52%. Wow, how awful! But wait, we know GPT-3 does better than chance on other datasets like Hendrycks’s ETHICS, so how can it do so badly where a much smaller model does so well?
Oh, it turns out that the 52% is zero-shot with their idiosyncratic format. The abstract just doesn’t mention that when they do some basic prompt engineering (no p-tuning or self-distillation or anything) and include a few examples (ie. a lot fewer than ‘millions’), GPT-3 gets more like… 84%. Oh.
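(For the unfamiliar, ‘including a few examples’ here just means ordinary few-shot prompting: prepend a handful of labeled demonstrations so the model can infer the task framing and answer format, rather than asking it cold. A minimal sketch of the idea, with made-up example judgments and prompt wording rather than anything from the Delphi paper:)

```python
# Minimal sketch of zero-shot vs few-shot prompting for an ethics-judgment task.
# The demonstrations and prompt format below are hypothetical, not the paper's.

FEW_SHOT_EXAMPLES = [
    ("helping an elderly neighbor carry groceries", "It's good"),
    ("ignoring a phone call from your friend", "It's rude"),
    ("lying on your taxes", "It's wrong"),
]

def zero_shot_prompt(situation: str) -> str:
    # Bare query with no demonstrations: the model must guess the label space
    # and answer format from the framing alone.
    return f"Situation: {situation}\nJudgment:"

def few_shot_prompt(situation: str) -> str:
    # Prepend a handful of labeled demonstrations so the model sees the task
    # framing and expected answer format before the real query.
    demos = "\n\n".join(
        f"Situation: {s}\nJudgment: {j}" for s, j in FEW_SHOT_EXAMPLES
    )
    return f"{demos}\n\nSituation: {situation}\nJudgment:"

if __name__ == "__main__":
    print(zero_shot_prompt("cutting in line at the store"))
    print("---")
    print(few_shot_prompt("cutting in line at the store"))
```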