Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework against the MMLU benchmark and discovered a not-insignificant number of issues with the problem set.
Among them:
- Crucial context missing from questions (apparently copy-paste errors?)
- Ambiguous sets of answers
- Wrong sets of answers
He highlights a growing need for a proper benchmarking organization that can research and create accurate, robust, sensible benchmarking suites for evaluating SOTA models.
I found the video super interesting and the findings very important, so I wanted to share it here.
Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I’ve seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
As the video says, label noise becomes more important as LLMs get closer to 100%. Would making a version 2 be worthwhile? I suppose an LLM could be used to automatically detect most problematic questions, and a human could then verify whether each flagged question needs to be fixed or removed (rough sketch below).
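For what it's worth, here is a minimal sketch of what that flagging pass could look like. None of this is from the video; `ask_llm` is a placeholder for whatever model/API you actually use, and the prompt wording is just illustrative.

```python
# Sketch: have an LLM pre-screen benchmark questions and flag suspicious ones
# for human review. `ask_llm` is a user-supplied callable (prompt -> model text).

import json
from typing import Callable

FLAG_PROMPT = """You are auditing a multiple-choice benchmark question.
Question: {question}
Choices: {choices}
Listed answer: {answer}

Reply with a JSON object: {{"flag": true/false, "reason": "..."}}.
Flag the question if it is missing context, has ambiguous choices,
or if the listed answer appears wrong."""


def screen_questions(questions: list[dict], ask_llm: Callable[[str], str]) -> list[dict]:
    """Return the subset of questions the model thinks a human should re-check."""
    flagged = []
    for q in questions:
        prompt = FLAG_PROMPT.format(
            question=q["question"], choices=q["choices"], answer=q["answer"]
        )
        try:
            verdict = json.loads(ask_llm(prompt))
        except json.JSONDecodeError:
            continue  # skip unparseable model output
        if verdict.get("flag"):
            flagged.append({**q, "reason": verdict.get("reason", "")})
    return flagged  # a human then decides: fix, remove, or keep each one
```

The point is just that the expensive part (reading every question) is automated, while the final fix/remove decision stays with a human reviewer.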
Your position seems to be that this is not something worth worrying about or looking into. Can you explain why?
For instance, if the goal is to train predictive systems to provide accurate information, how is 10%, or even 1-2%, label noise "fine" under those conditions (given that, for example, we could somehow get that number down to 0%)?
It seems like he's mainly responding to the implication that this means MMLU is "broken". Label noise can be both suboptimal and much less important than this post's title suggests.
I imagine researchers at big labs know this and are correcting these errors as models get good enough for this to matter.