Dan H comments on Broken Benchmark: MMLU

Dan H 30 Aug 2023 2:11 UTC
23 points
10
Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I’ve seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
- alenoach 30 Aug 2023 23:47 UTC
  1 point
  2
  Parent
  As the video says, label noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose an LLM could be used to automatically detect most of the problematic questions, and a human could then verify whether each flagged question needs to be fixed or removed.
- awg 30 Aug 2023 15:47 UTC
  1 point
  −2
  Parent
  Your position seems to be that this is not something worth worrying about or looking into. Can you explain why?
  For instance, if the goal is to train predictive systems that provide accurate information, how is 10% or even 1-2% label noise “fine” (if, for example, we could somehow get that number down to 0%)?
  - Richard_Ngo 30 Aug 2023 23:43 UTC
    6 points
    4
    Parent
    It seems like he’s mainly responding to the implication that this means MMLU is “broken”. Label noise can be both suboptimal and also much less important than this post’s title suggests.
  - O O 30 Aug 2023 16:59 UTC
    1 point
    −2
    Parent
    I imagine researchers at big labs know this and are correcting these errors as models get good enough for this to matter.