As the video says, labeling noise matters more as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose an LLM could be used to automatically detect most of the problematic questions, and a human could then verify, for each flagged question, whether it needs to be fixed or removed.
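To make that flag-then-verify idea concrete, here's a rough Python sketch. Everything in it is a placeholder assumption on my part (the `query_llm` stub, the `benchmark_v1.json` layout, the keep/fix/remove choices); it's just the shape of the loop, not anything from the video or an existing tool.

```python
import json

def query_llm(prompt: str) -> str:
    """Stub: plug in whatever LLM API you actually use.

    Returns the model's reply as text; here it always says OK so the
    script runs end to end without a real API key.
    """
    return "OK: stub response"

def flag_problematic(questions):
    """Ask the LLM to screen each item; keep only the ones it flags."""
    flagged = []
    for q in questions:
        prompt = (
            "Check this benchmark item for ambiguity, a wrong gold label, "
            "or multiple defensible answers. Reply FLAG or OK, then a short reason.\n\n"
            f"Question: {q['question']}\nGold answer: {q['answer']}"
        )
        reply = query_llm(prompt)
        if reply.strip().upper().startswith("FLAG"):
            flagged.append({**q, "llm_reason": reply})
    return flagged

def human_review(flagged):
    """Show each flagged item to a human, who decides keep / fix / remove."""
    decisions = []
    for q in flagged:
        print(q["question"])
        print("LLM says:", q["llm_reason"])
        decisions.append({"id": q["id"], "action": input("keep/fix/remove? ")})
    return decisions

if __name__ == "__main__":
    # Hypothetical file: a list of {"id", "question", "answer"} dicts.
    with open("benchmark_v1.json") as f:
        items = json.load(f)
    review_queue = flag_problematic(items)
    print(json.dumps(human_review(review_queue), indent=2))
```

The point of the split is that the LLM only narrows the queue; every removal or edit still goes through a person, so the cleaned v2 doesn't inherit the model's own blind spots.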