On the MMLU benchmark, Chinchilla five-shot reported 67.6% accuracy; how does one convert this to loss or vice versa? More to the point, what loss would the human expert 89.8% correspond to? It would be very interesting to see how much compute that scaling law predicts would be necessary to produce human expert level losses with optimal data availability, or with as much data as is likely available to such a project.
On the MMLU benchmark, Chinchilla five-shot reported 67.6% accuracy; how does one convert this to loss or vice versa? More to the point, what loss would the human expert 89.8% correspond to? It would be very interesting to see how much compute that scaling law predicts would be necessary to produce human expert level losses with optimal data availability, or with as much data as is likely available to such a project.