Another issue: if MATH is contaminated, you’d expect GSM8k to be contaminated too, yet Scale just made a GSM1k, on which GPT/Claude are minimally overfit (although in both of these papers, the Chinese & Mistral models usually appear considerably more overfit than GPT/Claude). Note that Scale made extensive efforts to equalize the difficulty and similarity of GSM1k with GSM8k, which this Consequent AI paper on MATH does not, and discussed the methodological issues that complicate re-benchmarking.