Significant evidence for data contamination of the MATH benchmark: https://arxiv.org/abs/2402.19450
I’m not sold that this shows dataset contamination.
They don’t re-baseline with humans. (Based on my last skim a while ago.)
It is easy to make math problems considerably harder by changing the constants, and math problems are often designed so that the constants are easy to work with.
Both humans and AIs are used to constants chosen to be nice for math problems (obviously this is unrealistic for real problems, but it nonetheless doesn’t clearly show dataset contamination), and AIs might just be more sensitive to awkward constants.
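To make the constant-swap point concrete, here is a minimal sketch (a hypothetical problem template of my own, not one from the paper) of regenerating the same question with freshly sampled constants, so a memorized answer no longer helps and the numbers are no longer "nice":

```python
import random

def resampled_problem(rng):
    """Sketch: one question template whose constants are re-sampled,
    with the ground-truth answer recomputed from the template."""
    a = rng.randint(2, 9)     # no longer guaranteed to be "nice"
    b = rng.randint(10, 99)
    question = f"If f(x) = {a}x + {b}, what is f({a + b})?"
    answer = a * (a + b) + b  # ground truth follows from the template
    return question, answer

rng = random.Random(0)
q, ans = resampled_problem(rng)
```

A model that merely memorized the original instance fails on the re-sampled one, but a model (or human) that relies on nice round constants may also do worse, which is why a drop on perturbed variants is only weak evidence of contamination.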
(I agree it is some evidence for contamination.)
Also an issue: if MATH were contaminated, you’d expect GSM8k to be contaminated too, but Scale just built GSM1k, and on it GPT/Claude are minimally overfit (though in both of these papers, the Chinese and Mistral models usually appear considerably more overfit than GPT/Claude). Note that Scale made extensive efforts to equalize the difficulty and similarity of GSM1k with GSM8k, and discussed the methodological issues that complicate re-benchmarking, neither of which this Consequent AI paper on MATH does.