The part about the previous SOTA being fine-tuned GPT-2, which means a lot of MATH performance was latent in LMs that existed at the time we made the bet. On top of this, the various prompting and data-cleaning changes strike me as revealing latent capacity.
If I thought large language models were already capable of doing simple plug-and-chug problems, I’m not sure why I’d update much on this development. There were some slightly hard problems that the model was capable of doing, that Google highlighted in their paper (though they were cherry-picked)—and for that I did update by a bit (I said my timelines advanced by “a few years”).
>If I thought large language models were already capable of doing simple plug-and-chug problems, I’m not sure why I’d update much on this development.
I suppose I just have different intuitions on this. Let’s just make a second bet. I imagine you can find another item for your list that you’d be comfortable adding—it doesn’t necessarily have to be a dataset, just something in the same spirit as the other items in the list.
I think I’ll pass up an opportunity for a second bet for now. My mistake was being too careless in the first place—and I’m not currently too interested in doing a deeper dive into what might be a good replacement for MATH.
I’m confused. I am not saying that, so I’m not sure which part of my comment you’re agreeing with.
If I found something, I’d be sympathetic to taking another bet. Unfortunately I don’t know of any other good datasets.
You could just drop MATH and make a bet at different odds on the remaining items.