Do you mean that you expect OpenAI deliberately wrote training examples for GPT based on Gary Marcus’s questions, or only that, because Marcus’s examples are on the internet, any sort of “scrape the whole web” process will have pulled them in?
Surely Column B, and maybe a bit of Column A. But even if the researchers didn’t “cheat” by specifically fine-tuning the model on tasks that someone had helpfully pointed out it failed at, I think the likelihood of the model picking up on the exact pattern that appeared verbatim in its training set isn’t zero. So something to test would be to diversify the problems a bit along similar lines and see how well it generalizes. (Note that I generally agree with you on the goalpost-moving that always happens with these things, but let’s try to be rigorous and play Devil’s advocate when running any sort of test with a pretence of being scientific.)