As Marcus himself recently said, you should never test GPT using well-known examples verbatim, because they are almost certainly in the training data. Everyone at OpenAI knows that Gary Marcus broke GPT-3 with those questions, and I strongly suspect that they were explicitly included in the training data for GPT-4.
I suggest rephrasing the same examples and slightly changing the context before declaring victory for GPT-4 (actually, I expect Marcus to do exactly this in a few weeks).
I’ve tried this for a couple of examples and it performed just as well. Additionally, when I asked it which specific prompts and completions Gary Marcus had published, it didn’t seem to reproduce real ones.
I also think that people following the evolution of GPT should expect, as a prior, that these examples will no longer break GPT, as happened with earlier examples. While it’s possible this time will be different, I think automatic strong skepticism without evidence is rather unwarranted.
Addendum: I am also skeptical of the idea that OpenAI put much effort into fixing Gary Marcus’s specific criticisms, as I suspect they do not consider his criticisms particularly important, but proving this sounds difficult.
Do you mean that you expect OpenAI deliberately wrote training examples for GPT based on Gary Marcus’s questions, or only that, because Marcus’s examples are on the internet, any sort of “scrape the whole web” process will have pulled them in?
The former would surely lead to GPT-4 doing better on those examples. I’m not sure the latter would. Scott’s and Marcus’s blog posts, for instance, contain GPT-3’s continuations for those examples; they don’t contain better continuations. Maybe a blog post saying “ha ha, given prompt X GPT continued it to make X Y; how stupid” is enough for the training process to make GPT give better answers when prompted with X, but it’s not obvious that it would be. (On the face of it, it would e.g. mean that GPT is learning more intelligently from its training data than would be implied by the sort of stochastic-parrot model some have advocated. My reading of what Marcus wrote is that he takes basically that view: “What it does is something like a massive act of cutting and pasting, stitching variations on text that it has seen”, “GPT-3 continues with the phrase “You are now dead” because that phrase (or something like it) often follows phrases like “… so you can’t smell anything. You are very thirsty. So you drink it.””, “It learns correlations between words, and nothing more”.)
I don’t know anything about how OpenAI actually select their training data, and in particular don’t know whether they deliberately seed it with things that they hope will fix specific flaws identified by their critics. So the first scenario is entirely possible, and I agree that testing different-but-similar examples would give more trustworthy evidence about whether GPT-4 is really smarter in the relevant ways than GPT-3. But if I had to guess, I would guess that they don’t deliberately seed their training data with their critics’ examples, and that GPT-4 will do about equally well on other examples of difficulty similar to the ones Marcus posted.
(I don’t have access to GPT-4, so I can’t test this myself.)
Do you mean that you expect OpenAI deliberately wrote training examples for GPT based on Gary Marcus’s questions, or only that, because Marcus’s examples are on the internet, any sort of “scrape the whole web” process will have pulled them in?
Surely Column B, and maybe a bit of Column A. But even if the researchers didn’t “cheat” by specifically fine-tuning the model on tasks that someone had helpfully pointed out it failed at, I think the likelihood of the model picking up on the exact pattern that appeared once verbatim in its training set isn’t zero. So something to test would be to diversify the problems a bit along similar lines to see how well it generalizes. (Note that I generally agree with you about the goalpost moving that always happens with these things, but let’s try to be rigorous and play Devil’s advocate when running any sort of test with a pretence of being scientific.)