Do you mean that you expect OpenAI deliberately wrote training examples for GPT based on Gary Marcus’s questions, or only that because Marcus’s examples are on the internet and any sort of “scrape the whole web” process will have pulled them in?
The former would surely lead to GPT-4 doing better on those examples. I’m not sure the latter would. Scott’s and Marcus’s blog posts, for instance, contain GPT-3’s continuations for those examples; they don’t contain better continuations. Maybe a blog post saying “ha ha, given prompt X, GPT continued it with Y; how stupid” is enough for the training process to make GPT give better answers when prompted with X, but it’s not obvious that it would be. (On the face of it, that would mean, e.g., that GPT is learning more intelligently from its training data than the sort of stochastic-parrot model some have advocated would imply. My reading of what Marcus wrote is that he takes basically that view: “What it does is something like a massive act of cutting and pasting, stitching variations on text that it has seen”, “GPT-3 continues with the phrase ‘You are now dead’ because that phrase (or something like it) often follows phrases like ‘… so you can’t smell anything. You are very thirsty. So you drink it.’”, “It learns correlations between words, and nothing more”.)
I don’t know anything about how OpenAI actually select their training data, and in particular don’t know whether they deliberately seed it with things that they hope will fix specific flaws identified by their critics. So the first scenario is very possible, and so I agree that testing different-but-similar examples would give more trustworthy evidence about whether GPT-4 is really smarter in the relevant ways than GPT-3. But if I had to guess, I would guess that they don’t deliberately seed their training data with their critics’ examples, and that GPT-4 will do about equally well on other examples of difficulty similar to the ones Marcus posted.
(I don’t have access to GPT-4 myself, so can’t test this myself.)
Do you mean that you expect OpenAI deliberately wrote training examples for GPT based on Gary Marcus’s questions, or only that because Marcus’s examples are on the internet and any sort of “scrape the whole web” process will have pulled them in?
Surely Column B, and maybe a bit of Column A. But even if the researchers didn’t “cheat” by specifically fine-tuning the model on tasks that someone helpfully pointed out it failed at, the likelihood of the model picking up on the exact pattern that appeared verbatim in its training set isn’t zero. So one thing to test would be to diversify the problems along similar lines and see how well the model generalizes. (Note that I generally agree with you about the goalpost-moving that always happens with these things; but let’s try to be rigorous and play Devil’s advocate when running any test that claims scientific rigour.)
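To make the “diversify the problems” idea concrete, here is a minimal sketch of one way to do it: take a known failure prompt, swap out the incidental details, and query the model on every variant. A model that merely memorised the one verbatim example from its training set would be expected to fail on the variants; consistent success across them is better evidence of generalization. The template and slot fillers below are hypothetical illustrations (loosely styled after the cranberry-juice example), not Marcus’s actual prompts.

```python
import itertools

def make_variants(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Fill every combination of slot values into the template,
    producing surface variants of the same underlying scenario."""
    keys = list(slots)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in itertools.product(*(slots[k] for k in keys))
    ]

# Hypothetical prompt template with incidental details factored out.
template = (
    "You poured yourself a glass of {drink}, but your {sense} is blocked, "
    "so you can't {verb} anything. You are very thirsty. So you drink it."
)
slots = {
    "drink": ["cranberry juice", "lemonade", "iced tea"],
    "sense": ["nose", "sense of smell"],
    "verb": ["smell", "taste"],
}

variants = make_variants(template, slots)
# 3 drinks x 2 senses x 2 verbs = 12 distinct prompts to test the model on.
```

One would then feed each variant to the model and grade the continuations by hand (or with a rubric); the point is that no single variant appears verbatim anywhere on the web, so doing well on all of them can’t be explained by having seen the exact string during training.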