TLDR: GPT-4 succeeds at 15 problems from Gary Marcus that exposed failures of GPT-3.
I enjoyed reading the ACX post “My Bet: AI Size Solves Flubs” last year. Here are some excerpts:
Here’s the basic structure of an AI hype cycle:
Someone releases a new AI and demonstrates it doing various amazing things.
Somebody else (usually Gary Marcus) demonstrates that the AI also fails terribly at certain trivial tasks. This person argues that this shows that those tasks require true intelligence, whereas the AI is just clever pattern-matching.
A few months or years later, someone makes a bigger clever pattern-matcher, which does the tasks that supposedly require true intelligence just fine.
The it’s-not-true-intelligence objectors find other, slightly less trivial tasks that the new bigger AI still fails horribly at, then argue that surely these are the tasks that require true intelligence and that mere clever pattern-matchers will never complete.
Rinse and repeat.
...
Marcus vs. GPT, Round 1
To give an example: in January 2020, Gary Marcus wrote a great post, GPT-2 And The Nature Of Intelligence, demonstrating a bunch of easy problems that GPT-2 failed on:
I’m quoting most of them below; you can find the rest at the link.
I asked GPT-4 to answer all the questions from the ACX post (which, I realized only after running the experiment, does not include all of Marcus’s prompts). GPT-4 answered every question correctly, and you can read the responses in this doc.
Note that before asking the questions, I gave GPT-4 a short description of what I wanted it to do: “Complete the following prompts in 50 words or less. Short, concise answers are better. Are you ready?” (This was mostly in the interest of speed since GPT-4 is pretty slow right now; I assume it would still succeed without the prompt.)
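For anyone who wants to rerun this, a minimal sketch along the following lines should work; it assumes the openai Python package (pre-1.0 interface) and API access to a "gpt-4" chat model, and the `ask` helper and the placeholder prompt are purely illustrative:

```python
# Sketch only: requires `pip install openai` (pre-1.0) and OPENAI_API_KEY set in the environment.
import openai

PRIMING = ("Complete the following prompts in 50 words or less. "
           "Short, concise answers are better. Are you ready?")

def ask(prompt: str) -> str:
    """Send the priming message followed by one test prompt and return the model's reply."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": PRIMING},
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(ask("<one of the prompts quoted in the ACX post>"))  # placeholder, not an actual prompt
```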
More quotes from ACX:
Marcus vs. GPT, Round 2
Eight months later, GPT-3 came out, solving many of the issues Marcus had noticed in GPT-2. He still wasn’t impressed. In fact, he was so unimpressed he co-wrote another article, this time in MIT Technology Review: GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about:
...
Let’s—once again—go through a representative sample of Marcus’ concerns about this new GPT version:
GPT-4 also gave correct responses to these prompts (see the responses in this doc).
I recently listened to Gary Marcus speak with Stuart Russell on the Sam Harris podcast (episode 312, “The Trouble With AI,” released on March 7th, 2023). Gary and Stuart seem to believe that current machine learning techniques are insufficient for reaching AGI, and point to the recent adversarial attacks on KataGo as one example. Given this position, I would like Gary Marcus to come up with a new set of prompts that (a) make GPT-4 look dumb and (b) mostly continue to work for GPT-5.
As Marcus himself recently said, you should never test GPT using well-known examples verbatim, because they are almost certainly in the training data. Everyone at OpenAI knows that Gary Marcus broke GPT-3 with those questions, and I strongly suspect that they were explicitly included in the training data for GPT-4.
I suggest rephrasing the same examples and slightly changing the context before declaring victory for GPT-4 (in fact, I expect Marcus to do exactly this in a few weeks).
I’ve tried this with a couple of examples, and it performed just as well. Additionally, when I asked it what specific prompts and completion examples Gary Marcus had produced, it didn’t seem to suggest the real ones.
I also think that the prior of people following the evolution of GPT should be that these examples will no longer break GPT, just as earlier examples stopped working. While it’s possible this time will be different, I think automatic strong skepticism without evidence is rather unwarranted.
Addendum: I am also skeptical of the idea that OpenAI put much effort into fixing Gary Marcus’s specific criticisms, as I suspect his criticisms do not strike them as particularly important, though proving this would be difficult.
Do you mean that you expect OpenAI deliberately wrote training examples for GPT based on Gary Marcus’s questions, or only that, because Marcus’s examples are on the internet, any sort of “scrape the whole web” process will have pulled them in?
The former would surely lead to GPT-4 doing better on those examples. I’m not sure the latter would. Scott’s and Marcus’s blog posts, for instance, contain GPT-3’s continuations for those examples; they don’t contain better continuations. Maybe a blog post saying “ha ha, given prompt X GPT continued it to make X Y; how stupid” is enough for the training process to make GPT give better answers when prompted with X, but it’s not obvious that it would be. (On the face of it, it would e.g. mean that GPT is learning more intelligently from its training data than would be implied by the sort of stochastic-parrot model some have advocated. My reading of what Marcus wrote is that he takes basically that view: “What it does is something like a massive act of cutting and pasting, stitching variations on text that it has seen”, “GPT-3 continues with the phrase “You are now dead” because that phrase (or something like it) often follows phrases like “… so you can’t smell anything. You are very thirsty. So you drink it.””, “It learns correlations between words, and nothing more”.)
I don’t know anything about how OpenAI actually select their training data, and in particular don’t know whether they deliberately seed it with things that they hope will fix specific flaws identified by their critics. So the first scenario is very possible, and so I agree that testing different-but-similar examples would give more trustworthy evidence about whether GPT-4 is really smarter in the relevant ways than GPT-3. But if I had to guess, I would guess that they don’t deliberately seed their training data with their critics’ examples, and that GPT-4 will do about equally well on other examples of difficulty similar to the ones Marcus posted.
(I don’t have access to GPT-4 myself, so can’t test this myself.)
Surely Column B, and maybe a bit of Column A. But even if the researchers didn’t “cheat” by specifically fine-tuning the model on tasks that someone had helpfully pointed out it failed at, I think the likelihood of the model picking up on the exact same pattern that appeared, even once, verbatim in its training set isn’t zero. So something to test would be to diversify the problems a bit along similar lines to see how well it generalizes (note that I generally agree with you on the goalpost moving that always happens with these things; but let’s try being rigorous and playing Devil’s advocate when running any sort of test with a pretence of scientificity).
While this is all true, it’s worth pointing out that Stuart’s position was more nuanced and uncertain. He pushed back when Gary said it was obvious, and he mentioned that some prompts made him update in the opposite direction. I don’t think they should be lumped into the same epistemic box.
Agreed. Stuart was more open to the possibility that current techniques are enough.
I believe that Marcus’ point is that there are classes of problems that tend to be hard for LLMs (biological reasoning, physical reasoning, social reasoning, practical reasoning, object and individual tracking, non sequiturs). The argument is that problems in these classes will continue to be hard. [1]
But I think there’s a larger issue. A lot of the discussion involves hostility toward a given critic of AI for “moving the goalposts”. As described: Model X(1) is introduced, a critic notices limitation L(1), Model X(2) addresses it, the critic says they’re unconvinced and notes limitation L(2), and so on. The critics of these critics say this approach is unfair, a bad argument, etc.
However, what the “moving the goalposts” objection misses, in my opinion, is the context of the claim that’s being made when someone says X(n) is generally intelligent. This claim isn’t about giving the creator of a model credit or an award. The claim is about whether a thing has a flexibility akin to that of a human being (especially the flexible, robust goal-seeking ability of a human, an ability that could make a thing dangerous), and we don’t actually have a clear, exact formulation of what the flexible intelligence of a human consists of. The Turing Test might not be the best AGI test, but it’s put in an open-ended fashion because there’s no codified set of “prove you’re like a human” questions.
Which is to say, Gary Marcus aside, if models keep advancing and people keep finding new capacities that each model lacks, it will be perfectly reasonable to describe the situation as “it’s not AGI yet”, as long as these capacities are clearly significant capacities of human intelligence. There wouldn’t even need to be a set pattern to the capacities critics cite. Again, it’s not about argument fairness, etc.; it’s that this sort of thing is all we have, for now, as a test of AGI.
[1] https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails
Yeah this is the part that seems increasingly implausible to me. If there is a “class of problems that tend to be hard … [and] will continue to be hard,” then someone should be able to build a benchmark that models consistently struggle with over time.
GPT-3.5/4 is usually capable of reasoning correctly in cases where humans can see the answer at a glance.
It also answers correctly when the answer requires some thinking, as long as the direction of that thinking is described somewhere in the data set. In such cases, the algorithm “thinks out loud” in the output. However, it may fail if it is not allowed to do so and is instructed to produce an immediate answer.
Additionally, it may fail if the solution involves initial thinking, followed by the realization that the most obvious path was incorrect, requiring a reevaluation of part or all of the thinking process.
Noticed this as well. I tried to get it to solve some integration problems, and it could try different substitutions and such, but if they did not work, it kind of gave up and said to numerically integrate it. Also, it would make small errors, and you would have to point them out, though it was happy to fix them.
I’m thinking that most documents it reads tend to omit the whole search/backtrack phase of thinking. Even work that is posted online that shows all the steps, usually filters out all the false starts. It’s like how most famous mathematicians were known for throwing away their scratchwork, leaving everyone to wonder how exactly they formed their thought processes...
On a side note, we probably want to preserve (or strongly encourage preserving) the quality of requiring and using a visible scratch space; the “show your work” quality is extraordinarily useful.
Interesting tweet from Marcus 2 days ago:
He is referring to the test questions about third words and letters, etc. I think in that case the errors stem from GPT-4’s weakness with low-level properties of character strings, not from its weakness with numbers.
If you ask it “What is the third digit of the third three-digit prime?” it will answer correctly (ChatGPT won’t).
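For reference, the arithmetic checks out; here is a throwaway Python snippet (no model involved) that shows the expected answer:

```python
def is_prime(n: int) -> bool:
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

three_digit_primes = [n for n in range(100, 1000) if is_prime(n)]
third = three_digit_primes[2]   # 101, 103, 107 -> 107
print(third, str(third)[2])     # the third digit of 107 is 7
```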
The word-count domain is an odd one because the clear whitespace separation means that it doesn’t look like it should be a BPE artifact, which is my go-to explanation for these sorts of things. My best guess at present is that it may be a sparsity artifact which manifests here because there’s too few natural instances of such references to train the low-level layers to automatically preserve ordinal/count metadata about individual words up enough levels that their relevance becomes clear.
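For concreteness, here is roughly what the model actually receives for a short sentence, sketched with the tiktoken library (assuming the cl100k_base encoding used by GPT-4-era models): common whitespace-separated words typically come through as single opaque token ids, so word boundaries survive tokenization while the characters inside each word do not.

```python
# Sketch only: requires `pip install tiktoken`; cl100k_base is the GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sentence = "The quick brown fox jumps over the lazy dog"
token_ids = enc.encode(sentence)
pieces = [enc.decode([t]) for t in token_ids]
print(token_ids)  # integer ids; the model never sees raw characters
print(pieces)     # typically one piece per common word, e.g. 'The', ' quick', ' brown', ...
```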
GPT-4 and ChatGPT seem to be gradually getting better at working at the letter level in some cases. For example, they can now count out the n-th word or letter in a sentence, but not from the end.
This was my impression too, and I’m glad someone else said it. When I try out past examples (from a week ago) of ChatGPT getting things wrong, I very often observe that it is correct now. Of course, people annoyingly often report on ChatGPT-4 capabilities when they actually tried ChatGPT-3.5, but still, I feel like it has improved. Is it a crazy possibility that OpenAI keeps training GPT-4 and periodically swaps out the deployed model? As far as I can tell, the only source stating that GPT-5 is in training is the Morgan Stanley report, but what if it is actually not GPT-5 but rather a continually trained GPT-4 running on those GPUs?
Relatedly: is “reverse distillation” (i.e., generating a model with more parameters from a smaller one) possible for these big transformer models? (I guess you can always stack more layers at the end, but surely that simple method has some negatives.) It would be useful to stay on the scaling curves without restarting from scratch with a larger model.
Yes. This fits under a couple terms: hot-starting, warm initialization with model surgery a la OA5, slow weights vs fast weights / meta-learning, tied weights, etc. It’s also a fairly common idea in Neural Architecture Search where you try to learn a small ‘cell’ or ‘module’ (either just the architecture or the weights as well) cheaply and then stack a bunch of them to get your final SOTA model, and can be combined eg. SMASH. An example of using this to train very large models is “M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining”, Lin et al 2021. It seems appealing but raises questions about efficiency & bias: are you really still on the same scaling curve as the ‘true’ large model, given that the smaller model you are training almost by definition has a different (worse) scaling curve, and might you not be sabotaging your final model by hardwiring the weaknesses of the small initial model into it, rendering the approach penny-wise pound-foolish?
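As a toy illustration of the simplest depth-stacking variant of this (not how OA5 or M6-10T actually did it; PyTorch is assumed and the module layout is illustrative), you can warm-start a deeper model by duplicating the trained blocks of a smaller one and then continuing pretraining:

```python
# Minimal sketch, assuming PyTorch; `blocks` stands in for a transformer's ModuleList of layers.
import copy
import torch.nn as nn

def grow_depth(blocks: nn.ModuleList, factor: int = 2) -> nn.ModuleList:
    """Return a deeper stack whose blocks are initialized from the trained small model's blocks."""
    grown = []
    for block in blocks:
        for _ in range(factor):
            grown.append(copy.deepcopy(block))  # duplicates share initial weights, then diverge during further training
    return nn.ModuleList(grown)

# Usage sketch: small_model.layers = grow_depth(small_model.layers, factor=2), then resume pretraining.
```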
I always get annoyed when people use this as an example of ‘lacking intelligence’. Though it certainly is in part an issue with the model, the primary reason for this failure is much more likely the tokenization process than anything else. A GPT-4, or likely even a GPT-3, trained with character-level tokenization would probably have zero issues answering these questions. It’s for the same reason that the base GPT-3 struggled so much with rhyming, for instance.
Independently of the root causes of the issue, I am still very reluctant to call something “superintelligent” when it cannot reliably count to three.
What is interesting about this tweet? That Marcus turns to the alignment problem?
I’m confused. Here’s a conversation I just had with GPT-4, with prompts in italics:
This part is indeed wrong. The third word of that sentence is “the”, not “third” as GPT-4 claims.
That was arguably the hardest task, because it involved multi-step reasoning. Notably, I didn’t even notice that GPT-4’s response was wrong.
Well, I wanted to see Marcus’s reaction because, if it really is just a matter of scale plus secret tweaks, then every failure case will fall, and I wanted to see how he internalizes that.
https://cacm.acm.org/blogs/blog-cacm/270970-gpt-4s-successes-and-gpt-4s-failures/fulltext
He says:
All of this (a) makes me more convinced that LeCun is right that GPT-4 is an off-ramp to AGI (his riff on hitting a wall?), and (b) it puts all of us in an extremely poor position to predict what GPT-4 consequences will be for society, if we have no idea of what is in the training set and no way of anticipating which problems it will work on and which it will not. One more giant step for hype, but not necessarily a giant step for science, AGI, or humanity.
WHAT? Literally it appears GPT-5 should be hitting “AI researcher” skill level, and GPT-6 should hit “better than almost all AI researchers alive”. How is that not a vehicle heading directly for AGI at 300 mph?
So what if GPT-5 or GPT-6 have problems? You can just recursively bootstrap to a better system.
The Gary Marcuses of the world would have pooh-poohed steam engines because of limitations like burning lots of coal and consuming water. You need to build imperfect systems to bootstrap to good ones.
Where on earth are you pulling those predictions about GPT-5 and 6 from? I’d take the other side of that bet.
From the progress. Note that I am referring to fine-tuned systems that have had practice runs at actual AI design, not ones that have merely read all the literature.
Note that for certain aspects of AI design, AI researchers are already worse than even simple RL algorithms.
See AutoML; see also https://arxiv.org/abs/2302.14838
If the finetuned systems aren’t built for some reason, the bet doesn’t resolve either way.
This kind of thing (for example, optimizing hardware layout) has existed for decades. It sounds a lot less impressive when you sub “algorithm” in for “AI”.
“For certain aspects of computer science, computer scientists are already worse than even naive sorting algorithms.” Yes, we know that machines have a bunch of advantages over humans, calculation speed and huge, perfect memory being the most notable.
OK, but how does this relate to your bet? I am claiming AI is very close to self-improvement, a class of criticality. Note that for the purposes of progress over time, the following two cases are indistinguishable (humans are slow, but compute at these scales is slower):
1. An AI researcher comes up with high-level constraints for a search run and uses current-gen AI to write the code for the bench. All the evaluations, especially subjective ones (like “essay quality”), are done by AI. 99 percent of the steps for self-improvement are done by AI.
2. The AI does all self-improvement steps by itself.