A Visual Task that’s Hard for GPT-4o, but Doable for Primary Schoolers

Sure, transformers can get silver at the IMO. But before we address those two remaining problems that still stand between AlphaProof and gold, let’s take a few steps back and see how transformers fare at supposedly much easier math competitions. In particular, we will consider a problem aimed at grades 3 and 4 from the largest international math competition around, the Mathematical Kangaroo.

The problem is designed to test children’s spatial thinking and is rightfully held up as a good example in an interview with the competition’s president, Meike Akveld, a knot theorist at ETH. That interview is also where I came across the idea.

It goes like this:

In the following image, you can see an open picture book. There are holes in its cover. On the first page, there are pictures of objects. Which pictures are visible when the book is closed?

To remove some ambiguity, we also give the model the list of objects to choose from, in this case "Ginger", "Kiwi", "Pineapple", "Pear", "Watermelon". OK then, GPT-4o, what do you think?

To determine which pictures will be visible when the book is closed, we need to align the holes on the cover with the pictures on the first page.

From the image:

1. The holes in the cover are located in the second and third columns from the left.
2. The pictures of objects on the first page — from left to right — are: pineapple, watermelon, ginger, kiwi, pear.

Thus, when the book is closed:
- The second column will show the first image: pineapple.
- The third column will show the second image: watermelon.

ANSWER: pineapple, watermelon

Hm, that’s wrong! The answer is {"Pineapple", "Ginger", "Pear"}. Then again, there are some good ideas in the completion: the basic setup of the problem is summarized correctly and the pictures are identified.
Now, it is not so surprising that vision transformers are bad at grids, so let us grant the error about the hole locations. But the model also misses the mirror symmetry: when the cover folds over the spine, left and right swap, so a hole further to the left reveals an object further to the right, and this is exactly what the task is meant to test in children.
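To make that symmetry concrete, here is a minimal sketch in plain Python (not the generator linked below) of the rule a child has to apply: folding the cover over the spine keeps rows fixed and mirrors columns, so a hole at column c lands on column n_cols - 1 - c of the first page. The grid layout below is made up for illustration, not read off the pictured puzzle.

```python
# Minimal sketch of the folding rule; not the actual problem generator.
# Assumption: the cover folds over the spine, so rows stay fixed and
# columns mirror: a hole in column c lands on column (n_cols - 1 - c).

def visible_when_closed(holes, page, n_cols):
    """holes: (row, col) positions of the holes in the open cover.
    page: dict mapping (row, col) -> object drawn on the first page.
    Returns the set of objects visible through the holes when the book is closed."""
    visible = set()
    for row, col in holes:
        mirrored = (row, n_cols - 1 - col)  # left/right swap under the fold
        if mirrored in page:
            visible.add(page[mirrored])
    return visible

# Hypothetical 2x5 layout, chosen only to reproduce the answer set above:
page = {(0, 0): "Ginger", (0, 2): "Kiwi", (0, 4): "Watermelon",
        (1, 1): "Pineapple", (1, 3): "Pear"}
holes = [(0, 4), (1, 1), (1, 3)]
print(visible_when_closed(holes, page, n_cols=5))  # {'Ginger', 'Pear', 'Pineapple'}
```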

Investigating whether this failure mode repeats across different images and iterations (it seemingly does), I made a generator for these problems and sampled 50 of them. GPT-4o with Chain of Thought reaches an accuracy of 0.12, where a random baseline gives 0.04. [The code is available here] and you are cordially invited to play around with it. It should be easy to rescore, since it uses AISI’s new Inspect eval framework, which I can tentatively recommend over the alternatives. I wonder how other models do on this (like Claude 3.5, which I had some trouble getting access to).
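For orientation, here is a hypothetical, stripped-down sketch of how such an eval can be wired into Inspect. It is not the linked code: the image path, target string, and single-sample dataset are placeholders, parameter names may differ between Inspect versions, and `includes()` is only a rough stand-in for a scorer that checks the full answer set.

```python
# Hypothetical, stripped-down Inspect task for the closed-book puzzle;
# the image path, target string, and dataset are placeholders.
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.model import ChatMessageUser, ContentImage, ContentText
from inspect_ai.scorer import includes
from inspect_ai.solver import chain_of_thought, generate

PROMPT = (
    "In the image you see an open picture book with holes in its cover. "
    "Which of these pictures are visible when the book is closed: "
    "Ginger, Kiwi, Pineapple, Pear, Watermelon?"
)

def make_sample(image_path: str, answer: str) -> Sample:
    # One generated puzzle: the rendered book image plus the question text.
    return Sample(
        input=[ChatMessageUser(content=[
            ContentImage(image=image_path),
            ContentText(text=PROMPT),
        ])],
        target=answer,
    )

@task
def closed_book():
    dataset = MemoryDataset([make_sample("puzzle_000.png", "Pineapple, Ginger, Pear")])
    return Task(
        dataset=dataset,
        solver=[chain_of_thought(), generate()],
        scorer=includes(),  # a proper scorer should compare the exact answer set
    )

# Run with e.g.:  inspect eval closed_book.py --model openai/gpt-4o
```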

This really should not be a fundamentally hard problem, and I expect an extensively fine-tuned GPT-4o to do significantly better (>0.6).[1] It remains puzzling to me, though, how far current models seem to be from consistently good 0-shot performance here while doing so well in other areas; this echoes the ARC benchmark, but is perhaps less controversial with regard to whether humans are good at it. Just from the general speed of progress, I also expect this to be solved (>0.95) by a general-purpose multimodal model soon. The only question is: before or after IMO gold?

  1. ^

    Again, attempts at this are very much invited; I just don’t know of any vision fine-tuning API.