Current LLMs do quite badly on the ARC visual puzzles, which are reasonably easy for smart humans.
We do not in fact have strong evidence for this. No human baseline exists for ARC puzzles, among smart humans or otherwise; there is only a claim that two people the designers asked to attempt the puzzles were able to solve them all. It seems entirely plausible to me that the best score on that leaderboard is pretty close to the human median.
Edit: I failed to mention that there is a baseline on the test set, which is different from the eval set that is used for the scoreboard and is, I believe, significantly easier.
Yeah, I failed to mention this. Edited to clarify what I meant.