they put substantial probability on the trend being superexponential
I think that’s too speculative.
I also think that around 25-50% of the questions are impossible or mislabeled.
I wouldn’t be surprised if 3-5% of questions were mislabeled or impossible to answer, but 25-50%? You’re basically saying that HLE is worthless. I’m curious why. I mean, I don’t know much about the people who had to sift through all of the submissions, but I’d be surprised if they failed that badly. Plus, there was a “bug bounty” aimed at improving the quality of the dataset.
TBC, my median to superhuman coder is more like 2031.
Guess I’m a pessimist then, mine is more like 2034.
My point was that it’s surprising that AI is so bad at generalizing to tasks it hasn’t been trained on. I would’ve predicted that generalization would be much better (I also added a link to a post with more examples). This is also why I think creating AGI will be very hard, unless there’s a massive paradigm shift (some new NN architecture or a new way to train NNs).
EDIT: It’s not “Gemini can’t count how many words are in its output” that surprises me, it’s “Gemini can’t count how many words are in its output, given that it can code in Python and a dozen other languages and can also do calculus”.