Why doesn’t it improve on AP English Literature and AP English Language?
I don’t have a good guess, but I found the AP English Language exam description with example questions and grading procedures if anyone wants to take a look.
How is it that bad at Codeforces? I competed a few years ago, and back then Div 2 A and B were extremely simple, basically just “implement the described algorithm in code.” If you submitted them quickly (which I’d expect GPT-4 to excel at), it was easy to reach a significantly better rating than the one reported in this paper.
I hope they didn’t make a mistake by misunderstanding the Codeforces rating system: after a contest, Codeforces only awards a fraction of the gap between your estimated performance rating and your current rating, but it is possible to calculate the rating equivalent of a given performance exactly from the data provided, if you know the details (which I’ve forgotten).
When I searched the paper for the exact methodology (by ctrl-f’ing “codeforces”), I didn’t find anything.
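To make the point about partial rating updates concrete, here is a toy sketch of the inversion described above. The update rule, the fraction k = 0.5, and the starting rating of 0 are all assumptions for illustration; the real Codeforces formula is more involved.

```python
# Toy Elo-style update: only a fraction k of the gap between the contest
# "performance rating" and the current rating is awarded after a contest.
def update(rating: float, performance: float, k: float = 0.5) -> float:
    return rating + k * (performance - rating)

# Inverting the update recovers the performance that produced an observed
# rating change, i.e. the "exact calculation" mentioned above.
def implied_performance(old_rating: float, new_rating: float, k: float = 0.5) -> float:
    return old_rating + (new_rating - old_rating) / k

# Example with hypothetical numbers: an account that starts at 0 and ends at
# 392 after a single contest would have an implied per-contest performance
# of 784 under this toy rule.
print(implied_performance(0, 392))  # 784.0
```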
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it’s a somewhat confusing chart.
I know. I skimmed the paper, and there is a table above the chart showing the results on the tasks for all models (since every model’s performance on Codeforces is below 5%, the bars overlap on the chart). I replied here because it seemed the most thematically appropriate place (it was asking about task performance); sorry if my choice of where to comment was confusing.
From the table:
GPT-3.5’s Codeforces rating is “260 (below 5%)”
GPT-4’s Codeforces rating is “392 (below 5%)”
Perhaps the model wasn’t allowed to read the sources for the free response section?
I think performance on AP English might be a quirk of how they dealt with dataset contamination. The English and Literature exams showed an anomalous amount of contamination (many of the famous texts are online and quoted elsewhere), so they threw out most of the questions, leading to a null conclusion about performance.
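For context on why the English exams in particular would get gutted, here is a minimal sketch of a substring-match contamination check, one common way this kind of filtering is done. The normalization, the 50-character span, and the three samples are assumptions for illustration, not necessarily the paper’s exact procedure.

```python
import random
import re

def normalize(text: str) -> str:
    """Strip everything except letters and digits, lowercased."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def is_contaminated(question: str, training_corpus: str,
                    n_samples: int = 3, span: int = 50) -> bool:
    """Flag a question if any randomly sampled span of it appears
    verbatim in the (normalized) training corpus."""
    q = normalize(question)
    corpus = normalize(training_corpus)
    if len(q) <= span:
        return q in corpus
    for _ in range(n_samples):
        start = random.randrange(len(q) - span + 1)
        if q[start:start + span] in corpus:
            return True
    return False

# A famous passage quoted in an AP English question will almost certainly
# appear somewhere in a web-scale corpus, so questions built around such
# passages get flagged and dropped, shrinking the evaluated set.
```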
Green bars are GPT-4. Blue bars are not. I suspect they just didn’t retest everything.
They did run the tests for all models, from Table 1:
(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)
It would be weird to include them if they didn’t run those tests. My read was that the green bars are the same height as the blue bars, so they’re hidden behind them.
Meaning it literally showed zero difference in half the tests? Does that make sense?
AP exams are scored on a scale of 1 to 5, so yes, getting the exact same score with zero difference makes sense.
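A tiny illustration of why coarse scoring produces exact ties: the cutoffs below are made up (not the College Board’s), but they show how a several-point gap in raw performance can disappear once mapped onto the 1 to 5 scale.

```python
def ap_score(raw_percent: float) -> int:
    """Map a raw percentage to an AP score using hypothetical cutoffs."""
    cutoffs = [(75, 5), (60, 4), (45, 3), (30, 2)]
    for cutoff, score in cutoffs:
        if raw_percent >= cutoff:
            return score
    return 1

# A ~5-point difference in raw performance vanishes after binning.
print(ap_score(52), ap_score(57))  # 3 3
```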
Roughly 1/3 of the tests, but yeah, that’s why I’m confused. It looks weird enough.