I know. I skimmed the paper, and it has a table above the chart giving each model's results on the tasks (since every model's Codeforces performance is below 5%, their points overlap on the chart). I replied here because this comment seemed the most thematically appropriate (it asks about task performance); sorry if my choice of where to comment was confusing.
From the table:
GPT-3.5's Codeforces rating is "260 (below 5%)"
GPT-4's Codeforces rating is "392 (below 5%)"
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it’s a somewhat confusing chart.