Green bars are GPT-4. Blue bars are not. I suspect they just didn’t retest everything.
They did run the tests for all models, from Table 1:
(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)
It would be weird to include them if they didn’t run those tests. My read was that the green bars are the same height as the blue bars, so they are hidden behind them.
Meaning it literally showed zero difference in half the tests? Does that make sense?
AP exams are scored on a scale of 1 to 5, so yes, getting the exact same score with zero difference makes sense.
Roughly 1/3 of the tests, but yeah, that’s why I’m confused. It looks weird enough.