Green bars are GPT-4. Blue bars are not. I suspect they just didn’t retest everything.
They did run the tests for all models, from Table 1:
(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)
It would be weird to include them if they didn’t run those tests. My read was that the green bars are the same height as the blue bars, so they are hidden behind them.
Meaning it literally showed zero difference in half the tests? Does that make sense?
AP exams are scored on a scale of 1 to 5, so yes, getting the exact same score with zero difference makes sense.
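To make the point about the coarse 1 to 5 scale concrete, here is a minimal sketch of how two different raw results can collapse onto the same AP score. The cutoffs and the two raw percentages are hypothetical, chosen only for illustration; real AP cutoffs vary by exam and year, and none of these numbers come from the report.

```python
from bisect import bisect_right

# Hypothetical cutoffs (percent of raw points) for reaching scores 2, 3, 4, 5.
# Real AP cutoffs vary by exam and year; these values are illustrative only.
HYPOTHETICAL_CUTOFFS = [30, 45, 60, 75]


def to_ap_score(raw_percent, cutoffs=HYPOTHETICAL_CUTOFFS):
    """Map a raw percentage onto the 1-5 AP scale using the given cutoffs."""
    # bisect_right counts how many cutoffs the raw score has reached.
    return 1 + bisect_right(cutoffs, raw_percent)


# Made-up raw results for two model variants on the same exam.
with_vision = to_ap_score(68.0)
without_vision = to_ap_score(62.0)

# Both raw results fall in the same bucket, so the reported scores match.
print(with_vision, without_vision)
```

Under these assumed cutoffs, a six-point gap in raw performance disappears once both results are bucketed, which is how bars of identical height could appear even when the underlying runs differed.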
Roughly 1/3 of the tests, but yeah, that’s why I’m confused. It looks weird enough.