The 30,000 ft takeaway I got from this: we’re less than ~2 OOMs from 95% performance, which passes the sniff test and is also scary/exciting.
I assume that’s from looking at the GPT-4 graph. I think the main graph I’d look at for a judgment like this is probably the first graph in the post, without PaLM-2 and GPT-4, because PaLM-2 is 1-shot and the GPT-4 graph covers just 4 benchmarks instead of 20+.
That suggests 90% is ~1 OOM away and 95% is ~3 OOMs away.
(And since PaLM-2 and GPT-4 seemed roughly on trend in the places where I could check them, probably they wouldn’t change that too much.)