Kudos for tracking the predictions, and for making the benchmark! I’d be really excited to see more benchmarks created that current AI does really badly on. Seems like a good way to understand capabilities going forward.
RAB
Karma: 25
I really appreciate your including a number here, that’s useful info. Would love to see more from everyone in the future—I know it takes more time/energy and operationalizations are hard, but I’d vastly prefer to see the easier versions over no versions or norms in favor of only writing up airtight probabilities.
(I also feel much better on an emotional level hearing 20% from you; I would’ve guessed anywhere between 30% and 90%. Others in the community may be similar: I’ve talked to multiple people who were pretty down after reading Eliezer’s last few posts.)
Worth noting that LLMs are no longer using quadratic context window scaling; see e.g. Claude-Long. It seems they’ve figured out how to make it ~linear. And given that GPT-4 has a 32K context window option for corporate clients, it seems like they’re also not using quadratic scaling anymore.
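
For a rough sense of why the scaling exponent matters here, below is a minimal sketch of the arithmetic. It assumes attention cost grows as tokens raised to some exponent (2 for standard full self-attention, 1 for a hypothetical ~linear scheme) and ignores every other part of the model. The 8K baseline and the `relative_cost` helper are my own illustrative choices, not anything the labs have published.

```python
def relative_cost(base_tokens: int, new_tokens: int, exponent: float) -> float:
    """Cost multiplier when context grows, assuming cost ~ tokens ** exponent."""
    return (new_tokens / base_tokens) ** exponent


# Assumed example sizes: an 8K baseline versus a 32K context window.
base, longer = 8_192, 32_768

quadratic = relative_cost(base, longer, 2)  # full pairwise self-attention
linear = relative_cost(base, longer, 1)     # hypothetical ~linear scheme

print(f"quadratic: {quadratic:.0f}x, linear: {linear:.0f}x")
```

Under these assumptions, a 4x longer window costs about 16x as much with quadratic attention but only about 4x with linear scaling, which is why a cheap 32K option would be at least weak evidence about what's going on under the hood.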