Thanks Neel, we agree that we misinterpreted this. We’ve removed the claim.
Lawrence Phillips
Contra papers claiming superhuman AI forecasting
Unit economics of LLM APIs
[EAForum xpost] A breakdown of OpenAI’s revenue
For anyone who’d like to see questions of this type on Metaculus as well, there’s this thread. For certain topics (alignment very much included), we’ll often do the legwork of operationalizing suggested questions and posting them on the platform.
Side note: we’re working on spinning up what is essentially an AI forecasting research program; part of that will involve predicting the level of resources allocated to, and the impact of, different approaches to alignment. I’d be very glad to hear ideas from alignment researchers as to how to best go about this, and how we can make its outputs as useful as possible. John, if you’d like to chat about this, please DM me and we can set up a call.
Annual AGI Benchmarking Event
Nice work. A few comments/questions:
I think you’re being harsh on yourselves by emphasising the cost/benefit ratio. For one, the forecasters were asked to predict Elizabeth van Norstrand’s distributions rather than their mean, right? So this method of scoring would actually reward them for being worse at their jobs, if they happened to put all their mass near the resolution’s mean as opposed to predicting the correct distribution. IMO a more interesting measure is the degree of agreement between the forecasters’ predictions and Elizabeth’s distributions, although I appreciate that that’s hard to condense into an intuitive statistic.
An interesting question this touches on is “Can research be parallelised?”. It would be nice to investigate this more closely. It feels as though different types of research questions might be amenable to different forms of parallelisation, involving more or less communication between individual researchers and more or less sophisticated aggregation functions. For example, a strategy where each researcher is explicitly assigned a separate portion of the problem to work on, and at the end the conclusions are synthesised in a discussion among the researchers, might be appropriate for some questions. Do you have any plans to explore directions like these, or do you think that what you did in this experiment (as I understand it, ad-hoc cooperation among the forecasters, with each submitting a distribution and these then being averaged) is appropriate for most questions? If so, why?
About the anticorrelation between importance and “outsourceability”: investigating which types of questions are outsourceable would be super interesting. You’d think there’d be some connection between outsourceable questions and parallelisable problems in computer science. Again, different aggregation functions/incentive structures will lead to different questions being outsourceable.
One potential use case for this kind of thing could be as a way of finding reasonable distributions over answers to questions that require so much information that a single person or small group couldn’t do the research in an acceptable amount of time or correctly synthesise their conclusions by themselves. One could test how plausible this is by looking at how aggregate performance tracks complexity on problems where one person can do the research alone. So an experiment like the one you’ve done, but on questions of varying complexity, starting from trivial up to the limit of what’s feasible.
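To make the first point above concrete, here is a minimal sketch of how a mean-based score and a distribution-agreement score can disagree. The bins, the probabilities, and the choice of total variation distance are all assumptions made for illustration; none of them come from the experiment itself. The averaging step mirrors the simple aggregation described in the second point.

```python
# Illustrative only: all numbers below are made up, and total variation
# distance is just one possible measure of agreement between distributions.

def mean(bins, probs):
    """Expected value of a discrete distribution over the given bins."""
    return sum(b * p for b, p in zip(bins, probs))

def total_variation(p, q):
    """Total variation distance between two distributions on the same bins."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def average_forecasts(forecasts):
    """Aggregate several forecasts by averaging the probability in each bin."""
    n = len(forecasts)
    return [sum(column) / n for column in zip(*forecasts)]

bins = [0, 1, 2, 3, 4]
target = [0.1, 0.2, 0.4, 0.2, 0.1]         # hypothetical target distribution
point_mass = [0.0, 0.0, 1.0, 0.0, 0.0]     # all mass on the target's mean
close_shape = [0.15, 0.2, 0.3, 0.2, 0.15]  # roughly the right shape

# Both forecasts match the target's mean exactly...
mean_error_point = abs(mean(bins, point_mass) - mean(bins, target))
mean_error_close = abs(mean(bins, close_shape) - mean(bins, target))

# ...but they differ a lot in how well they match the distribution.
tv_point = total_variation(target, point_mass)
tv_close = total_variation(target, close_shape)

aggregate = average_forecasts([point_mass, close_shape])

print(f"mean error: point mass {mean_error_point:.2f}, close shape {mean_error_close:.2f}")
print(f"total variation: point mass {tv_point:.2f}, close shape {tv_close:.2f}")
```

A score based only on distance from the mean cannot tell these two forecasts apart, whereas the distribution-level measure clearly separates them.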
We’d probably try something along the lines you’re suggesting, but there are some interesting technical challenges to think through.
For example, we’d want to train the model to be good at predicting the future, not just knowing what happened. Under a naive implementation, weight updates would probably go partly towards better judgment and forecasting ability, but also partly towards knowing how the world played out after the initial training cutoff.
There are also questions around information retrieval (IR); it seems likely that models will need external retrieval mechanisms to forecast well for the next few years at least, and we’d want to train something that’s natively good at using retrieval tools to forecast, rather than relying purely on its crystallised knowledge.
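One small piece of this can be sketched in code: restricting what a retrieval tool may return to documents published before a question opened, so that retrieved text does not reveal the outcome. This is a hypothetical illustration, not a description of any existing pipeline; the `Document` type, the `retrievable_before` helper, and the example dates are all assumptions. It also only addresses leakage through retrieval, not the weight-update issue described above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    """Hypothetical retrieved document with a known publication date."""
    text: str
    published: date

def retrievable_before(docs, cutoff):
    """Keep only documents published strictly before the cutoff date."""
    return [d for d in docs if d.published < cutoff]

# Made-up corpus and question date, for illustration only.
docs = [
    Document("Background report written before the question opened", date(2023, 11, 5)),
    Document("News article describing how the question resolved", date(2024, 3, 2)),
]
question_opened = date(2024, 1, 1)

visible = retrievable_before(docs, question_opened)
for d in visible:
    print(d.published, d.text)
```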