Scenario Forecasting Workshop: Materials and Learnings
Disclaimer: While some participants and organizers of this exercise work in industry, no proprietary info was used to inform these scenarios, and they represent the views of their individual authors alone.
Overview
In the vein of What 2026 Looks Like and the AI Timelines discussion, we recently hosted a scenario forecasting workshop. Participants first wrote a 5-stage scenario forecasting what will happen between now and ASI. Then, they reviewed, discussed, and revised scenarios in groups of 3. The discussion was guided by forecasts like “If I were to observe this person’s scenario through stage X, what would my ASI timelines median be?”
Instructions for running the workshop, including notes on what we would do differently, are available here. We’ve put 6 shared scenarios from our workshop in a publicly viewable folder here.
Motivation
Writing scenarios may help to:
Clarify views, e.g. by realizing an abstract view is hard to concretize, or realizing that two views you hold don’t seem very compatible.
Surface new considerations, e.g. realizing a subquestion is more important than you thought, or that an actor might behave in a way you hadn’t considered.
Communicate views to others, e.g. clarifying what you mean by “AGI”, “slow takeoff”, or the singularity.
Register qualitative forecasts, which can then be compared against reality. This has advantages and disadvantages vs. more resolvable forecasts (though scenarios can include some resolvable forecasts as well!).
Running the workshop
Materials and instructions for running the workshop, including notes on what we would do differently, are available here.
The schedule for the workshop looked like:
Session 1 involved writing a 5-stage scenario forecasting what will happen between now and ASI.
Session 2 involved reviewing, discussing, and revising scenarios in groups of 3. The discussion was guided by forecasts like “If I were to observe this person’s scenario through stage X, what would my ASI timelines median be?”. There were analogous questions for p(disempowerment) and p(good future).
Session 3 was freeform discussion and revision within groups, followed by a brief feedback session.
Workshop outputs and learnings
6 people (3 anonymous, 3 named) have agreed to share their scenarios. We’ve put them in a publicly viewable folder here.
We received overall positive feedback, with nearly all of the 23 people who filled out the feedback survey saying it was a good use of their time. In general, people found the writing portion more valuable than the discussion. Based on this and a few other pieces of feedback, we’ve included some ideas for improving similar future workshops in our instructions for organizers. A workshop focused much more on the writing than on the discussion might be more valuable.
Speaking for myself (as Eli), I think it was mostly valuable as a forcing function to get people to do an activity they had wanted to do anyway. And scenario writing seems like a good thing for people to spend marginal time on (especially if they find it fun/energizing). It seems worthwhile to experiment with the format (in the ways we suggest above, or other ways people are excited about). It feels like there might be something nearby that is substantially more valuable than our initial pilot.
Romeo Dean and I ran a slightly modified version of this format for members of AISST and we found it a very useful and enjoyable activity!
We first gathered to do 2 hours of reading and discussing, and then we spent 4 hours switching between silent writing and discussing in small groups.
The main changes we made are:
We removed the part where people estimate probabilities of ASI and doom happening by the end of each other’s scenarios.
We added a formal forecasting exercise for 7 benchmarks using private Metaculus questions (forecasting values as of Jan 31, 2025):
GPQA
SWE-bench
GAIA
InterCode (Bash)
WebArena
Number of METR tasks completed
Elo on LMSys arena relative to GPT-4-1106
We think the first change made the workshop better, but in hindsight we would have reduced the number of benchmarks to around 3 (GPQA, SWE-bench, and LMSys Elo), or given participants much more time.
From the Rough Notes section of Ajeya’s shared scenario:
Just to check my understanding, here’s my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.
This is indeed close enough to Epoch’s median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs). ETA: see clarification in Eli’s reply.
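For anyone who wants to re-run the BOTEC above with different inputs, here is a minimal Python sketch. The function and constant names are hypothetical, and every input is copied from the comment (including the assumed 1e15 bf16 FLOP/s peak per H100 and 50% utilization) rather than independently verified.

```python
# Back-of-the-envelope training compute, using only the figures quoted above.
# All names here are illustrative; the inputs are the comment's assumptions.

NUM_GPUS = 5e4                       # 50k H100s
PEAK_FLOP_PER_SEC = 1e15             # assumed bf16 peak per H100
UTILIZATION = 0.5                    # assumed fraction of peak actually achieved
SECONDS_PER_MONTH = 3600 * 24 * 30   # 30-day month


def training_flop(num_gpus, peak_flop_per_sec, utilization, seconds):
    """Total FLOP = GPUs * peak FLOP/s per GPU * utilization * wall-clock seconds."""
    return num_gpus * peak_flop_per_sec * utilization * seconds


total = training_flop(NUM_GPUS, PEAK_FLOP_PER_SEC, UTILIZATION, SECONDS_PER_MONTH)
print(f"Estimated compute: {total:.3g} FLOP")

# Compare against the two Epoch figures mentioned in the comment.
for label, estimate in [("Epoch median", 7.7e25), ("doc's cited figure", 9e25)]:
    print(f"Ratio to {label} ({estimate:.2g}): {total / estimate:.2f}")
```

Swapping in a different peak throughput (for example, an fp8 figure) or a different utilization only requires changing the corresponding constant.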
I’m curious if we have info about the floating point format used for these training runs: how confident are we that labs are using bf16 rather than fp8?
FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.