Engineer at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
Humans don’t need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof that it isn’t a hard constraint. It might need an architecture change, but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.
Regarding 1 and 2, I basically agree that SWAA doesn’t provide much independent signal. The reason we made SWAA was that models before GPT-4 got ~0% on HCAST, so we needed shorter tasks to measure their time horizon. 3 is definitely a concern and we’re currently collecting data on open-source PRs to get a more representative sample of long tasks.
That bit at the end about “time horizon of our average baseliner” is a little confusing to me, but I understand it to mean “if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can’t reliably perform tasks that take longer than an hour”. Which is a pretty interesting point.
That’s basically correct. To give a little more context for why we don’t really believe this number, during data collection we were not really trying to measure the human success rate, just get successful human runs and measure their time. It was very common for baseliners to realize that finishing the task would take too long, give up, and try to collect speed bonuses on other tasks. This is somewhat concerning for biasing the human time-to-complete estimates, but much more concerning for this human time horizon measurement. So we don’t claim the human time horizon as a result.
All models since at least GPT-3 have had this steep exponential decay [1], and the whole logistic curve has kept shifting to the right. The 80% success rate horizon has basically the same 7-month doubling time as the 50% horizon so it’s not just an artifact of picking 50% as a threshold.
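For anyone wanting to see the mechanics, here is a minimal sketch of reading time horizons off a logistic fit of success probability against log task length; the synthetic data and default sklearn settings are my own illustration, not our actual fitting code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: task length in minutes and whether the model succeeded on that run.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
succeeded    = np.array([1, 1, 1, 1, 1,  0,  1,  0,   0,   0])

# Fit P(success) as a logistic function of log2(task length).
X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)
intercept, slope = clf.intercept_[0], clf.coef_[0][0]

def horizon(p):
    """Task length (minutes) at which the fitted success probability equals p."""
    logit_p = np.log(p / (1 - p))
    return 2 ** ((logit_p - intercept) / slope)

print(f"50% horizon: {horizon(0.5):.0f} min, 80% horizon: {horizon(0.8):.0f} min")
```

The 50% and 80% horizons are just two points on the same fitted curve, so unless the curve’s slope changes over time, their doubling times should move together.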
Claude 3.7 isn’t doing better on >2 hour tasks than o1, so it might be that the curve is compressing, but this might also just be noise or imperfect elicitation.
Regarding the idea that autoregressive models would plateau at hours or days: it’s plausible, and one point of evidence is that models are not really coherent over hundreds of steps (generations + uses of the Python tool) yet (they do 1-2 hour tasks with ~10 actions; see section 5 of the HCAST paper). On the other hand, current LLMs can learn a lot in-context and it’s not clear there are limits to this. In our qualitative analysis we found evidence of increasing coherence: o1 fails tasks due to repeating failed actions 6x less often than GPT-4 1106.
Maybe this could be tested by extracting ~1 hour tasks out of the hours-to-days-long projects that we think are heavy in self-modeling, like planning. But we will see whether there’s a plateau in the hours range in the next year or two anyway.
[1] We don’t have tasks easy enough that GPT-2 can do them with >50% success, so we can’t check the shape of its curve.
It’s expensive to construct and baseline novel tasks for this (we spent well over $100k on human baselines), so what we are able to measure in the future depends on whether we can harvest realistic tasks that naturally have human data. You could do a rough analysis on math contest problems, say by assigning GSM8K and AIME questions lengths based on a guess of how long expert humans take, but the external validity concerns are worse than for software. For one thing, AIME has much harder topics than GSM8K (we tried to make SWAA not be artificially easier or harder than HCAST); for another, neither is particularly close to the average few minutes of a research mathematician’s job.
The trend probably sped up in 2024. If the future trend follows the 2024-2025 trend, we get 50% reliability at 167 hours in 2027.
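As a rough back-of-the-envelope version of that extrapolation (the ~1-hour starting horizon, early-2025 start date, and ~4-month doubling time are approximate readings of the trend, not exact paper values):

```python
import math

h0_hours     = 1.0     # assumed ~1-hour 50% horizon for the best model in early 2025
t0           = 2025.2  # assumed start date (early 2025)
doubling_mo  = 4.0     # assumed 2024-2025 doubling time, faster than the longer-run 7 months
target_hours = 167.0   # roughly one working month of human labor

doublings = math.log2(target_hours / h0_hours)          # ~7.4 doublings needed
year_hit  = t0 + doublings * doubling_mo / 12
print(f"{doublings:.1f} doublings -> 167-hour horizon around {year_hit:.1f}")  # ~2027.7
```

167 hours is roughly one working month, which is presumably why that particular threshold is used.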
Author here. My best guess is that by around the 1-month point, AIs will be automating large parts of both AI capabilities and empirical alignment research. Inferring anything more depends on many other beliefs.
Currently no one knows how hard the alignment problem is or what exactly good alignment research means—it is the furthest-looking, least well-defined and least tractable of the subfields of AI existential safety. This means we don’t know the equivalent task length of the alignment problem. Even more importantly, we only measured the AIs at software tasks and don’t know what the trend is for other domains like math or law, it could be wildly different.
With that said, my current guess is that alignment will be sped up by AI slightly less than capabilities will be, that success looks like building deferrable AI, and that whether we succeed depends more on whether the world dedicates more than X% [1] of AI research resources to relevant safety research than on the exact software time horizon of the AIs involved, which is not directly applicable.
[1] X is some unknown number probably between 0% and 65%
AIs (and humans) don’t have 100% reliability at anything, so the graph tracks when AIs get a 50% success rate on our dataset, over all tasks and attempts. We also measure AI horizons at 80% success rate in the paper, and those are about 5x shorter. It’s hard to measure much higher than 80% with our limited task suite, but if we could we would measure 95% and 99% as well.
> So the citation is an unreleased paper! That unreleased paper may make a splash, since (assuming this 7-month-doubling trend is not merely 1-2 years old) it strongly implies we really will find good solutions for turning LLMs agentic fairly soon.
The 7-month doubling trend we measured actually goes back to GPT-2 in 2019. Since 2024, the trend has been faster, doubling roughly every 3-4 months depending on how you measure, but we only have six 2024-25 models so error bars are wide and it’s really unclear which trend will be predictive of the future.
Important results from the paper as I see it:
- RL agents can learn to cheat in the wild, and CoT pressure doesn’t prevent it.
  - The slope of the cheating increase and the final cheating rate are both slightly lower when applying CoT pressure, but all of the cheating goes uncaught by the monitor.
- GPT-4o can zero-shot monitor o3-level agents with 95% accuracy when it has access to CoT, but only 60% when it can only monitor actions.
- If there is no CoT pressure, RL agents telegraph their cheating, such that cheating precision and recall actually go up over time for a monitor that can only see CoT.
- True utility (success rate counting cheating as failure) initially increases faster with CoT pressure than without, despite low initial levels of cheating in both.
  - I’m not sure why this is, given they said the only change in reward related to CoT pressure is that the agent doesn’t get reward when the monitor flags.
- If you allow rejection sampling, the utility curve is roughly similar between CoT pressure and no CoT pressure. This rhymes with the Scaling Laws for Reward Model Overoptimization result, where utility is similar between RL and best-of-k.
Speculation:
- The information that the model is cheating seems basically gone from the CoT, so training against a weak monitor and monitoring using a stronger monitor could be unreliable.
- How much CoT pressure should you apply to maximize utility? Figure 4 suggests that if you do RL, cheating will happen at roughly the same point in training with or without CoT pressure (~0.3 epochs), which also coincides with the maximum utility point, so stop at the point where the non-CoT-pressure model starts to do detectable cheating.
- It is unclear how this applies to cheating / other malicious behaviors that are hard enough that they require CoT. My prior is that capabilities at hiding misaligned behaviors under CoT monitoring will improve at similar rates to capabilities at reasoning without CoT in other domains, although the presence of a monitor might specifically target the skill of obfuscating misaligned reasoning and make it happen sooner.
I think we have two separate claims here:
1. Do technologies that have lots of resources put into their development generally improve discontinuously or by huge slope changes?
2. Do technologies often get displaced by technologies with a different lineage?
I agree with your position on (2) here. But it seems like the claim in the post that sometime in the 2030s someone will make a single important architectural innovation that leads to takeover within a year mostly depends on (1), as it would require progress within that year to be comparable to all the progress from now until that year. Also you said the architectural innovation might be a slight tweak to the LLM architecture, which would mean it shares the same lineage.
The history of machine learning seems pretty continuous with respect to advance prediction. In the Epoch graph, the line fit on the loss of the best LSTM up to 2016 sees a slope change of less than 2x, whereas a hypothetical innovation that causes takeover within a year, with not much progress in the intervening 8 years, would imply a slope change of ~8x. So it seems more likely to me (conditional on 2033 timelines and a big innovation) that we get some architectural innovation with a moderately different lineage in 2027, it overtakes transformers’ performance in 2029, and it afterward causes the rate of AI improvement to increase by something like 1.5x-2x.
2 out of 3 of the technologies you listed probably improved continuously despite the lineage change:
- 1910-era cars were only a little better than horses, and the slope of the overall speed at which someone could travel long distances in the US probably increased by <2x after cars, due to things like road quality improvements before cars and improvements in ships and rail (though maybe railroads were a discontinuity, not sure).
- Before refrigerators we had low-quality refrigerators that would contaminate the ice with ammonia, and before that people shipped ice from Maine, so I would expect the cost/quality of refrigeration to have seen much less than an 8x slope change at the advent of mechanical refrigeration.
- Only rockets were actually a discontinuity.
Tell me if you disagree.
Easy verification makes for benchmarks that can quickly be cracked by LLMs. Hard verification makes for benchmarks that aren’t used.
Agree, this is one big limitation of the paper I’m working on at METR. The first two ideas you listed are things I would very much like to measure, and the third is something I would like to measure too, but it is much harder than any current benchmark given that university takes humans years rather than hours. If we measure it right, we could tell whether generalization is steadily improving or plateauing.
Though the fully-connected → transformer transition wasn’t infinitely many small steps, it definitely wasn’t a single step either. We had to invent various sub-innovations like skip connections separately, progressing from RNNs to LSTMs to GPT/BERT-style transformers to today’s transformer++. The most you could claim is a single step is LSTM → transformer.
Also if you graph perplexity over time, there’s basically no discontinuity from introducing transformers, just a possible change in slope that might be an artifact of switching from the purple to green measurement method. The story looks more like transformers being more able to utilize the exponentially increasing amounts of compute that people started using just before its introduction, which caused people to invest more in compute and other improvements over the next 8 years.
We could get another single big architectural innovation that gives better returns to more compute, but I’d give a 50-50 chance that it would be only a slope change, not a discontinuity. Even conditional on discontinuity it might be pretty small. Personally my timelines are also short enough that there is limited time for this to happen before we get AGI.
A continuous manifold of possible technologies is not required for continuous progress. All that is needed is for there to be many possible sources of improvements that can accumulate, and for these improvements to be small once low-hanging fruit is exhausted.
Case in point: the nanogpt speedrun, where the training time of a small LLM was reduced by 15x using 21 distinct innovations which touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, and Pytorch version.
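To see how small that makes each individual step, under the rough assumption that the 21 innovations contributed about equally (my arithmetic, not a claim from the speedrun itself):

```python
# 21 roughly equal multiplicative improvements compounding to a 15x overall speedup
total_speedup = 15
n_innovations = 21
per_step = total_speedup ** (1 / n_innovations)
print(f"average per-innovation speedup: {per_step:.3f}x (~{(per_step - 1) * 100:.0f}%)")  # ~1.14x, i.e. ~14%
```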
Most technologies are like this, and frontier AI has even more sources of improvement than the nanogpt speedrun because you can also change the training data and hardware. It’s not impossible that there’s a moment in AI like the invention of lasers or the telegraph, but this doesn’t happen with most technologies, and the fact that we have scaling laws somewhat points towards continuity even as other things like small differences being amplified in downstream metrics point to discontinuity. Also see my comment here on a similar topic.
If you think generalization is limited in the current regime, try to create AGI-complete benchmarks that the AIs won’t saturate until we reach some crucial innovation. People keep trying this and they keep saturating every year.
I think eating the Sun is our destiny, both in that I expect it to happen and in that I would be pretty sad if we didn’t; I just hope it will be done ethically. This might seem like a strong statement, but bear with me.
Our civilization has undergone many shifts in values as higher tech levels have revealed the sheer impracticality of living a certain way, and I feel okay about most of these. You won’t see many people nowadays who avoid being photographed because photos steal a piece of their soul. The prohibition on women working outside the home, common in many cultures, is on its way out. Only a few groups like the Amish avoid using electricity for cultural reasons. The entire world economy runs on usury stacked upon usury. People cared about all of these things strongly, but practicality won.
To believe that eating the Sun is potentially desirable, you don’t have to have linear utility in energy/mass/whatever and want to turn it into hedonium. It just seems like extending the same sort of tradeoffs societies make every day in 2025 leads to eating the Sun, considering just how large a fraction of available resources it will represent to a future civilization. The Sun is 99.9% of the matter and more than 99.9% of the energy in the solar system, and I can’t think of any examples of a culture giving up even 99% of its resources for cultural reasons. No one bans eating 99.9% of available calories, farming 99.9% of available land, or working 99.9% of jobs. Today, traditionally minded and off-grid people generally strike a balance between commitment to their lifestyle and practicality, and many of them use phones and hospitals. Giving up 99.9% of resources would mean giving up metal and basically living in the Stone Age. [1]
When eating the Sun, as long as we spend 0.0001% of the Sun’s energy to set up an equivalent light source pointing at Earth, it doesn’t prevent people from continuing to live on Earth, spending their time farming potatoes and painting, nor does it destroy any habitats. There is really nothing of great intrinsic value lost here. We can’t do the same today when destroying the rainforests! If people block eating the Sun and this is making people’s lives worse, it’s plausible we should think of them like NIMBYs who prevent dozens of poor people from getting housing because it would ruin their view.
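For scale, an order-of-magnitude check with standard constants (my own arithmetic) of how much of the Sun’s output Earth actually uses today:

```python
import math

R_earth = 6.371e6   # m, Earth's radius
d       = 1.496e11  # m, Earth-Sun distance (1 AU)

# Earth's cross-sectional area divided by the full sphere at 1 AU.
fraction = (math.pi * R_earth**2) / (4 * math.pi * d**2)
print(f"Earth intercepts ~{fraction:.1e} of the Sun's output (~{fraction * 100:.1e}%)")
# ~4.5e-10, i.e. ~5e-8%, so 0.0001% is a couple thousand times more than Earth receives today
```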
The closest analogies I can think of in the present day are nuclear power bans and people banning vitamin-enriched GMO crops even as children were dying of malnutrition. With nuclear, energy is cheap enough that people can still heat their homes without it, so maybe we’ll have an analogous situation where energy is much cheaper than non-hydrogen matter during the period when we would want to eat the Sun. (We would definitely disassemble most of the planets, though, unless energy and matter are both cheap relative to some third thing, but I don’t see what that would be.) With GMOs I feel pretty sad about the whole situation and wish that science communication were better. At least if we fail to eat the Sun and distribute the gains to society, people probably wouldn’t die as a result.
[1] It might be that 1000xing income is less valuable in the future than it was in the Neolithic, but a Neolithic person would probably also be skeptical that 1000xing resources is valuable until you explained what technology can do now. If we currently value talking to people across the world, why wouldn’t future people value running 10,000 copies of themselves to socialize with all their friends at once?
Will we ever have Poké Balls in real life? How fast could they be at storing and retrieving animals? Requirements:
- Made of atoms, no teleportation or fantasy physics.
- Small enough to be easily thrown, say under 5 inches diameter.
- Must be able to disassemble and reconstruct an animal as large as an elephant in a reasonable amount of time, say 5 minutes, and store its pattern digitally.
- Must reconstruct the animal to enough fidelity that its memories are intact and it’s physically identical for most purposes, though maybe not quite to the cellular level.
- No external power source.
- Works basically wherever you throw it, though it might be slower to print the animal if it only has air to use as feedstock mass or can’t spread out to dissipate heat.
- Should not destroy nearby buildings when used.
- Animals must feel no pain during the process.
It feels pretty likely to me that we’ll be able to print complex animals eventually using nanotech/biotech, but the speed requirements here might be pushing the limits of what’s possible. In particular heat dissipation seems like a huge challenge; assuming that 0.2 kcal/g of waste heat is created while assembling the elephant, which is well below what elephants need to build their tissues, you would need to dissipate about 5 GJ of heat, which would take even a full-sized nuclear power plant cooling tower a few seconds. Power might be another challenge. Drexler claims you can eat fuel and oxidizer, turn all the mass into basically any lower-energy state, and come out easily net positive on energy. But if there is none available you would need a nuclear reactor.
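Spelling out that heat arithmetic (the elephant mass and the cooling tower’s heat-rejection rate are my own rough assumptions):

```python
elephant_mass_g   = 6e6   # assumed ~6-tonne elephant
waste_heat_kcal_g = 0.2   # assumed waste heat per gram of assembled tissue
kcal_to_J         = 4184  # joules per kilocalorie

heat_J = elephant_mass_g * waste_heat_kcal_g * kcal_to_J  # ~5e9 J, i.e. ~5 GJ
cooling_tower_W = 2e9     # assumed ~2 GW of heat rejected by a large power plant's cooling tower
print(f"waste heat ~{heat_J / 1e9:.1f} GJ, ~{heat_J / cooling_tower_W:.1f} s for a big cooling tower")
```

Hitting the 5-minute target would mean rejecting that ~5 GJ at roughly 17 MW from a 5-inch ball, which is why heat dissipation looks like the binding constraint.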
> and yet, the richest person is still only responsible for 0.1%* of the economic output of the united states.
Musk only owns 0.1% of the economic output of the US, but he is responsible for more than this, including large contributions to:
- Politics
- Space
  - SpaceX is nearly 90% of global upmass
  - Dragon is the sole American spacecraft that can launch humans to the ISS
  - Starlink probably enables far more economic activity than its revenue reflects
  - Quality and quantity of US spy satellites (Starshield has ~tripled NRO satellite mass)
- Startup culture, through the many startups founded by ex-SpaceX employees
- Twitter as a medium of discourse, though this didn’t change much
- Electric cars, probably sped up by ~1 year by Tesla, which still owns over half the nation’s charging infrastructure
- AI, including medium-sized effects on OpenAI and potential future effects through xAI
Depending on your reckoning, I wouldn’t be surprised if Elon’s influence added up to >1% of that of all Americans combined. This is not really surprising, because a Zipfian relationship would give the top person in a nation of 300 million about 5% of the total influence.
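A quick check of the Zipfian figure (a toy model matching the assumption in the comment, not data about actual influence distributions):

```python
import math

# Under a Zipfian distribution, the person at rank k has influence proportional to 1/k,
# so the top person's share is 1 / H_n, where H_n is the nth harmonic number.
n = 300_000_000
harmonic_n = math.log(n) + 0.5772156649  # H_n ~ ln(n) + Euler-Mascheroni constant
top_share = 1 / harmonic_n
print(f"top person's share: ~{top_share * 100:.1f}%")  # ~5%
```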
Agree that AI takeoff could likely be faster than our OODA loop.
There are four key differences between this and the current AI situation that I think make this perspective pretty outdated:
1. AIs are made out of ML, so we have very fine-grained control over how we train them and modify them for deployment, unlike animals, which have unpredictable biological drives and long feedback loops.
2. By now, AIs are obviously developing generalized capabilities. Rather than arguments over whether AIs will ever be superintelligent, the bulk of the discourse is over whether they will supercharge economic growth or cause massive job loss, and how quickly.
3. There are at least 10 companies that could build superintelligence within 10ish years, and their CEOs are all high on motivated reasoning, so stopping is infeasible.
4. Current evidence points to takeoff being continuous and merely very fast: even automating AI R&D won’t cause the hockey-stick graph that human civilization had.
External validity is a huge concern, so we don’t claim anything as ambitious as average knowledge worker tasks. In one sentence, my opinion is that our task suite is fairly representative of well-defined, low-context, measurable software tasks that can be done without a GUI. More speculatively, horizons on this suite are probably within a large (~10x) constant factor of horizons on most other software tasks. We have a lot more discussion of this in the paper, especially in section 7.2.1, “Systematic differences between our tasks and real tasks”. The HCAST paper also has a better description of the dataset.
We didn’t try to make the dataset a perfectly stratified sample of tasks meeting that description, but there is enough diversity in the dataset that I’m much more concerned about relevance of HCAST-like tasks to real life than relevance of HCAST to the universe of HCAST-like tasks.