Author here. My best guess is that by around the 1-month point, AIs will be automating large parts of both AI capabilities and empirical alignment research. Inferring anything more depends on many other beliefs.
Currently no one knows how hard the alignment problem is or what exactly good alignment research means—it is the most forward-looking, least well-defined, and least tractable of the subfields of AI existential safety. This means we don’t know the equivalent task length of the alignment problem. Even more importantly, we only measured the AIs on software tasks and don’t know what the trend is for other domains like math or law; it could be wildly different.
With that said, my current guess is that alignment will be sped up by AI slightly less than capabilities will be, that success looks like building deferrable AI, and that whether we succeed depends more on whether the world dedicates more than X% [1] of AI research resources to relevant safety research than on the exact software time horizon of the AIs involved, which is not directly applicable.
[1] X is some unknown number, probably between 0% and 65%.
Even more importantly, we only measured the AIs at software tasks and don’t know what the trend is for other domains like math or law, it could be wildly different.
You probably mention this somewhere, but I’ll ask here: are you currently researching whether these results hold for those other domains? I’m personally more interested in math than in law.
It’s expensive to construct and baseline novel tasks for this (we spent well over $100k on human baselines), so what we are able to measure in the future depends on whether we can harvest realistic tasks that naturally come with human data. You could do a rough analysis on math contest problems, say by assigning GSM8K and AIME questions lengths based on a guess of how long expert humans take, but the external validity concerns are worse than for software. For one thing, AIME covers much harder topics than GSM8K (we tried to make SWAA neither artificially easier nor harder than HCAST); for another, neither is particularly close to the average few minutes of a research mathematician’s job.
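To make the "rough analysis" concrete, here is a minimal sketch of the general time-horizon method: fit a logistic curve of success probability against log task length and read off the length at which success crosses 50%. All data here is invented for illustration—the minute values stand in for guessed expert-human times on contest problems, not measured baselines.

```python
import math

# Hypothetical (assumed human minutes, model success 0/1) pairs.
# In a real analysis these times would come from baselines or expert guesses.
results = [
    (1, 1), (1, 1), (2, 1), (2, 1), (4, 1), (4, 0),
    (8, 1), (8, 0), (16, 0), (16, 1), (32, 0), (32, 0),
    (64, 0), (64, 0),
]

def fit_logistic(data, lr=0.2, steps=10000):
    """Fit p(success) = sigmoid(a + b * log2(minutes)) by gradient descent."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, y in data:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / len(data)
        b -= lr * grad_b / len(data)
    return a, b

a, b = fit_logistic(results)
# The 50% time horizon is where a + b * log2(minutes) = 0.
horizon_minutes = 2 ** (-a / b)
print(f"estimated 50% horizon: {horizon_minutes:.1f} minutes")
```

The external validity caveat above still applies: a horizon estimated this way is only as good as the per-question time guesses, and contest problems differ from working-mathematician tasks.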