Theoretical Alignment’s Second Chance
Historically, I’ve gotten the impression from the AI Safety community that theoretical AI alignment work is promising in long-timeline situations but less relevant in short-timeline ones. As timelines have shrunk and capabilities have accelerated, I feel like I’ve seen theoretical alignment work appear less frequently and with less fanfare.
However, I think we should challenge this communal assumption. In particular, not all capabilities advance at the same rate, and I expect formally verified mathematical reasoning capabilities to accelerate particularly quickly. Formally verified mathematics has a beautifully clean training signal; it feels like the perfect setup for aggressive amounts of RL. We know that DeepMind’s AlphaProof can reach IMO Silver Medal performance (with caveats) while writing its proofs in Lean. There’s no reason to expect that performance to have plateaued in the months since AlphaProof’s announcement. o3’s announcement also showed a huge leap in performance on the FrontierMath benchmark, and I expect that such performance on general mathematical questions, combined with excellent coding ability, will open the door to Lean-checkable proofs of non-trivial results. I think arms-race dynamics are at a point where we can expect models capable of writing near-superhuman proofs in Lean to be released in the coming months.
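For concreteness, “Lean-checkable” means the statement and its proof live in a file that the Lean kernel either accepts in full or rejects. Here is a deliberately trivial sketch, assuming Lean 4 with Mathlib; the particular lemma doesn’t matter, only the all-or-nothing nature of the check:

```lean
import Mathlib

-- The kernel accepts this declaration only if the proof term actually proves
-- the stated theorem. There is no partial credit and no room for hand-waving.
theorem add_sq_nonneg (a b : ℝ) : 0 ≤ (a + b) ^ 2 :=
  sq_nonneg (a + b)
```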
That expectation of near-superhuman Lean provers within months … really doesn’t feel priced in among my theoretical AI Safety connections right now. Everyone still seems to be conducting the same style of research as three months ago, aware of all the new announcements but powerless to do anything about them. I claim that a theoretical AI Safety community that had internalized this expectation would be doing everything in its power to maximize its efficiency once near-superhuman Lean-proof-writing models become available.
I think the theoretical alignment community has 2–3 months to prepare for its “second chance”: a window of opportunity in which superstar PhD students, or maybe even super-von-Neumanns, can be queried on demand, but in which more general long-time-horizon agentic planning is still comparatively poor. A window in which problem-solving ability could cease to be the bottleneck, replaced instead by question-posing skill. I think that, of all the flavors of alignment research, theoretical alignment is going to ride the takeoff curve faster than anything else.
I also claim that nothing short of demanding formal verification for theoretical results will do. Almost all of the current generation of models obfuscate or minimize flaws in their proofs when they cannot fully solve a problem, and I expect this to get harder to detect as mathematical capabilities advance. The formal verification community has spent years building tools that completely neutralize a near-superhuman model’s ability to deceive us mathematically; it feels nonsensical to ignore those tools.
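To make that concrete: in Lean, any gap in a proof has to be declared explicitly, and the toolchain both warns about it and records it, so a model cannot paper over a hole the way it can in natural-language mathematics. A small sketch, again assuming Lean 4 with Mathlib and using a throwaway statement:

```lean
import Mathlib

-- An "almost finished" proof still has to mark its gap with `sorry`. Lean
-- reports "declaration uses 'sorry'", and the dependence is recorded, so the
-- result is never silently treated as proved.
theorem claimed_but_unproven (n : ℕ) : n < n + 1 := by
  sorry

#print axioms claimed_but_unproven  -- shows the dependence on `sorryAx`
```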
What do I think the community should do to prepare?
Formalize quantities of interest and question statements in every field touched by theoretical AI Safety. We don’t need to formalize proofs of already-known theorems; we just need a consensus on formal definitions so we’re not wasting time writing those down when the floodgates open.
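As a sketch of what this could look like (every name below is a hypothetical placeholder, not an existing alignment formalism): the community agrees on and commits the Lean definitions now, and the interesting statements are stubbed with `sorry` until a prover can be pointed at them.

```lean
import Mathlib

/-- Hypothetical placeholder: a policy as a map from observations to actions. -/
abbrev Policy (Obs Act : Type) := Obs → Act

/-- Hypothetical "quantity of interest": how often two policies disagree on a
given finite list of observations. -/
def disagreement {Obs Act : Type} [DecidableEq Act]
    (π₁ π₂ : Policy Obs Act) (obs : List Obs) : ℕ :=
  obs.countP fun o => decide (π₁ o ≠ π₂ o)

/-- A question statement, formalized now and left unproven until a capable
prover is available. -/
theorem disagreement_self {Obs Act : Type} [DecidableEq Act]
    (π : Policy Obs Act) (obs : List Obs) : disagreement π π obs = 0 := by
  sorry
```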
Prepare question banks. Seriously, I think it would make sense for each researcher to have a list of 100–1,000 questions so that if DeepMind releases AlphaProof 2.0 tomorrow, people don’t waste a week or two thinking about what to ask it.
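One possible question-bank format, sketched under the same assumptions as above: each entry is a fully formal statement stubbed with `sorry`, so the day a strong prover ships, the file itself is the prompt. The statements below are throwaway placeholders, not real research questions; the exact format matters far less than having the statements written down and type-checked in advance.

```lean
import Mathlib

namespace QuestionBank

/-- Q001 (placeholder): the sum of two even naturals is even. -/
theorem q001 (m n : ℕ) (hm : Even m) (hn : Even n) : Even (m + n) := by
  sorry

/-- Q002 (placeholder): squaring is monotone on the naturals. -/
theorem q002 (m n : ℕ) (h : m ≤ n) : m ^ 2 ≤ n ^ 2 := by
  sorry

end QuestionBank
```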
Clock’s on. Good luck, everyone.