This all does seem like work better done than not done, who knows, usefulness could ensue in various ways and downsides seem relatively small.
I disagree about item #1, automating formal verification. From the paper:
9.1 Automate formal verification:
As described above, formal verification and automatic theorem proving more generally needs to be fully automated. The awe-inspiring potential of LLMs and other modern AI tools to help with this should be fully realized.
Training LLMs to do formal verification seems dangerous. In fact, I think I would extend that to any method of automating formal verification that would be competitive with human experts. Even if it didn’t use ML at all, the publication of a superhuman theorem-proving AI, or even just the public knowledge that such a thing existed, seems likely to lead to the development of more general AIs with similar capabilities within a few years. Without a realistic plan for how to use such a system to solve the hard parts of AI alignment, I predict that it would just shorten the timeline to unaligned superintelligence, by enabling systems that are better at sustaining long chains of reasoning, which is one of the major advantages humans still have over AIs. I worry that vague talk of using formal verification for AI safety is in effect safety-washing a dangerous capabilities research program.
All that said, a superhuman formal-theorem-proving assistant would be a super-cool toy, so if anyone has a more detailed argument for why it would actually be a net win for safety in expectation, I’d be interested to hear it.
Formally proving that some X you could realistically build has property Y is way harder than building an X with property Y. I know of no exceptions (formal proof only applies to programs and other mathematical objects). Do you disagree?
I don’t understand why you expect the existence of a “formal math bot” to lead to anything particularly dangerous, other than by being another advance in AI capabilities which goes along other advances (which is fair I guess).
Human-long chains of reasoning (as used for taking action in the real world) neither require nor imply the ability to write formal proofs. Formal proofs are about math and making use of math in the real world requires modelling, which is crucial, hard and usually very informal. You make assumptions that are obviously wrong, derive something from these assumptions, and make an educated guess that the conclusions still won’t be too far from the truth in the ways you care about. In the real world, this only works when your chain of reasoning is fairly short (human-length), just as arbitrarily complex and long-term planning doesn’t work, while math uses very long chains of reasoning. The only practically relevant application so-far seems cryptography because computers are extremely reliable and thus modeling is comparatively easy. However, plausibly it’s still easier to break some encryption scheme than to formally prove that your practically relevant algorithm could break it.
LLMs that can do formal proof would greatly improve cybersecurity across the board (good for delaying some scenarios of AI takeover!). I don’t think they would advance AI capabilities beyond the technological advances used to build them and increasing AI hype. However, I also don’t expect to see useful formal proofs about useful LLMs in my lifetime (you could call this “formal interpretability”? We would first get “informal interpretability” that says useful things about useful models.) Maybe some other AI approach will be more interpretable.
Fundamentally, the objection stands that you can’t prove anything about the real world without modeling, and modeling always yields a leaky abstraction. So we would have to figure out “assumptions that allow to prove that AI won’t kill us all while being only slightly false and in the right ways”. This doesn’t really solve the “you only get one try problem”. Maybe it could help a bit anyway?
I expect a first step might be an AI test lab with many layers of improving cybersecurity, ending at formally verified, air-gapped, no interaction to humans. However, it doesn’t look like people are currently worried enough to bother building something like this. I also don’t see such an “AI lab leak” as the main path towards AI takeover. Rather, I expect we will deploy the systems ourselves and on purpose, finding us at the mercy of competing intelligences that operate at faster timescales than us, and losing control.
I disagree about item #1, automating formal verification. From the paper:
Training LLMs to do formal verification seems dangerous. In fact, I think I would extend that to any method of automating formal verification that would be competitive with human experts. Even if it didn’t use ML at all, the publication of a superhuman theorem-proving AI, or even just the public knowledge that such a thing existed, seems likely to lead to the development of more general AIs with similar capabilities within a few years. Without a realistic plan for how to use such a system to solve the hard parts of AI alignment, I predict that it would just shorten the timeline to unaligned superintelligence, by enabling systems that are better at sustaining long chains of reasoning, which is one of the major advantages humans still have over AIs. I worry that vague talk of using formal verification for AI safety is in effect safety-washing a dangerous capabilities research program.
All that said, a superhuman formal-theorem-proving assistant would be a super-cool toy, so if anyone has a more detailed argument for why it would actually be a net win for safety in expectation, I’d be interested to hear it.
Formally proving that some X you could realistically build has property Y is way harder than building an X with property Y. I know of no exceptions (formal proof only applies to programs and other mathematical objects). Do you disagree?
I don’t understand why you expect the existence of a “formal math bot” to lead to anything particularly dangerous, other than by being another advance in AI capabilities which goes along other advances (which is fair I guess).
Human-long chains of reasoning (as used for taking action in the real world) neither require nor imply the ability to write formal proofs. Formal proofs are about math and making use of math in the real world requires modelling, which is crucial, hard and usually very informal. You make assumptions that are obviously wrong, derive something from these assumptions, and make an educated guess that the conclusions still won’t be too far from the truth in the ways you care about. In the real world, this only works when your chain of reasoning is fairly short (human-length), just as arbitrarily complex and long-term planning doesn’t work, while math uses very long chains of reasoning. The only practically relevant application so-far seems cryptography because computers are extremely reliable and thus modeling is comparatively easy. However, plausibly it’s still easier to break some encryption scheme than to formally prove that your practically relevant algorithm could break it.
LLMs that can do formal proof would greatly improve cybersecurity across the board (good for delaying some scenarios of AI takeover!). I don’t think they would advance AI capabilities beyond the technological advances used to build them and increasing AI hype. However, I also don’t expect to see useful formal proofs about useful LLMs in my lifetime (you could call this “formal interpretability”? We would first get “informal interpretability” that says useful things about useful models.) Maybe some other AI approach will be more interpretable.
Fundamentally, the objection stands that you can’t prove anything about the real world without modeling, and modeling always yields a leaky abstraction. So we would have to figure out “assumptions that allow to prove that AI won’t kill us all while being only slightly false and in the right ways”. This doesn’t really solve the “you only get one try problem”. Maybe it could help a bit anyway?
I expect a first step might be an AI test lab with many layers of improving cybersecurity, ending at formally verified, air-gapped, no interaction to humans. However, it doesn’t look like people are currently worried enough to bother building something like this. I also don’t see such an “AI lab leak” as the main path towards AI takeover. Rather, I expect we will deploy the systems ourselves and on purpose, finding us at the mercy of competing intelligences that operate at faster timescales than us, and losing control.