Limitation 1 – We will not obtain strong proofs or formal guarantees about the behavior of AI systems in the physical world. At best we may obtain guarantees about rough approximations of such behavior, over short periods of time.
For many readers with real-world experience working in applied math, the above limitation may seem so obvious that they may wonder whether it is worth stating at all. The reasons for stating it are twofold. First, researchers advocating for GS methods appear to be arguing specifically for the likelihood of near-term solutions that could somehow overcome Limitation 1.
Think about this claimed limitation in the context of today’s deployed technologies and infrastructure. There are 5 million bridges around the world that people use every day to cross rivers and chasms. Many of them have lasted for hundreds of years and new ones are built with confidence all the time. Failures do happen, but they are so rare that news stories are written about them and videos are shared on YouTube. How are these remarkably successful technologies built?
Mechanical engineers have built up a body of knowledge about building safe bridges. They have precise rules about the structure of successful bridges. They have detailed equations, simulations, and tables of material properties. And they know the regimes of validity of their models. When they build a bridge according to these rules, they have a confidence in its safety which is nothing like a “rough approximation… [valid] over short periods of time.”
We are rapidly moving from human engineers to AI systems for design. What should an AI bridge design system look like? It should certainly encode all the rules that the human engineers use! And, to give confidence that its designs actually follow those rules, it would be really great to be able to check that a design follows them. We could hire a human mechanical engineer to audit the AI design, but we’d really like to automate that process.
Enter formal systems like Lean! We can encode the mechanical engineer’s design criteria and particular bridge designs as precise Lean statements. Now the statement that a particular design follows the design criteria becomes a theorem and Lean can provide a proof certificate that it is true. This certificate can be automatically checked by a proof checker and users can have high confidence that the design does indeed follow the design rules without having to trust anyone!
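To make this concrete, here is a minimal, purely illustrative Lean 4 sketch of that workflow. The structure, the “stress rule”, and the numbers are all invented for the example; real design criteria would be far richer. But the shape is the point: the design rule becomes a predicate, a particular design becomes a term, and “this design follows the rule” becomes a theorem whose proof is the independently checkable certificate.

```lean
-- Purely illustrative: a toy "design rule" and one design checked against it.
-- The fields, the rule, and the numbers are invented for this sketch.
structure BridgeDesign where
  spanLength_m      : Nat   -- span length in meters
  yieldStress_kPa   : Nat   -- material yield stress
  appliedStress_kPa : Nat   -- worst-case stress from the load model

-- Toy design rule: worst-case stress stays under yield stress with a 3/2 safety factor.
abbrev meetsStressRule (d : BridgeDesign) : Prop :=
  3 * d.appliedStress_kPa ≤ 2 * d.yieldStress_kPa

def exampleDesign : BridgeDesign :=
  { spanLength_m := 30, yieldStress_kPa := 250000, appliedStress_kPa := 90000 }

-- "This design follows the rule" is now a theorem; the proof term is the
-- certificate that any proof checker (or any skeptical user) can re-verify.
theorem exampleDesign_ok : meetsStressRule exampleDesign := by decide
```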
This formalization process is happening right now in hundreds of different disciplines. It is still new, so early attempts may be incomplete and require engineers to learn to use new tools. But AI is helping here as well. The field of “autoformalization” takes textbooks, manuals, design documents, software, scientific articles, mathematics papers, etc. written in natural language and automatically generates precise formal models of them. Large language models are becoming quite good at this (eg. here’s a recent paper: “Don’t Trust: Verify—Grounding LLM Quantitative Reasoning with Autoformalization”).
As AI autoformalization and theorem proving improve, we should expect the entire corpus of scientific and engineering knowledge to be rapidly represented in precise formal form. AI designs which at least meet the current human requirements for safe systems can then be generated automatically, together with proof certificates of adherence to the design rules.
But we cannot stop at the current level of engineering safety. Today’s technologies and infrastructure work pretty well in today’s human environment. Unfortunately, as AIs become more powerful, they will likely be used in an adversarial way to attack infrastructure, both at the behest of malicious human actors and in the process of satisfying autonomous subgoals. To counter this threat, we need a much higher standard of safety than is common in current engineering practice.
Max and I argue that the only way to have assurance of safety against powerful adversarial agents is through formal methods and mathematical proof. For even moderately complex systems, we will need formal guarantees of safety against every attack path that is available to a specified class of adversaries. This is the level of safety that Provably Safe and Guaranteed Safe approaches are needed to achieve.
I think you are wrong about what a proof that a design follows the mechanical engineer’s design criteria would actually do.
Our bridge design criteria are absolutely not robust to adversarial optimization. It is true that we have come up with a set of guiding principles such that if a human who is genuinely interested in building a good bridge follows them, the result (most of the time) is a bridge that doesn’t fall down. But that doesn’t really generalize at all to what would happen if an AI system wants to design a bridge with some specific fault but is constrained to follow the guiding principles. We are not anywhere close to having guiding principles that are operationalized enough that we could actually prove adherence to them, or guiding principles that, even if they could be operationalized, would be robust to adversarial pressure.
As such I am confused what the goal is here. If I somehow ended up in charge of building a bridge according to modern design principles, but I actually wanted it to collapse, I don’t think I would have any problems in doing so. If I perform an adversarial search over ways to build a bridge that happen to follow the design principles, but where I explicitly lean into the areas where the specification comes apart, then the guidelines will very quickly lose their validity.
I think the fact that there exists a set of instructions that you can give to well-intentioned humans that usually results in a reliable outcome is very little evidence that we are anywhere close to a proof system to which adherence could be formally verified, and would actually be robust to adversarial pressure when the actor using that system is not well-intentioned.
Oh, I should have been clearer. In the first part, I was responding to his “rough approximation… [valid] over short periods of time” claim for formal methods. I was arguing that we can at least copy current methods and, in current situations, get bridges which actually work rather robustly and for a long time.
And, already, the formal version is better in many respects. For one, we can be sure it is being followed! So we get “correctness” in that domain. A second piece, I think, is very important, but I haven’t figured out how to communicate it yet: the proof that a design satisfies the design rules is a specific artifact that anyone can check. So it completely changes the social structure of the situation. Instead of having to rely on an “expert” who may or may not be competent, and may or may not be corrupt, each party is empowered with an absolute guarantee of correctness. I think this alters many social processes dramatically, but it needs to be fleshed out and better explained.
After that argument, I go to the next piece which you mention. Today’s engineering practices are not likely to be robust against powerful adversaries (eg. powerful AI or humans backed up by powerful AI). And I don’t think current practices can deal with that very well. In the AI safety space, the typical approach is “red teaming” where humans try to trigger AIs to produce bad scenarios and they see how easy it is and how powerful the attacks are. This can find problems but can’t show the absence of safety vulnerabilities.
With mathematical proof, we can systematically consider the entire space of possible actions by adversaries in the specified class. Using techniques like “branch and bound”, we can systematically eliminate regions of the action space which are shown to be safe. And if the system is actually safe, there is a proof of that (by Gödel’s completeness theorem, which says that any property which holds in all models of a set of axioms can be proven from those axioms). If the systems are complex, the proofs can be large, so there is value in “designing for verification”.
That’s a possibility which provides actual safety against AI and other adversaries and provides detailed information about the value of different features, etc. Several groups are working right now to develop examples of this kind. Hopefully, the process can eventually be automated so that we can put it on the same timescale as AI advancement.
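To sketch the branch-and-bound idea concretely (this is a toy illustration, not any of those groups’ actual tooling): we certify a property over an entire input region by computing a crude interval bound on each region, discharging the regions that are provably safe and splitting the ones that are not yet decided. The function, threshold, and tolerance below are arbitrary stand-ins.

```python
# Toy branch-and-bound certification: show that f(x) = x^2 - 2x + 3 stays >= 1.5
# for every x in [0, 4]. All numbers here are arbitrary stand-ins for a real
# safety property over an adversary's action space.

def interval_lower_bound(lo, hi):
    """Crude interval-arithmetic lower bound of f over [lo, hi]."""
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)  # lower bound of x^2
    return sq_lo - 2.0 * hi + 3.0                              # -2x is smallest at x = hi

def certify(lo, hi, threshold=1.5, tol=1e-4):
    """Return True iff f(x) >= threshold is certified for all x in [lo, hi]."""
    if interval_lower_bound(lo, hi) >= threshold:
        return True                       # whole region eliminated as safe
    if hi - lo < tol:
        return False                      # tiny undecided region: flag as a potential flaw
    mid = (lo + hi) / 2.0                 # branch: split the region and certify each half
    return certify(lo, mid, threshold, tol) and certify(mid, hi, threshold, tol)

if __name__ == "__main__":
    print(certify(0.0, 4.0))              # True: every subregion is eventually proven safe
```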
Steve, thanks for your explanations and discussion. I just posted a base reply about formal verification limitations within the field of computer hardware design. In that field, ignoring for now the very real issue of electrical and thermal noise, there is immense value in verifying that the symbolic 1s and 0s of the digital logic will successfully execute the similarly symbolic software instructions correctly. So the problem space is inherently simplified from the real world, and the silicon designers have incentive to build designs that are easy to test and debug, and yet only small parts of designs can be formally verified today. It would seem to me that, although formal verification will keep advancing, AI capabilities will advance faster, and we need to develop simulation testing approaches to AI safety that are as robust as possible. For example, in silicon design one can make sure the tests have at least executed every line of code. One could imagine having a METR test suite and trying to ensure that every neuron in a given AI model has been at least active and inactive. It’s not a proof, but it would speak to the breadth of the test suite in relation to the model. Are there robustness criteria for directed and random testing that you consider highly valuable without having a full safety proof?
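(To sketch what I mean by that coverage criterion, here is a toy version with a made-up NumPy MLP; the model, activation threshold, and test inputs are arbitrary stand-ins.)

```python
# Toy "neuron coverage" metric: over a test suite, check that each hidden unit
# of a small random ReLU network has been observed both active (> 0) and
# inactive (== 0). Everything here is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # toy 8 -> 16 hidden layer

def hidden_activations(x):
    return np.maximum(0.0, W1 @ x + b1)                  # ReLU activations

test_suite = [rng.normal(size=8) for _ in range(200)]    # stand-in for a real test set

seen_active = np.zeros(16, dtype=bool)
seen_inactive = np.zeros(16, dtype=bool)
for x in test_suite:
    h = hidden_activations(x)
    seen_active |= h > 0
    seen_inactive |= h == 0

covered = seen_active & seen_inactive
print(f"neuron coverage: {covered.sum()}/{covered.size} units seen both on and off")
```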
Testing is great for a first pass! And in non-critical and non-adversarial settings, testing can give you actual probabilistic bounds. If the probability distribution of the actual uses is the same as the testing distribution (or close enough to it), then the test statistics can be used to bound the probability of errors during use. I think that is why formal methods are so rarely used in software: testing is pretty good and if errors show up, you can fix them then. Hardware has greater adoption of formal methods because it’s much more expensive to fix errors after the fact.
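To make the “bound the probability of errors” point concrete: under that i.i.d. assumption, if you observe zero failures in n tests, the exact Clopper-Pearson upper bound on the true failure probability at confidence 1 - alpha is 1 - alpha^(1/n) (the familiar “rule of three” is its roughly 3/n approximation at 95%). A minimal sketch:

```python
# Sketch of the "test statistics bound the error rate" point, valid only under
# the stated assumption that deployment inputs are drawn i.i.d. from the same
# distribution as the tests. With zero failures in n tests, the exact
# Clopper-Pearson upper bound at confidence 1 - alpha is 1 - alpha**(1/n).

def error_rate_upper_bound(n_tests: int, alpha: float = 0.05) -> float:
    """Upper confidence bound on the true failure probability after n passing tests."""
    return 1.0 - alpha ** (1.0 / n_tests)

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} passing tests -> p_fail <= {error_rate_upper_bound(n):.1e} (95% conf.)")
```

Note that this bound evaporates the moment an adversary, rather than the test distribution, chooses the inputs, which is exactly the problem below.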
But the real problems arise from adversarial attacks. The statistical correctness of a system doesn’t matter to an adversary. They are looking for the weird outlier cases which will enable them to exploit the system (eg. inputs with non-standard characters that break the parser, or super-long inputs which overflow a buffer and enable unexpected access to memory, etc.). Testing can’t show the absence of flaws (unless every input is tested!).
I think the increasing plague of cyberattacks is due to adversaries becoming more sophisticated in their search for non-standard ways of interacting with systems that expose their untested and unproven underbelly. But that kind of sophisticated attack requires highly skilled attackers, and those are fortunately still rare.
What is coming, however, are AI-powered cyberattack systems which know all of the standard flaws and vulnerabilities of systems, all of the published 1-day vulnerabilities, all of the latest social engineering techniques discussed on the dark web, and have full access to reverse engineering tools like Ghidra. Those AIs are likely being developed as we speak in various government labs (eg. here is a list of significant recent cyber incidents: https://www.csis.org/programs/strategic-technologies-program/significant-cyber-incidents ).
How long before powerful cyberattack AIs are available on BitTorrent to teenage hackers? So, I believe the new reality is that every system, software and hardware, needs to be proven correct and secure to have any confidence in it. To do that, we are likely to need to use AI theorem provers and AI-verified software synthesis systems. Fortunately, many groups are showing rapid progress on those!
But that doesn’t mean testing is useless. It’s very helpful during the development process and in better understanding systems. For final deployment in an environment with powerful AIs, however, I don’t think it’s adequate any more.
As a concrete example, consider the recent Francis Scott Key Bridge collapse, in which an out-of-control container ship struck one of the piers of the bridge. It killed six people, blocked the Port of Baltimore for 11 weeks, and the bridge will cost $1.7 billion and take four years to replace.
Could the bridge’s designers way back in 1977 have anticipated that a bridge over one of the busiest shipping routes in the United States might some day be struck by a ship? Could they have designed it not to collapse in this circumstance? Perhaps by shoring up the base of its piers to absorb the impact of a ship?
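For a rough sense of the scale involved (the mass and speed below are order-of-magnitude assumptions for a loaded container ship, not the vessel’s actual figures), a back-of-the-envelope kinetic energy estimate:

```python
# Order-of-magnitude sketch only: assumed mass and speed for a loaded container
# ship, to show the scale of energy a pier-protection system would have to
# absorb or deflect.
mass_kg = 100_000 * 1_000          # ~100,000 tonnes displacement (assumed)
speed_m_s = 8 * 0.514              # ~8 knots in m/s (assumed)
kinetic_energy_J = 0.5 * mass_kg * speed_m_s ** 2
print(f"kinetic energy ~ {kinetic_energy_J / 1e6:.0f} MJ")   # roughly 850 MJ
```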
This was certainly a tragedy and is believed to have been an accident. But the cause was electrical problems on the ship: “At 1:24 a.m., the ship suffered a ‘complete blackout’ and began to drift out of the shipping channel; a backup generator supported electrical systems but did not provide power to the propulsion system.”
How well designed was that ship’s generator and backup generator? Could adversarial attackers cause this kind of electrical blackout in other ships? Could remote AI systems do it? How many other bridges are vulnerable to impacts from out-of-control ships? What is the economic value at stake from this kind of flawed safety engineering? How much is it worth to create designs which provide guaranteed protections against this kind of flaw?
That example seems particularly hard to ameliorate with provable safety. To focus on just one part, how could we prove the ship would not lose power long enough to crash into something? If you try to model the problem at the level of basic physics, it’s obviously impossible. If you model it at the level of a circuit diagram, it’s trivial—power sources on circuit diagrams do not experience failures. There’s no obviously-correct model granularity; there are Schelling points, but what if threats to the power supply do not respect our Schelling points?
It seems to me that, at most, we could prove safety of a modeled power supply, against a modeled, enumerated range of threats. Intuitively, I’m not sure that compares favorably to standard engineering practices, optimizing for safety instead of for lowest-possible cost.
In general, we can’t prevent physical failures. What we can do is to accurately bound the probability of them occurring, to create designs which limit the damage that they cause, and to limit the ability of adversarial attacks to trigger and exploit them. We’re advocating for humanity’s entire infrastructure to be upgraded with provable technology to put guaranteed bounds on failures at every level and to eliminate the need to trust potentially flawed or corrupt actors.
In the case of the ship, there are both questions about the design of that ship’s components and its provenance. Why did the backup power not enable the propulsion system to stop? Why wasn’t there a “failsafe” anchor which drops if the systems become inoperable? Why didn’t the port have tugboats guiding risky ship departures? What was the history of that ship’s generators? Etc. With the kind of provable technology that Max and I outlined, it is possible to have provably trustable data about the components of the ship, about their manufacture, about their provenance since manufacture, about the maintenance history of the ship’s components, etc.
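One small ingredient of “provably trustable” component histories can be sketched simply. This is only the tamper-evidence core (a real system would add digital signatures, attested sensors, zero-knowledge proofs, and so on): a hash-chained maintenance log in which any edit to a past entry breaks verification.

```python
# Minimal sketch of a tamper-evident, hash-chained maintenance log. The events
# below are invented; this shows only the integrity-checking core of a
# "provably trustable" component history.
import hashlib, json

def chain(records):
    """Bind each entry to everything that came before it."""
    prev, out = "GENESIS", []
    for rec in records:
        payload = json.dumps({"prev": prev, **rec}, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        out.append({"hash": prev, **rec})
    return out

def verify(log):
    """Re-derive every hash; any edited or deleted entry breaks the chain."""
    prev = "GENESIS"
    for entry in log:
        rec = {k: v for k, v in entry.items() if k != "hash"}
        payload = json.dumps({"prev": prev, **rec}, sort_keys=True)
        if hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = chain([
    {"event": "generator installed", "part": "GEN-01", "date": "2015-03-02"},
    {"event": "generator serviced",  "part": "GEN-01", "date": "2023-11-17"},
])
print(verify(log))              # True
log[0]["date"] = "2016-01-01"   # tamper with the recorded history...
print(verify(log))              # ...and verification now fails: False
```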
The author of the main post and other critics argue against formal methods doing complex “magical” things like determining which DNA sequences are safe, how autonomous vehicles should navigate cities, or detecting bad thoughts in huge transformer neural nets. Someday these methods might help with some of those, but those aren’t the low hanging fruit we are proposing. In some sense we mainly want to use proof for much more mundane things. What Max and I are arguing for are mechanisms to create software, hardware, and social designs which aren’t exploitable by adversarial AIs and to create infrastructure that provides guarantees about its state and behavior. Nothing we are proposing requires sophisticated mathematics that today’s grad students couldn’t do. Nothing requires new physics or radically new engineering principles. Rather, it is a way to organize current technologies to increase trust and eliminate vulnerabilities.
These technologies enable us to eliminate the need to trust third parties: Was a computation performed accurately? Were there bugs in the program? What data was used to train this model or estimate this probability? What probabilistic program or neural net was used? Was the training done correctly? What is the history of this component? What evidence is there that it was manufactured correctly? These and thousands more cases will enable us to build up a robust infrastructure which is provably not vulnerable to AI-driven attack.
A core aspect of this is that we can use untrusted powerful AIs running on untrusted datacenters in untrusted countries to help us build completely trusted software, hardware, and social protocols. The idea is to precisely specify a task (eg. a software spec, a hardware spec, a mathematically encoded problem, etc.) and have the untrusted AI generate both an answer and a proof (in a system like Lean) that the answer solves the precisely specified problem or design task. We can cheaply and completely reliably check the proof. If it verifies, then we can fully trust the results from the untrusted AI. This enables us to bootstrap from the current mess of untrusted and unreliable AIs, flaky and insecure hardware, untrustable people and groups, etc. to build up a *fully* trustable infrastructure. The power and importance of this is immense!
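Here is a stripped-down sketch of that pattern. The task is a stand-in (in the actual proposal the specification would be formal and the certificate a Lean proof term checked by a small trusted kernel), but the division of labor is the same: the untrusted side does arbitrarily complex search, and the trusted side only runs a cheap check against the precise specification.

```python
# Sketch of the "untrusted prover, cheap trusted checker" pattern. The task and
# numbers are stand-ins; the point is that trust comes from the check, not from
# whoever (or whatever) produced the answer.

def spec_satisfied(assignment):
    """The precise specification: what counts as a correct answer."""
    x, y = assignment["x"], assignment["y"]
    return x > 0 and y > 0 and x * y == 91 and x + y == 20

def untrusted_ai_solve():
    """Arbitrarily complex, possibly adversarial search we never need to trust."""
    for x in range(1, 100):
        for y in range(1, 100):
            if spec_satisfied({"x": x, "y": y}):
                return {"x": x, "y": y}   # here the answer doubles as its own certificate
    return None

answer = untrusted_ai_solve()
# The trusted side only runs the cheap check against the spec.
if answer is not None and spec_satisfied(answer):
    print("verified:", answer)            # safe to rely on, regardless of who produced it
else:
    print("rejected: no certificate of correctness")
```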
Here’s the intuition that’s making me doubt the utility of provably correct system design for avoiding bridge crashes:
I model the process leading up to a ship that doesn’t crash into a bridge as having many steps.
1. Marine engineers produce a design for a safe ship
2. Management signs off on the design without cutting essential safety features
3. Shipwrights build it to spec without cutting any essential corners
4. The ship operator understands and follows the operations and maintenance manuals, without cutting any essential corners
5. Nothing out-of-distribution happens over the lifetime of the ship.
And to have a world where no bridges are taken out by cargo ships, repeat that 60,000 times.
It seems to me that provably safe design can help with step 1, but it’s not clear to me that step 1 is where the fault happened with the Francis Scott Key Bridge. Engineers can and do make bridge-destroying mistakes (I grew up less than 50 miles from the Tacoma Narrows Bridge), but that feels rare to me compared to problems in the other steps: management does cut corners, builders do slip up, and O&M manuals do get ignored.
With verifiable probabilities of catastrophe, maybe a combination of regulation and insurance could incentivize makers and operators of ships to operate safely—but insurers already employ actuaries to estimate the probability of catastrophe, and it’s not clear to me that the premiums charged to the MV Dali were incorrect. As for the Francis Scott Key, I don’t know how insuring a bridge works, but I believe most of the same steps and problems apply.
(Addendum: The new Doubly-Efficient Debate paper on Google’s latest LW post might make all of these messy principal-agent human-corrigibility type problems much more tractable to proofs? Looks promising.)
I totally agree in today’s world! Today, we have management protocols which are aimed at requiring testing and record keeping to ensure that boats and ships are in the state we would like them to be. But these rules are subject to corruption and malfeasance (such as the 420 Boeing jets which incorporated defective parts and yet are currently flying with passengers: https://doctorow.medium.com/https-pluralistic-net-2024-05-01-boeing-boeing-mrsa-2d9ba398bd54 )
But it appears we are rapidly moving to a world in which much of the physical labor will be done by robots and in which each physical system will have a corresponding “digital twin” (eg. https://www.nvidia.com/en-us/omniverse/solutions/digital-twins/ ).
In that world, we can implement provable formal rules governing every system, from raw materials, to manufacture, to supply chain, to operations, and to maintenance.
In an AI world, much more sophisticated malfeasance can occur. Formal models of domains, with proofs of adherence to rules and protection against adversaries, are the only way to ensure our systems are safe and effective.