khafra comments on Limitations on Formal Verification for AI Safety

khafra 22 Aug 2024 10:49 UTC
2 points
0
That example seems particularly hard to ameliorate with provable safety. To focus on just one part, how could we prove the ship would not lose power long enough to crash into something? If you try to model the problem at the level of basic physics, it’s obviously impossible. If you model it at the level of a circuit diagram, it’s trivial: power sources on circuit diagrams do not experience failures. There’s no obviously correct model granularity; there are Schelling points, but what if threats to the power supply do not respect our Schelling points?
It seems to me that, at most, we could prove safety of a modeled power supply, against a modeled, enumerated range of threats. Intuitively, I’m not sure that compares favorably to standard engineering practices, optimizing for safety instead of for lowest-possible cost.
- Steve_Omohundro 22 Aug 2024 17:34 UTC
  10 points
  2
  Parent
  In general, we can’t prevent physical failures. What we can do is to accurately bound the probability of them occurring, to create designs which limit the damage that they cause, and to limit the ability of adversarial attacks to trigger and exploit them. We’re advocating for humanity’s entire infrastructure to be upgraded with provable technology to put guaranteed bounds on failures at every level and to eliminate the need to trust potentially flawed or corrupt actors.
  In the case of the ship, there are both questions about the design of that ship’s components and its provenance. Why did the backup power not enable the propulsion system to stop? Why wasn’t there a “failsafe” anchor which drops if the systems become inoperable? Why didn’t the port have tugboats guiding risky ship departures? What was the history of that ship’s generators? Etc. With the kind of provable technology that Max and I outlined, it is possible to have provably trustable data about the components of the ship, about their manufacture, about their provenance since manufacture, about the maintenance history of the ship’s components, etc.
  The author of the main post and other critics argue against formal methods doing complex “magical” things like determining which DNA sequences are safe, how autonomous vehicles should navigate cities, or detecting bad thoughts in huge transformer neural nets. Someday these methods might help with some of those, but those aren’t the low-hanging fruit we are proposing. In some sense we mainly want to use proof for much more mundane things. What Max and I are arguing for are mechanisms to create software, hardware, and social designs which aren’t exploitable by adversarial AIs and to create infrastructure that provides guarantees about its state and behavior. Nothing we are proposing requires sophisticated mathematics that today’s grad students couldn’t do. Nothing requires new physics or radically new engineering principles. Rather, it is a way to organize current technologies to increase trust and eliminate vulnerabilities.
  These technologies enable us to eliminate the need to trust third parties: Was a computation performed accurately? Were there bugs in the program? What data was used to train this model or estimate this probability? What probabilistic program or neural net was used? Was the training done correctly? What is the history of this component? What evidence is there that it was manufactured correctly? These and thousands more cases will enable us to build up a robust infrastructure which is provably not vulnerable to AI-driven attack.
  A core aspect of this is that we can use untrusted powerful AIs running on untrusted datacenters in untrusted countries to help us build completely trusted software, hardware, and social protocols. The idea is to precisely specify a task (e.g., a software spec, a hardware spec, a mathematically encoded problem, etc.) and have the untrusted AI generate both an answer and a proof (in a system like Lean) that the answer solves the precisely specified problem or design task. We can cheaply and completely reliably check the proof. If it verifies, then we can fully trust the results from the untrusted AI. This enables us to bootstrap the current mess of untrusted and unreliable AIs, flaky and insecure hardware, untrustable people and groups, etc. to build up a *fully* trustable infrastructure. The power and importance of this is immense!
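
  A minimal sketch of that generate-then-check pattern in Lean 4, using a deliberately tiny task: find a nontrivial factorization of 91. The theorem name `factor_91` and the task itself are illustrative assumptions, not part of the proposal. The theorem statement plays the role of the spec, the witnesses `7` and `13` are the answer an untrusted solver might return, and `by decide` is the proof that Lean checks.

  ```lean
  -- Spec: there exist a, b > 1 with a * b = 91.
  -- Answer: the witnesses 7 and 13 (as if supplied by an untrusted solver).
  -- Proof: `decide` evaluates the decidable proposition; Lean checks the result.
  theorem factor_91 : ∃ a b : Nat, 1 < a ∧ 1 < b ∧ a * b = 91 :=
    ⟨7, 13, by decide⟩
  ```

  If the solver had returned wrong witnesses such as `7` and `12`, the file would fail to check, so trust rests on the checker and on the spec being the right one, not on whoever produced the answer.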
  - khafra 27 Aug 2024 6:05 UTC
    2 points
    0
    Parent
    Here’s the intuition that’s making me doubt the utility of provably correct system design to avoiding bridge crashes:
    I model the process leading up to a ship that doesn’t crash into a bridge as having many steps.
    
    1. Marine engineers produce a design for a safe ship
    2. Management signs off on the design without cutting essential safety features
    3. Shipwrights build it to spec without cutting any essential corners
    4. The ship operator understands and follows the operations and maintenance manuals, without cutting any essential corners
    5. Nothing out-of-distribution happens over the lifetime of the ship.
    
    And to have a world where no bridges are taken out by cargo ships, repeat that 60,000 times.
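
    As a rough illustration of how this compounds, assume (purely for the sake of arithmetic) that step $i$ succeeds with probability $p_i$, independently across steps and across $N$ ships. The symbols and the sample number below are assumptions, not estimates of real shipping risk.

    ```latex
    P(\text{no bridge strike across the fleet})
      = \Bigl(\prod_{i=1}^{5} p_i\Bigr)^{N}, \qquad N = 60{,}000
    % Illustrative only: if \prod_i p_i = 0.9999 for each ship, then
    0.9999^{60000} \approx e^{-6} \approx 0.0025
    ```

    Under those assumptions, even a 99.99% per-ship success rate leaves well under a 1% chance that the whole fleet gets through without a strike, so every step matters, not only the design step.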
    
    It seems to me that provably safe design can help with step 1--but it’s not clear to me that step 1 is where the fault happened with the Francis Scott Key bridge. Engineers can and do make bridge-destroying mistakes (I grew up less than 50 miles from the Tacoma Narrows bridge), but that feels rare to me compared to problems in the other steps: management does cut corners, builders do slip up, and O&M manuals do get ignored.
    
    With verifiable probabilities of catastrophe, maybe a combination of regulation and insurance could incentivize makers and operators of ships to operate safely—but insurers already employ actuaries to estimate the probability of catastrophe, and it’s not clear to me that the premiums charged to the MV Dali were incorrect. As for the Francis Scott Key, I don’t know how insuring a bridge works, but I believe most of the same steps and problems apply.
    
    (Addendum: The new Doubly-Efficient Debate paper on Google’s latest LW post might make all of these messy principal-agent human-corrigibility type problems much more tractable to proofs? Looks promising.)
    - Steve_Omohundro 28 Aug 2024 18:14 UTC
      9 points
      2
      Parent
      I totally agree in today’s world! Today, we have management protocols which are aimed at requiring testing and record keeping to ensure that boats and ships are in the state we would like them to be in. But these rules are subject to corruption and malfeasance (such as the 420 Boeing jets which incorporated defective parts and yet which are currently flying with passengers: https://doctorow.medium.com/https-pluralistic-net-2024-05-01-boeing-boeing-mrsa-2d9ba398bd54 ).
      But it appears we are rapidly moving to a world in which much of the physical labor will be done by robots and in which each physical system will have a corresponding “digital twin” (e.g., https://www.nvidia.com/en-us/omniverse/solutions/digital-twins/ ).
      In that world, we can implement provable formal rules governing every system, from raw materials, to manufacture, to supply chain, to operations, and to maintenance.
      In an AI world, much more sophisticated malfeasance can occur. Formal models of domains, with proofs of adherence to rules and of protection against adversaries, are the only way to ensure our systems are safe and effective.