Davidmanheim comments on Limitations on Formal Verification for AI Safety

Davidmanheim 8 Sep 2024 11:30 UTC
LW: 5 AF: 2
3
AF
I have a lot more to say about this, and think it’s worth responding to in much greater detail, but I think that overall, the post criticizes Omhundro and Tegmark’s more extreme claims somewhat reasonably, though very uncharitably, and then assumes that other proposals which seem to be related, especially Dalyrymple et al. approach, are essentially the same, and doesn’t engage with the specific proposal at all.
To be very specific about how I think the post in unreasonable, there are a number of places where a seeming steel-man version of the proposals are presented, and then this steel-manned version, rather than the initial proposal for formal verification, is attacked. But this amounts to a straw-man criticism of the actual proposals being discussed!

For example, this post suggests that arbitrary DNA could be proved safe by essentially impossible modeling (“on-demand physical simulations of entire human bodies (with their estimated 36 trillion cells [9]), along with the interactions between the cells themselves and the external world and then run those simulations for years”). This is true, that would work—but the proposal ostensibly being criticized was to check narrower questions about whether DNA synthesis is being used to produce something harmful. And Dalyrmple et al explained explicitly what they might have included elsewhere in the paper (“Examples include machines provably impossible to login to without correct credentials, DNA synthesizers that provably cannot synthesize certain pathogens, and AI hardware that is provably geofenced, time-limited (“mortal”) or equipped with a remote-operated throttle or kill-switch. Provably compliant sensors can be specified to ensure “zeroization”, in which tampering with PCH is guaranteed to cause detection and erasure of private keys.”)
- Steve_Omohundro 8 Sep 2024 16:33 UTC
  8 points
  1
  Parent
  People seem to be getting the implication we intended backwards. We’re certainly not saying “For any random safety property you might want, you can use formal methods to magically find the rules and guarantee them!” What we are saying is “If you have a system which actually guarantees a safety property, then there is formal proof of that and there are many many benefits to making that explicit, if it is practical.” We’re not proposing any new physics, any new mathematics, or even any new AI capabilities other than further development of current capabilities in theorem proving, verified software synthesis, and autoformalization.
  Humanity is in a crisis right now! Even without AI we have disasters like the $5 billion CrowdStrike flaw a few weeks ago, many many cyberattacks disabling critical systems, Boeing airplanes falling apart in the sky, etc. As open source AI cyberattack models advance, every one of today’s flaws and security holes are likely to be exploited by a wide variety of malicious actors. We have clear descriptions of desired safety properties for many of today’s issues and we have design rules which have been developed to mitgate the problems. But today’s software development and engineering practices aren’t leading to the safety we need! Software engineers wing it and introduce flaws in their systems and security holes are left unfixed for decades. Boeing outsources the manufacture of jet components to underpaid and unsupervised third parties.
  By creating explicit formal representations of these rules, and formal “digital twins” of our systems, we can obtain guarantees that they are being followed. Our papers and my talks describe many techniques for doing this. But we are just scratching the surface. We need many many more people thinking about how to achieve actual safety in the face of today’s and tomorrow’s AIs.
  I’m a little bit shocked at the low level of curiosity and creativity I’m seeing in these discussions. I have not seen any other proposals for actually dealing with the actual problems that real AI systems are likely to cause in the next few years. Attempts at alignment, red teaming, etc. are all very interesting and valuable but do nothing to address the hundreds of millions of open source AIs which are almost as powerful as the frontier closed models and which are easily fine tuned for cyberattack and other harmful actions.
  The general approach we have outlined is not optional! Godel’s completeness theorem tells us that there is a formal proof of any safety property which actually holds for a system. For complex systems, they are very unlikely to be accidentally safe. So designing and building them with explicit formal models will be necessary. If humans don’t do it, then powerful AIs certainly will. But, in that case, we may have very little input into exactly what properties will be guaranteed.
  Our proposal is not simple! But it doesn’t rely on any exotic new ideas. It does require creativity and dedication to actually work to ensure that humanity survives the next decade. There would be tremendous benefit in simply precisely encoding today’s safety criteria and in creating systems that guarantee that those criteria are followed. With some future creativity we can go further and use these techniques to develop much more accurate safety criteria and better designs. I would have expected huge interest and forward movement on this. I’m sad to not see much of that so far.