One overriding perspective in Max’s and my approach is that we need to design our systems so that their safety can be formally verified. In software, people often bring up the halting problem as an argument that general software can’t be verified. But we don’t need to verify general software; we are designing our systems so that they can be verified.
I am a great proponent of proof-carrying code that is designed and annotated for ease of verification as a direction of development. But even from that starry-eyed perspective, the proposals that Andrew argues against here seem wildly unrealistic.
A proof-carrying piece of C code can prove that it is not vulnerable to any buffer overflows, or that it will never run motor 1 and motor 2 simultaneously in opposite directions. A bit more ambitiously, it could contain a full proof of a complete behavioral specification, proving that the software will perform a certain set of behaviors at all times, which as a corollary also implies that it is proof against a large body of security threats. This is not something we can really manage in practice yet, but it’s well within the extrapolated range of current techniques. We could build a compiler that will only compile software that contains a proof matching some minimum necessary specification, too.
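To make that concrete, here is a rough sketch of what such annotations look like in ACSL, the contract language used by tools like Frama-C. The motor interface and the function names are invented purely for illustration; the point is only that both properties are local, checkable statements about this code:

```c
#include <stddef.h>

/* Hypothetical motor state, for illustration only: -1, 0, or +1. */
int motor1_dir = 0;
int motor2_dir = 0;

/*@ requires \valid(dst + (0 .. n-1));
    requires \valid_read(src + (0 .. n-1));
    assigns  dst[0 .. n-1];
*/
void copy_buf(char *dst, const char *src, size_t n)
{
    /*@ loop invariant 0 <= i <= n;
        loop assigns i, dst[0 .. n-1];
        loop variant n - i;
    */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];   /* every access stays inside the ranges required above */
}

/*@ requires dir == -1 || dir == 0 || dir == 1;
    assigns  motor1_dir;
    ensures  !(motor1_dir != 0 && motor1_dir == -motor2_dir);
*/
void set_motor1(int dir)
{
    /* Refuse a command that would oppose motor 2; a real controller
       would report an error rather than silently dropping it. */
    if (dir != 0 && dir == -motor2_dir)
        dir = 0;
    motor1_dir = dir;
}
```

Both obligations here are stated entirely in terms of the program’s own variables and memory, which is exactly what makes them tractable.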
Now imagine trying to prove that my program doesn’t perform any sequence of perfectly innocuous interactions that happens to trigger a known or unknown bug in a mysql 5.1.32 server on the internet somewhere. How would you specify that? You can specify that the program doesn’t do anything that might affect anything ever (though this leads to the well-known boxing problems); but if I want my program to have some nonzero ability to affect things in non-malicious ways, how would you specify that it doesn’t do anything that might break something in some other part of the world in some unexpected way, including any unknown zero-days in mysql 5.1.32 servers? Presumably my proof checker doesn’t contain the full code of all other systems on the internet my software might plausibly interact with. How could I specify a requirement like this, let alone prove it?
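To show the contrast, here is roughly what the two halves look like. The first half, a purely local restriction on what my program emits, can be written as an ordinary contract (a sketch only; send_message is a made-up function). The second half I can only write as an informal comment, because I know of no annotation language it would fit in:

```c
#include <stddef.h>

/* Specifiable: a local property of the bytes this program sends,
   e.g. "printable ASCII only". Illustrative, not a real protocol spec. */
/*@ requires \valid_read(msg + (0 .. n-1));
    requires \forall integer i; 0 <= i < n ==> 0x20 <= msg[i] <= 0x7e;
    assigns  \nothing;
*/
void send_message(const char *msg, size_t n);

/* Not specifiable as a contract on this program: roughly,

     for every server S reachable on the internet,
     for every internal state of S,
     for every sequence of allowed messages m1..mk I might send,
       running S on m1..mk never reaches a corrupted or exploited state.

   S ranges over code I do not have (including that mysql 5.1.32 box),
   so this is not a predicate over my program at all; it quantifies over
   an open-ended environment that my proof checker never sees. */
```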
Proving that a piece of software has the behavior I want, or the behavior I allow, is something that can be done by carefully annotating my code with lemmas and contracts and incremental assumptions that together build up to a proof of the behavior I want. Proving that the software will have the behavior I want no matter what conditions you could possibly throw at it sounds harder but is actually mostly the same problem—and so I would expect that proofs of almost-perfect security of a piece of software would not be that difficult either. But that is security of the software against attacks that might threaten the desired behavior of the software. Demonstrating that the software is not a threat to something else somewhere is another matter entirely, as this requires first knowing and encoding in the proofs all the ways in which the rest of the world might be negatively affected by actions-in-general. Not just superficially, either, if you want the proof to rule out the mysql 5.1.32 zero-day that just uses a sequence of normal interactions that should be perfectly innocent but in practice aren’t. This is proving a negative in the worst possible way; to prove something like this, you would need to quantify over all possible higher-order behaviors the program doesn’t have. I don’t see how any of the real-or-imagined formal verification techniques I have ever heard about could possibly do anything even vaguely like this.
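As a rough illustration of what that “building up” looks like in practice (again with invented names, and using function contracts rather than explicit lemma clauses): each contract below is a small local obligation, and the caller’s postcondition is discharged from the callee’s contract instead of by re-analyzing its body.

```c
#include <stddef.h>

/*@ requires lo <= hi;
    assigns  \nothing;
    ensures  lo <= \result <= hi;
*/
int clamp(int x, int lo, int hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/*@ requires \valid(out + (0 .. n-1));
    requires \valid_read(in + (0 .. n-1));
    assigns  out[0 .. n-1];
    ensures  \forall integer i; 0 <= i < n ==> -100 <= out[i] <= 100;
*/
void clamp_all(int *out, const int *in, size_t n)
{
    /*@ loop invariant 0 <= i <= n;
        loop invariant \forall integer k; 0 <= k < i ==> -100 <= out[k] <= 100;
        loop assigns i, out[0 .. n-1];
        loop variant n - i;
    */
    for (size_t i = 0; i < n; i++)
        out[i] = clamp(in[i], -100, 100);   /* clamp's ensures re-establishes the invariant */
}
```

Notice that everything being proved is a property of states this program itself reaches; nothing quantifies over the rest of the world.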
All the above is the world of software versus software, where even the external environment can be known to the last bit, with crisp behavior that can be analyzed and perfectly simulated even to the point where you can reproduce the mysql 5.1.32 bugs. Doing the same thing with biology would be a whole other order of magnitude in challenge level. Proving that a certain piece of synthesized DNA is not going to form a threat to human biology in some indirect way is essentially analogous to the mysql exploitation problem above, but much, much harder. Here, too, you would need to have a proof that quantifies over all possible higher order behaviors the piece of DNA doesn’t have, all applied to the minor inconvenience that is the much murkier world of biology.
Unless I missed some incredible new developments in the field of formal verification, quantification over all possible higher order patterns and consequences of a system being analyzed is something that is absolutely out of reach of any formal verification techniques I have heard of. But if you do know of any techniques that could challenge this barrier, do tell :)