Steve_Omohundro comments on Limitations on Formal Verification for AI Safety

Steve_Omohundro 28 Aug 2024 18:35 UTC
5 points
2
I agree that would be better than what we usually have now! And is more in the “Swiss Cheese” approach to security. From a practical perspective, we are probably going to have do that for some time: components with provable properties combined in unproven ways. But every aspect which is unproven is a potential vulnerability.
The deeper question is whether there are situations where it has to be that way. Where there is some barrier to modeling the entire system and formally combining correctness and security properties of components to obtain them for the whole system.
Certainly there are hardware and software components whose detailed behavior is computationally complex to predict in advance (eg. searching for solutions to SAT problems or inverting hash functions). So you are unlikely to be able to prove theorems like “For every specification of these n bits in this SAT problem, it will take f(n) time to discover a satisfying value for the remaining bits”. But that’s fine! It’s just that you shouldn’t make the correctness or security of your system depend on that! For example, you might put a time bound on the search and have a failsafe path if it doesn’t succeed by then. That software does have a provable time bound.
So, in general, systems need to be designed to be correct and safe. If you can’t put provable bounds on the safety of a system, then I would argue that you have no business exposing the public to that system.
It would be great to start collecting examples of subcomponents or compositional designs which are especially difficult to prove properties about. My sense is that virtually all of these formal analyses will be done by AIs and not by humans. And I think it will be important to develop libraries and models of problems which are easy to solve formally and provide formal guarantees about. And those which are more difficult.
Thinking about the software verification case, I would argue that every decently written piece of software today, the programmer has an internal argument in their head as to why it is correct and not vulnerable to attacks. Humans are fallible, so their argument may not be correct. But if it is correct, then it shouldn’t be difficult to formalize it into a precise formal proof. The “de Bruijn Factor” (https://www.cs.ru.nl/~freek/factor/factor.pdf ) measures how much bigger a formal proof of something is than an informal description. It seems to be between 4 and 10 in current formal systems. So, if a human programmer has confidence in the correctness and security of his code, it should only be a small factor more work for an AI to formally prove that. If the programmer doesn’t have that confidence, then I think we have no business deploying it anywhere it might harm humans.
- Steve_Omohundro 28 Aug 2024 20:39 UTC
  3 points
  2
  Parent
  Thinking more about the question of are there properties which we believe but for which we have no proof. And what do we do about those today and in an intended provable future?
  I know of a few examples, especially in cryptography. One of the great successes of theoretical cryptography was the reduction the security of a whole bunch of cryptographic constructs to a single one: the existence of one way functions which are cheap to compute but expensive to invert: https://en.wikipedia.org/wiki/One-way_function That has been a great unifying discovery there and the way the cryptographers deal with it is that they just add the existence of one way functions as an extra axioms to their formal system! It does mean that if it turns out not to be true, then a lot of their proven secure system may actually not be. Fortunately, there is “Information-Theoretic Cryptography” https://www.cambridge.org/us/universitypress/subjects/engineering/communications-and-signal-processing/information-theoretic-cryptography?format=HB which doesn’t rely on any unproven assumptions but is somewhat more inconvenient to use.
  Then there’s “Public Key Cryptography” which rests on much dicier assumptions (such as factoring being hard). We already know that a bunch of those assumptions no longer hold in quantum computation but NIST recently announced 3 “post-quantum” standards https://csrc.nist.gov/projects/post-quantum-cryptography but I think there is still quite a lot of worry about them
  More generally, mathematicians often have statements which they believe to be true (eg. P!=NP, the Riemann hypothesis, and others: https://en.wikipedia.org/wiki/List_of_unsolved_problems_in_mathematics On what evidence do mathematicians believe these statements? What happens if they use them in proofs of other statements?
  Timothy Gowers wrote an insightful essay into these questions: “What Makes Mathematicians Believe Unproved Mathematical Statements?” https://www.semanticscholar.org/paper/What-Makes-Mathematicians-Believe-Unproved-Gowers/b17901cece820de845e57456eda06f892b5ba199
  What does this mean for provable safety? One can always add any beliefs one likes to one’s axioms and prove things from there! If the axioms you have added turn out to be false, that can undo the guarantees about anything which depended on them. That suggests that one should try to limit unproven assumptions as much as possible! And it also seems to be a great strategy to follow the cryptographers and try to capture the essence of an assumption in a “standard” assumption whose validity can be tested in a variety of contexts.
  Physics aims to have a precise fundamental theory and to mathematically derive the consequences of that theory to explain all phenomena. From a mathematical point of view, physicists are notoriously “mathematically sloppy” using techniques which may often give the right answer but which may not be provable (eg. different perturbation methods, renormalization, path integrals, etc.) But, fortunately, more mathematically inclinded physicists have repeatedly come along afterwards and created precise formal models within which the physics derivations are justified.
  Two big leaps in the tower of physics models are from deterministic quantum equations (eg. Schrodinger’s equation) to statistical representations of measurements and from particle descriptions of matter (eg. an ideal gas) to statistical mechanics representations based on probabilities. Huge literatures explore the meaning and character of those two leaps but my sense is that in general we can’t yet formally justify them, but that they are extremely well justified empirically. Physicists call these the assumptions of “Quantum decoherence” https://en.wikipedia.org/wiki/Quantum_decoherence and “Stosszahlansatz” or “Molecular chaos” https://en.wikipedia.org/wiki/Molecular_chaos
  How do we deal with them formally? I think we have to do what physicists do and just add those assumptions as axioms to the formal models. This is risky in an adversarial context. We have to watch out for powerful adversaries (ie. powerful AIs) which can control matter at a level which enables them to violate these assumptions. Doesn’t seem likely to me, but we must be ever vigilant!
  Something I would like to do in these situations but I don’t think we have the philosophical underpinnings for is to have precise provable estimates of the probabilities of these being true or false. Gowers makes some attempt at that but I’m not sure it’s formal yet. It’s a bit weird, it would be a probability for something which is objectively either true or false. So it would be a measure of our knowledge of the state. But it would be valuable for AI safety to have a justifiable measure for how much we need to worry about an adversary being able to violate our assumptions. And, ultimately, our current laws of physics are of this character. It would be great to have a precise measure of our confidence in various physical properties like symmetries (eg. time-invariance, translation invariance, rotational invarianc, etc.), conservation laws (mass/energy, momentum, lepton-number, etc.), etc.