I am a big fan of verification in principle, but don’t think it’s feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I’d be happy to make bets that
(3) won’t happen (with nontrivial bounds) due to the difficulty of modeling nondeterministic GPUs, varying floating-point formats, etc.
(8) won’t be attempted, or will fail at some combination of design, manufacture, or just-being-pickable.
with some mutually agreed operationalization, against the success of other listed ideas with a physical component.
I’m afraid I’m not into betting, and I’m not saying these projects will happen, but rather that they should happen. And I’m not saying they are easy, but rather that they are important, and I don’t see any huge barriers to doing them.
In general, non-deterministic hardware isn’t a problem for formalization. We’re not trying to simulate these systems; we’re trying to put bounds on their possible behaviors. Proofs work with enclosures that are guaranteed to contain the actual behavior. As far as I know, the various floating-point formats are all precisely defined, so there’s no problem there. I know that some processor specs stated that their implementation of functions like sin and cos was only guaranteed up to the last bit or so. That’s an annoyance but not a real problem for formalization. A simple approach is to maintain intervals which are guaranteed to contain the actual values and to prove that the output intervals don’t overlap the unsafe region.
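As a minimal sketch of that approach (the function, input range, and “unsafe” threshold below are made up for illustration, and a rigorous version would also round its own bound computations outward):

```python
# Minimal sketch of the enclosure idea: propagate an interval guaranteed to
# contain the true value, then check that the output interval avoids the
# unsafe region. The function f and the threshold are invented; a rigorous
# version would round these bound computations outward as well.

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def interval_mul(a, b):
    corners = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(corners), max(corners))

def f_enclosure(x):
    # Enclosure of f(x) = x*x + x over the input interval x = (lo, hi).
    return interval_add(interval_mul(x, x), x)

x = (0.0, 0.5)               # every value the input could actually take
lo, hi = f_enclosure(x)
UNSAFE_THRESHOLD = 1.0       # outputs >= 1.0 count as unsafe (made up)
print((lo, hi), hi < UNSAFE_THRESHOLD)   # (0.0, 0.75) True: provably safe here
```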
Probabilistic programs do provide greater challenges. It’s not a problem to formally model and prove properties of probability distributions. But it’s often no longer possible to provide 100% guarantees of desired safety properties. The approach of precise probabilistic theories like “Probably approximately correct learning” (https://en.wikipedia.org/wiki/Probably_approximately_correct_learning) is to provide a provable bound on the probability of failure.
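For concreteness, here is the shape of the standard sample-complexity bound for a finite hypothesis class in the realizable PAC setting (the particular numbers below are arbitrary):

```python
# Standard PAC bound (finite hypothesis class, realizable case): with
# m >= (1/eps) * (ln|H| + ln(1/delta)) samples, a learner that outputs any
# hypothesis consistent with the data has true error <= eps with probability
# >= 1 - delta.
import math

def pac_sample_bound(hypothesis_count, eps, delta):
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / eps)

# e.g. a million hypotheses, 1% error, one-in-a-billion failure probability:
print(pac_sample_bound(10**6, eps=0.01, delta=1e-9))   # 3454 samples
```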
There is a subtlety in adversarial situations: there is a difference between uncertainty the adversary can’t affect, for which an ordinary probabilistic analysis suffices, and apparently random variables the adversary can influence. For example, I believe we need to create formal models of all of the engineering disciplines (e.g. mechanical engineering, electrical engineering, chemical engineering, etc.). The rules that those engineers use are already quite formal, and there are heuristic arguments as to why they are grounded in fundamental physics. But in an adversarial context, an adversary may be able to influence a deeper level of physics in a way that breaks the assumptions of the model. For example, mechanical engineers have models of the elastic properties of steel beams, but if an adversary can alter the manufacture of those beams in a way that makes them vulnerable to certain kinds of failure, then the mechanical engineer’s assumptions can be invalidated. So the formal models also need to model potential vulnerabilities in the assumptions of the higher-level models.
I believe that these developments are not optional. It’s just a question of whether humanity prepares now for what’s coming or waits until after the basic assumptions of our infrastructure and society are upended.
A simple approach is to maintain intervals which are guaranteed to contain the actual values and to prove that output intervals don’t overlap the unsafe region.
For actual inference stacks in use (e.g. llama-3-405b with 8-bit floats), interval propagation will blow up massively and result in vacuous bounds. So, you’ll minimally have to assume that floating-point approximation error is non-adversarial, i.e. effectively random (because if it actually were adversarial, you would probably die!).
(My supporting argument for this is that you see huge blow-ups in interval arguments for even tiny models on toy tasks[1], and that modern inference stacks have large amounts of clearly observable floating-point non-determinism, which means that the maximum amount of deviation would actually have strong effects if all errors (including across token positions) were controlled by an adversary.)
Though note that the interval propagation issues in this case aren’t from floating point rounding and are instead from imperfections in the model weights implementing the algorithm.
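(As a toy illustration of the general mechanism of blow-up with depth, not the specific experiments referenced above: naive interval propagation through a stack of random linear layers already widens rapidly.)

```python
# Toy illustration of interval blow-up with depth: naive interval propagation
# through random dense linear layers. No claim that this matches a real
# inference stack; it only shows the general mechanism.
import random

def propagate_linear(intervals, weights):
    # Interval matrix-vector product: each bound takes the worst case over the
    # input intervals independently, ignoring all correlations between them.
    out = []
    for row in weights:
        lo = sum(min(w * a, w * b) for w, (a, b) in zip(row, intervals))
        hi = sum(max(w * a, w * b) for w, (a, b) in zip(row, intervals))
        out.append((lo, hi))
    return out

random.seed(0)
n, depth = 64, 20
x = [(v - 0.01, v + 0.01) for v in (random.uniform(-1, 1) for _ in range(n))]
for layer in range(depth):
    W = [[random.gauss(0, 1 / n ** 0.5) for _ in range(n)] for _ in range(n)]
    x = propagate_linear(x, W)
    print(layer, max(hi - lo for lo, hi in x))   # width grows ~exponentially
```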
Intervals are often a great simple form of “enclosure” in continuous domains. For simple functions there is also “interval arithmetic,” which cheaply produces a bounding interval on the output of a function given intervals on its inputs (https://en.wikipedia.org/wiki/Interval_arithmetic). But, as you say, for complex functions it can blow up. For a simple example of why, consider the function “f(x)=x-x” evaluated on the input interval [0,1]. In the simplest interval arithmetic, the interval for subtraction has to bound the worst possible members of the intervals of its two inputs. In this case that would be a lower bound of “0-1” and an upper bound of “1-0”, producing the resulting interval [-1,1]. But, of course, “x-x” is always 0, so this is a huge over-approximation. People have developed all kinds of techniques for capturing the correlations between variables when evaluating circuits on intervals. And you can always shrink the error by splitting the input intervals and doing “branch and bound”. But all of those are just particular implementation choices in proving bounds on the output of the function. Advanced AI theorem provers (like AlphaProof) can use very sophisticated techniques to accurately get the true bound on the output of a function.
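In code, the “x-x” example and the effect of splitting look like this (pure illustration):

```python
# The f(x) = x - x example: naive interval subtraction treats the two
# occurrences of x as independent, so it returns [-1, 1] instead of the true
# value 0. Splitting the input interval (a crude branch and bound) shrinks
# the over-approximation but never removes it.

def interval_sub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def f_naive(x):
    return interval_sub(x, x)

print(f_naive((0.0, 1.0)))            # (-1.0, 1.0): huge over-approximation

def f_split(x, depth):
    # Bisect the input and take the hull (union) of the sub-results.
    if depth == 0:
        return f_naive(x)
    mid = (x[0] + x[1]) / 2
    left = f_split((x[0], mid), depth - 1)
    right = f_split((mid, x[1]), depth - 1)
    return (min(left[0], right[0]), max(left[1], right[1]))

print(f_split((0.0, 1.0), depth=5))   # (-0.03125, 0.03125): tighter, not exact
```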
But, it may be that it’s not a fruitful approach to try to bound the behavior of complex neural nets such as transformers. In our approach, we don’t need to understand or constrain a complex AI generating a solution or a control policy. Rather, we require the AI to generate a program, control policy, or simple network for taking actions in the situation of interest. And we force it to generate a proof that it satisfies given safety requirements. If it can’t do that, then it has no business taking actions in a dangerous setting.
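As a toy sketch of that workflow (the dynamics, the candidate policy, and the safety property are all invented, and exhaustive checking over a tiny state space stands in for a real proof checker):

```python
# Toy sketch of the "artifact plus certificate" workflow: an untrusted
# generator proposes a simple control policy, and a small trusted checker
# verifies a safety invariant before the policy is allowed to act. Here the
# "proof" is exhaustive checking of a tiny discrete state space; a real
# system would check a machine-verifiable proof instead.

STATES = range(0, 101)                   # tank level 0..100 (made up)

def unsafe(level):
    return level > 95                    # overflow is the unsafe region

def step(level, drain):
    # Made-up dynamics: constant inflow of 3 per step; the policy sets drain.
    return max(0, min(100, level + 3 - drain))

def proposed_policy(level):
    # Candidate policy produced by the untrusted generator.
    return 5 if level > 80 else 0

def certify(policy):
    # Trusted check of the inductive invariant: from every safe state,
    # one step under the policy lands in a safe state.
    return all(not unsafe(step(s, policy(s))) for s in STATES if not unsafe(s))

print("admitted" if certify(proposed_policy) else "rejected")   # admitted
```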
Rather, we require the AI to generate a program, control policy, or simple network for taking actions in the situation of interest. And we force it to generate a proof that it satisfies given safety requirements. If it can’t do that, then it has no business taking actions in a dangerous setting.
This seems near certain to be cripplingly uncompetitive[1] even with massive effort on improving verification.
I agree you can do better than naive interval propagation by taking into account correlations. However, it will be tricky to get a much better bound without having the time complexity balloon (tracking all possible correlations requires exponential time).
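(For example, affine arithmetic tracks first-order correlations by sharing noise symbols, which handles the x-x case exactly; but nonlinear operations introduce new noise terms, and that is where the cost/precision trade-off bites. A minimal sketch with only +/-:)

```python
# Minimal affine-arithmetic sketch: each quantity is center + sum_i c_i*eps_i
# with eps_i in [-1, 1]. Shared noise symbols capture linear correlations, so
# x - x is exactly 0 (unlike naive intervals). Only +/- are implemented here;
# nonlinear ops need extra noise terms, which is where cost and looseness grow.

class Affine:
    def __init__(self, center, terms):
        self.center = center
        self.terms = dict(terms)          # noise symbol -> coefficient

    def __add__(self, other):
        terms = dict(self.terms)
        for k, c in other.terms.items():
            terms[k] = terms.get(k, 0.0) + c
        return Affine(self.center + other.center, terms)

    def __neg__(self):
        return Affine(-self.center, {k: -c for k, c in self.terms.items()})

    def __sub__(self, other):
        return self + (-other)

    def interval(self):
        radius = sum(abs(c) for c in self.terms.values())
        return (self.center - radius, self.center + radius)

x = Affine(0.5, {"eps_x": 0.5})           # represents the interval [0, 1]
print((x - x).interval())                 # (0.0, 0.0): correlation is tracked
```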
More strongly, I think that if an adversary controlled the non-determinism (e.g. summation order) in current efficient inference setups, they would actually be able to strongly influence the AI, to a genuinely dangerous extent. We are likely to depend on this non-determinism being non-adversarial (which is a reasonable assumption to make).
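(A tiny double-precision example of why summation order matters at all:)

```python
# Floating-point addition is not associative, so the summation order chosen by
# the hardware/scheduler changes the result. A tiny double-precision example:
values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]
print(sum(values))      # 0.0: the 1.0 is absorbed when added to 1e16 first
print(sum(reordered))   # 1.0: cancelling the large terms first preserves it
```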
(And you can’t prove a false statement...)
See also heuristic arguments which try to resolve this sort of issue by assuming a lack of structure.
[1] If applied to all potentially dangerous applications.