Epistemic status: I work on a software stack that is used in autonomous cars
Quinn, if you try to create a reasonable near-term alignment strategy, you might end up with a policy such as:
1. A software framework that selects, from the output of multiple models, an action to take on the next frame that maximizes some heuristic of expected consequence and probability of success
2. A hierarchy of implementations, where some simpler and more robust implementations are allowed to take control of the output in some circumstances
3. Empirical estimates of p(success). That is, with input states similar to the current state, how often was model n correct? Note you can’t trust model n to estimate its own correctness.
4. Tons of boilerplate code to bolt the above together—memory handling interfaces, threadpools, device drivers, and so on. (A toy sketch of how steps 1–3 might fit together is below.)
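To make that concrete, here is a minimal, hypothetical sketch of the per-frame arbitration in steps 1 and 2, with step 3's empirical p(success) supplied from outside the models. Every name, the heuristic, the threshold, and the fallback controller are invented for illustration, not a description of any real stack:

```python
# Hypothetical per-frame arbiter (illustration only).
# Step 1: pick, among candidate actions proposed by several models, the one
#         that maximizes (empirical success probability x heuristic value).
# Step 2: hand control to a simpler, more robust fallback controller when no
#         learned candidate clears a confidence threshold.
# Step 3: p_success comes from offline measurement on inputs similar to the
#         current state, never from the model's own confidence score.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    action: object          # whatever the actuation layer consumes
    heuristic_value: float  # heuristic of expected consequence
    p_success: float        # empirical estimate measured outside the model


def arbitrate(
    candidates: List[Candidate],
    fallback: Callable[[], object],
    p_threshold: float = 0.99,
) -> object:
    """Select the action to take on the next frame."""
    viable = [c for c in candidates if c.p_success >= p_threshold]
    if not viable:
        return fallback()  # simpler, more robust implementation takes control
    best = max(viable, key=lambda c: c.p_success * c.heuristic_value)
    return best.action
```

The point of the sketch is only that the selection logic itself is ordinary, non-learned code, which is exactly the kind of thing the pedantic properties below can be established for.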
So while you may not be able to formally prove that the above restrictions will limit the model in all circumstances, you could prove more pedantic properties. Does module X leak memory? Does it run in deterministic time? Does it handle all possible binary states of an input variable? Does an optimized implementation produce the same truth table of outputs as the reference implementation, for all inputs?
These are what formal analysis is for, and an actual software framework that might house neural network models—the code outside the model—is made of thousands of separable functions for which properties like the above can be proven.
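As a toy example of the last kind of check (reference vs. optimized implementation, for all inputs), here is one where the input domain is small enough to enumerate outright. The functions are made up; for real code this would more likely be a proof obligation in Coq or an SMT query rather than a loop, but the property being established is the same:

```python
# Toy equivalence check (illustration only): does an "optimized" branchless
# 8-bit saturating add produce the same truth table as the obvious reference
# implementation, for all 2^16 input pairs?


def saturating_add_ref(a: int, b: int) -> int:
    """Reference: clamp the mathematical sum to the 8-bit range [0, 255]."""
    return min(a + b, 255)


def saturating_add_opt(a: int, b: int) -> int:
    """Branchless variant standing in for hand-optimized production code."""
    s = a + b
    mask = -(s >> 8) & 0xFF  # 0xFF if the sum overflowed 8 bits, else 0x00
    return (s | mask) & 0xFF


if __name__ == "__main__":
    for a in range(256):
        for b in range(256):
            assert saturating_add_ref(a, b) == saturating_add_opt(a, b), (a, b)
    print("optimized implementation matches reference on all 65536 inputs")
```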
Did you find this to be the case? If not, why not?
I like your strategy sketch.
There’s the guard literature in RL theory which might be nice for step one.
I agree with step two, favoring non-learned components over learned components whenever possible. I deeply appreciate thinking of learned components as an attack surface; shrink it when you can.
Really excited about step three. Estimation and evaluation and forecasting are deeply underrated. It’s weird to say, because relative to the rest of civilization the movement overrates forecasting, and maybe the rest of the world will be vindicated and forecasting will go the way of Esperanto, but I don’t think we should think in relative terms. Anyway, I feel like if you had the similarity metric and control that you want for this, it’d be easier to just say “don’t do anything drastic” in a blanket way (of course, competitive pressures to be smarter and bolder than anyone else would ruin this anyway).
Step four is more of an implementation reality than a part of the high-level strategy, but yeah.
Thanks a ton for the comment, it’s exactly the type of thing I wanted my post to generate.
Agree. The product stack has non-learned components in it. Let’s throw the kitchen sink at those components.
Some techniques that go under the banner of formal verification are not applied type theory (deduction); they’re usually in the model checking family, or maybe abstract interpretation. For the record, I only have applied type theory experience: I’m a Coq dev, and I’ve spent a cumulative two hours or so over the past three years pondering abstract interpretation or model checking. That’s why I said “formal proof” in the title instead of “formal verification”. I could’ve called it “proof engineering” or “applied type theory” in the title, maybe, but I didn’t.
In order for there to be alpha in applied type theory for the learned components, a great deal of things need to fall into place: mostly in mechinterp, then some moderate applied type theory work, then a whole lot of developer experience / developer ecosystem work. That’s what I believe.
Ok. So you’re working from the perspective of “can we make a model interpretable and formally prove properties about it”.
I am working from the perspective of “obviously you can’t make a competitive NN model with billions of weights provable, but if you optimize it in certain ways you could make an overall machine with a policy trapped in a local minimum.”
Then, with knowledge of what sorts of inputs produce stable and reliable outputs, measured empirically in simulation, let’s automatically shut down the machine when it leaves the “safe” states.
And let’s stack probabilities in our favor. If we know blackbox model A fails about 0.01 percent of the time on inputs from our test distribution, let’s train model B such that it often succeeds where A fails. (empirically verifiable)
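Here is a hypothetical sketch of how those two ideas (shut down outside the empirically characterized safe region, then stack an A model and a B model) might compose at runtime. All names, checks, and failure rates are invented for illustration:

```python
# Hypothetical runtime guard (illustration only). Shut the system down, i.e.
# hand control to a minimal-risk maneuver, when the input leaves the region
# where behavior was empirically characterized as safe; otherwise run
# black-box model A and fall back to model B when A's output fails a cheap,
# non-learned sanity check.

from typing import Callable


def guarded_step(
    observation: object,
    in_safe_region: Callable[[object], bool],     # fit offline from simulation logs
    model_a: Callable[[object], object],
    model_b: Callable[[object], object],          # trained to succeed where A fails
    plausible: Callable[[object, object], bool],  # non-learned sanity check
    minimal_risk_maneuver: Callable[[], object],
) -> object:
    if not in_safe_region(observation):
        return minimal_risk_maneuver()  # auto shutdown outside the "safe" states
    action = model_a(observation)
    if plausible(observation, action):
        return action
    action = model_b(observation)
    if plausible(observation, action):
        return action
    return minimal_risk_maneuver()


# Back-of-the-envelope probability stacking, with made-up numbers: if A fails
# on about 1e-4 of in-distribution frames and B misses, say, 10% of the frames
# A fails on, the pair fails on roughly 1e-4 * 0.1 = 1e-5 of frames -- provided
# A's failures are detectable by the sanity check and B's error rate on A's
# failure set is measured empirically rather than assumed.
```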
The overall framework—with an A and a B model, and a whole ecosystem so the overall machine doesn’t segfault in a driver and kill people because nothing is in control of the hardware—is where you need formally proven code.
This kind of “down and dirty” engineering based on measurements is basically how we do it in most production software roles. It’s not useful to rely on theory alone, for the reasons you mention.