Gerald Monroe comments on Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

Gerald Monroe 16 Feb 2023 20:26 UTC
2 points
0
Ok. So you’re working from the perspective of “can we make a model interpretable and to formally prove properties”.

I am working from the perspective of “obviously you can’t make a competitive nn model with billions of weights provable, but if you optimize it certain ways you could make an overall machine with a policy trapped in a local minima.”

Then, with knowledge of what sort of inputs produce stable and reliable outputs, measured empirically in simulation, let’s auto shut down the machine when it leaves the “safe” states.

And let’s stack probabilities in our favor. If we know blackbox model A fails about 0.01 percent of the time on inputs from our test distribution, let’s train model B such that it often succeeds where A fails. (empirically verifiable)

The overall framework—with an A and a B model, and a whole ecosystem so the overall machine doesn’t segfault in a driver and kill people because nothing is in control of the hardware—is where you need formally proven code.

This kind of “down and dirty ” engineering based on measurements is basically how we do it in most production software roles. It’s not useful to rely on theory alone, for the reasons you mention.