Proveably Safe Self Driving Cars [Modulo Assumptions]

I’ve seen a fair amount of skepticism about the “Provably Safe AI” paradigm, but I think detractors give it too little credit. I suspect this is largely because of idea inoculation—people have heard an undeveloped or weak man version of the idea, for example, that we can use formal methods to state our goals and prove that an AI will do that, and have already dismissed it. (Not to pick on him at all, but see my question for Scott Aaronson here.)

I will not argue that Guaranteed Safe AI solves AI safety generally, or that it could do so—I will leave that to others. Instead, I want to provide a concrete example of a near-term application, to respond to critics who say that proveability isn’t useful because it can’t be feasibly used in real world cases when it involves the physical world, and when it is embedded within messy human systems. [Edit to add: Doing this does require assumptions in addition to simple provability, as outlined below, so as @tdietterich, suggested, this leads to the amended title.]

I am making far narrower claims than the general ones which have been debated, but at the very least I think it is useful to establish whether this is actually a point of disagreement. And finally, I will admit that the problem I’m describing would be adding proveability to a largely solved problem, but it provides a concrete example for where the approach is viable.

A path to provably safe autonomous vehicles

To start, even critics agree that formal verification is possible, and is already used in practice in certain places. And given (formally specified) threat models in different narrow domains, there are ways to do threat and risk modeling and get different types of guarantees. For example, we already have proveably verifiable code for things like microkernels, and that means we can prove that buffer overflows, arithmetic exceptions, and deadlocks are impossible, and have hard guarantees for worst case execution time. This is a basis for further applications—we want to start at the bottom and build on provably secure systems, and get additional guarantees beyond that point. If we plan to make autonomous cars that are provably safe, we would build starting from that type of kernel, and then we “only” have all of the other safety issues to address.

Secondly, everyone seems to agree that provable safety in physical systems requires a model of the world, and given the limits of physics, the limits of our models, and so on, any such approach can only provide approximate guarantees, and proofs would be conditional on those models. For example, we aren’t going to formally verify that Newtonian physics is correct, we’re instead formally verifying that if Newtonian physics is correct, the car will not crash in some situation.

Proven Input Reliability

Given that, can we guarantee that a car has some low probability of crashing?

Again, we need to build from the bottom up. We can show that sensors have some specific failure rate, and use that to show a low probability of not identifying other cars, or humans—not in the direct formal verification sense, but instead with the types of guarantees typically used for hardware, with known failure rates, built in error detection, and redundancy. I’m not going to talk about how to do that class of risk analysis, but (modulus adversarial attacks, which I’ll mention later,) estimating engineering reliability is a solved problem—if we don’t have other problems to deal with. But we do, because cars are complex and interact with the wider world—so the trick will be integrating those risk analysis guarantees that we can prove into larger systems, and finding ways to build broader guarantees on top of them.

But for the engineering reliability, we don’t only have engineering proof. Work like DARPA’s VerifAI is “applying formal methods to perception and ML components.” Building guarantees about perception failure rates based on the sensors gives us another layer of proven architecture to build on. And we could do similar work for how cars deal with mechanical failures, other types of sensor failures, and so on as inputs to the safety model. Of course, this is not a challenge particularly related to AI, and it is a (largely solved) problem related to vehicle reliability, and traditional engineering analysis and failure rates could instead be one of the inputs to the model assumptions, with attendant issues dealing with uncertainty propagation so we get proven probabilistic guarantees at this level as well.

Proven Safe Driving

Proving that a car is not just safe to drive, but is being driven safely requires a very different set of assumptions and approaches. To get such an assurance, we’d need provable formal statements about what safe driving is, in order to prove them. And it’s important to figure out what safety means in multi-agent sociotechnical systems. For example, we say someone is a safe driver when they drive defensively, even if another car could crash into them. That’s because safety in multiperson systems isn’t about guaranteeing no harm, it’s about guaranteeing that the agents behavior doesn’t cause that harm.

Luckily, it turns out that there’s something remarkably close to what we want already. Responsibility Sensitive Safety (RSS) [Edit to add: with a formal paper laying out the details and proof here.] formalizes what a car needs to do not to cause accidents. That is, if a car drives safely, any accidents which could occur will be attributable to other cars actions. In the case of RSS, it’s provable that if other cars follow the law, and/​or all cars on the road abide by the rules[Edit to add: , and no pedestrians jump out in front of the car when not at a crosswalk, or do similar things which the car cannot dodge, nor perform other illegal acts, and there are no acts of god such as meteors hitting the car or lightning strikes, then with some finite and low probability,] those cars will only have [edit: cause] accidents if their sensors are incorrect. Of course, if another car[ or pedestrian] fails to abide by the rules, safety isn’t guaranteed—but as we’ll mention later, safe driving can’t mean that the car cannot be hit by a negligent or malicious driver, otherwise the safety we’re working towards is impossible!

Proving that cars won’t cause crashes now builds on the risk analysis we described as provable above. Relying on assumptions about other car behavior is a limit to provable safety—but one that provides tremendously more leverage for proof than just removing buffer overflow attacks. That is, if we can show formally that given correct sensor data, a car will only do things allowed in the RSS model, and we build that system on top of the sensors described above, we have shown that it is safe in a very strong sense[; again, this means the system is proveably safe modulo assumptions].

This three part blueprint for provable safety of cars can address the levels in between the provably safe kernel, the responsibility sensitive safety guarantees, and the risk analysis for sensors. If we can prove that code running on the safe kernel can proveably provide the assurances needed for driving, on the condition that the sensors work correctly, and can provide engineering reliability results for those sensors, we have built a system that has provably bounded risk.

Provably Safe ML

Unfortunately, self-driving cars don’t actually just use code of types that can be formally verified, they use neural networks—systems which are poorly understood and vulnerable to a wide variety of hard to characterize failures. Thankfully, we do not need to solve AI safety in general to have safety narrowly. How can we do this?

One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver—it just keeps such systems in a provably safe box.

That isn’t, of course, the only approach. Another is to have the AI system trained to drive the way we want, then use model parroting or a similar approach in order to train a much simpler and fully interpretable model, such that we can verify its properties formally. Alternatively, we can use something like constructible AI in place of black-box systems, and prove properties of the composed parts. In each case, Guaranteed Safe AI is not a tool for guaranteeing AI alignment in general, it is a tool for building specific safe systems.

Adversarial Concerns

Once a self-driving car is constrained to follow the rules of the road with some provable reliability, despite failures in its systems, we still need to worry about other concerns. Most critically, we need to consider adversarial attacks and robustness, on two fronts. The first is indirectly malicious adversarial behavior, accidentally or purposefully using the self-driving cars limitations and rule sets to exploit their behavior. These can be severe, such as causing crashes, as discussed here. But even a safe car cannot eliminate such attacks, as mentioned earlier. In fact we would hope that, say, a car driving towards an autonomous vehicle at high speed would cause the autonomous vehicle to move out of the way, even if that meant it instead hits a stationary object or a different car, if it can do so in ways that reduce damage. Less extreme are attacks that cause the car to move to the side of the road unnecessarily, or create other nuisances for drivers. These acts, while unfortunate, are not safety issues.

However, there is a more widely discussed concern that engineered attacks on our otherwise safe system could “hide” stop signs, as has been shown, repeatedly, or perhaps modify other car’s paint so that the other car is not recognized by sensors. This is definitely a concern that robustness research has been working on, and such work is useful. On the other hand, we do not blame human drivers if someone maliciously removes a stop sign; one again, provable safety does not imply that others cannot cause harm.

We also note that the claim that an adversarial attack was conducted without violating laws, including traffic laws, does not shield the perpetrator from criminal charges including attempted murder, and the “intentional act exception“ for insurance would mean that the perpetrator of such acts would be personally liable, without any insurance protection. Our extant legal system can handle this without any need for specific changes.

Defining Broader Sociotechnological Proveably Safe AI

There are other domains where the types of safety guarantees we want for AI systems are much stronger than simply not causing harm. For example, an AI system that explains how to make bioweapons would not itself have caused harm, it would only have enabled a malicious actor to do so. But what we have shown is that we can build a sociotechnical definition of responsible behavior that is coherent, and can be specified in a given domain.

In some cases, similar to the self-driving car example, the rule we desire could closely parallel current law for humans. In the bioweapons case, materially assisting in the commission of a bioterror attack would certainly be illegal for humans, and the same could be required for artificial intelligence. Formalizing this is obviously difficult, but it is not equivalent to the effectively impossible task of fully modeling the entire human body and its interaction with some novel pathogen. (To avoid the need, we can allow the false positive rate to be far higher than the false negative rate, as many humans do when unsure if an action is strictly illegal, and only allow things we’re resonably sureare safe.)

But in other cases, such as autonomy risks from AI, we would expect that the rules needed for AI systems would differ substantially from human law, and definitions for what qualifies as safety would need to be developed before proveablity could be meaningful.

Conclusion

It seems that building provably safe systems in the real world is far from an unsolvable problem, as long as we restrict the problem to solve to something that is clearly defined. We can imagine similar guarantees for AI systems for cyber-security, with protections against privilege escalation attacks performed by the AI, or for “escape” scenarios, with protections against self-exfiltration, or (as suggested but, in my view mistakenly, derided) for DNA synthesis, with guarantees that all synthesis orders were screened for misuse potential. None of these will stop people from doing these unsafely in ways not covered by safety guarantees, nor will they prevent hypothetical strongly superhuman AI from finding devious edge cases, or inventing and exploiting new physics.

And if that is the concern, looking more towards AI alignment, we could even construct systems with formal guarantees that all planning is approved by a specific process, to scaffold proposals like Humans consulting HCH, or stronger alternatives. But to provide these stronger guarantees, we need some fundamental progress in proveability of types which are needed for more prosaic applications like self-driving. And in the meantime, there are many cases which could benefit from the clearer definitions and stronger guarantees of safety that proveably safe AI would provide.