Note that effectively we are saying to trust the neural network
I expect that we’re going to have to rely on some neural networks regardless of how we approach AI. This paper guides us to be more strategic about how much reliance to place on which neural networks.
Fortunately, for coarse “guardrails” the specs are pretty simple and can often be reused in many contexts. For example, all software we want to run should have proofs that: 1) there aren’t memory leaks, 2) there aren’t out-of-bounds memory accesses, 3) there aren’t race conditions, 4) there aren’t type violations, 5) there aren’t buffer overflows, 6) private information is not exposed by the program, 7) there aren’t infinite loops, etc. There should be a widely used “metaspec” for those criteria, which most program synthesis AIs will have to prove their generated code satisfies. Similarly, there are common constraints for many physical systems: e.g., robots, cars, planes, boats, etc. shouldn’t crash into things or harm humans.
The more refined the rules are, the more subtle they become. To prevent existentially bad outcomes, I believe coarse constraints suffice. But certainly we eventually want much more refined models of the world and of the outcomes we seek. I’m a fan of “Digital Twins” of physical systems, which allow rules and constraints to be run in simulation, and that can help in choosing specifications. We certainly want those simulations to be trusted, which can be achieved by proving that the code actually simulates the systems it claims to.
Eventually it would be great to have fully trusted AI as well! Mechanistic Interpretability should be great for that! I’ve just been reading about Anthropic’s nice recent advances in that area. If that continues to make progress, it will make our lives much easier, but it doesn’t eliminate the need to ensure that misaligned AGI and malicious AGI don’t cause harm. The big win with the proof checking and the cryptographic hardware we propose is that we can ensure that even powerful systems will obey rules that humanity selects. If we don’t implement that kind of system (or something functionally equivalent), then there will be dangerous pathways which malicious AGI can exploit to cause great harm to humans.
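To make the “metaspec” idea a bit more concrete, here is a minimal sketch of a gate that refuses to run generated code unless every property in a shared metaspec comes with a proof that a checker accepts. Everything named here is an assumption for illustration: the property names are just labels for the seven items listed above, and `check` stands in for a small trusted proof checker that this sketch does not implement.

```python
from __future__ import annotations

from typing import Callable, Mapping

# Hypothetical labels for the coarse properties listed in the comment above.
METASPEC = frozenset({
    "no_memory_leaks",
    "no_out_of_bounds_access",
    "no_race_conditions",
    "no_type_violations",
    "no_buffer_overflows",
    "no_private_data_exposure",
    "terminates",
})

# Assumed interface: check(program, property_name, proof) returns True only
# if the proof establishes the property for that program. A real checker
# would be a small trusted kernel; here it is injected by the caller.
ProofChecker = Callable[[str, str, str], bool]


def unproven_properties(
    program: str,
    proofs: Mapping[str, str],
    check: ProofChecker,
    metaspec: frozenset[str] = METASPEC,
) -> list[str]:
    """Return, sorted, the metaspec properties with no accepted proof."""
    return sorted(
        prop
        for prop in metaspec
        if prop not in proofs or not check(program, prop, proofs[prop])
    )


def may_run(
    program: str,
    proofs: Mapping[str, str],
    check: ProofChecker,
    metaspec: frozenset[str] = METASPEC,
) -> bool:
    """Allow execution only if every metaspec property has an accepted proof."""
    return not unproven_properties(program, proofs, check, metaspec)
```

The point of the design is that the untrusted part (whatever neural network wrote the program and the proofs) never gets a vote: only the checker’s verdict on each property decides whether the code runs, and the same metaspec can be reused across many programs.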