Provably Safe AI: Worldview and Projects

In September 2023, Max Tegmark and Steve Omohundro proposed “Provably Safe AI” as a strategy for AI Safety. In May 2024, a larger group delineated the broader concept of “Guaranteed Safe AI”, which includes Provably Safe AI and other related strategies. In July 2024, Ben Goldhaber and Steve discussed Provably Safe AI and its future possibilities, as summarized in this document.

Background

In June 2024, ex-OpenAI AI Safety Researcher Leopold Aschenbrenner wrote a 165-page document entitled “Situational Awareness: The Decade Ahead” summarizing AI timeline evidence and beliefs that are shared by many frontier AI researchers. He argued that human-level AI is likely by 2027 and will likely lead to superhuman AI in 2028 or 2029. “Transformative AI” was coined by Open Philanthropy to describe AI which can “precipitate a transition comparable to the agricultural or industrial revolution”. There appears to be a significant probability that Transformative AI may be created by 2030. If this probability is, say, greater than 10%, then humanity must immediately begin to prepare for it.

The social changes and upheaval caused by Transformative AI are likely to be enormous. There will likely be many benefits but also many risks and dangers, perhaps even existential risks for humanity. Today’s technological infrastructure is riddled with flaws and security holes. Power grids, cell service, and internet services have all proven vulnerable to accidents and attacks. Terrorists have attacked critical infrastructure as a political statement. Today’s cybersecurity and physical security barely keep human attackers at bay. When such attackers obtain access to powerful cyberattack AIs, they will likely be able to cause enormous social damage and upheaval.

Humanity has known how to write provably correct and secure software since Alan Turing’s 1949 paper. Unfortunately, proving program correctness requires mathematical sophistication, and the practice remains rare in current software development. Fortunately, modern deep learning systems are becoming proficient at proving mathematical theorems and generating provably correct code. When combined with techniques like “autoformalization,” this should enable powerful AI to rapidly replace today’s flawed and insecure codebase with optimized, secure, and provably correct replacements. Many researchers working in these areas believe that AI theorem-proving at the level of human PhDs is likely about two years away.

Similar issues plague hardware correctness and security, and it will be a much larger project to replace today’s flawed and insecure hardware. Max and Steve propose formal methods grounded in mathematical physics to produce provably safe physical designs. The same AI techniques which are revolutionizing theorem proving and provable software synthesis are also applicable to provable hardware design.

Finally, today’s social mechanisms, like money, contracts, voting, and the structures of governance, will also need to be updated for the new realities of an AI-driven society. Here too, the underlying rules of social interaction can be formalized, provably effective social protocols can be designed, and secure hardware implementing the new rules can be synthesized using powerful theorem-proving AIs.

What’s next?

Given the huge potential risk of uncontrolled powerful AI, many have argued for a pause in Frontier AI development. Unfortunately, that does not appear to be a stable solution. Even if the US paused its AI development, China or other countries could gain an advantage by accelerating their own work.

There have been similar calls to limit the power of open source AI models. But, again, any group anywhere in the world can release their powerful AI model weights on BitTorrent. Mark Zuckerberg has concluded that it is better to open source powerful AI models for the white hats than to try to keep them secret from the black hats. In July 2024, Meta released the Llama 3.1 405B large language model, which has performance comparable to the best closed source models. We should expect access to the most powerful AI models to grow, and that this will enable both new levels of harmful cyber/social attacks and powerful new safety and security capabilities.

Society tends to respond to threats only after they have already caused harm. We have repeatedly seen that once harmful actions have occurred, however, citizens and politicians rapidly demand preventive measures against recurrences. An AI-powered cyberattack which brings down the power grid, financial system, or communication networks would likely cause a huge demand for AI-secure infrastructure. One powerful way to have an impact is to prepare a strategy in advance for responding to that demand. The Manifold prediction market currently estimates a 29% chance of an AI “warning shot” before October 20, 2026. If this is accurate, it would be of great value to create detailed plans over the next two years.

Here is a potential positive scenario: In 2025, groups who see the potential for “Provably Safe Technology” fine-tune Llama 4 to create a powerful model for generating verified software and hardware. This model spreads widely, and by using “autoformalization,” groups are able to feed it their current documentation and code and have it generate highly optimized, provably correct, and provably secure replacements. The world’s software vulnerabilities rapidly disappear. Chipmakers use the new AI to rapidly generate provably correct, efficient, and provably secure chips for every function. The new AI redesigns cryptographic protocols to eliminate quantum vulnerabilities and other cryptographic weaknesses. The AI also generates new social mechanisms, eliminating current inefficiencies and vulnerabilities to corruption. And the new mechanisms enable governments to create mutually beneficial treaties with hardware enforcement. “Provably-Constrained Self-Improvement” replaces “Unconstrained Self-Improvement,” and GPU hardware is changed to execute only constrained models. AI improves to help humanity, and we create infrastructure that enables all humans to thrive!

With more provably secure infrastructure in place, the surface area for disruption from AI misuse, and for exploitation by rogue autonomous AIs, will be reduced. Expanding the ‘formally safe zone’ will increase the difficulty of takeover by AIs in general.

In the longer term, the “provably safe” approach has the potential to reduce the risk of AI takeover and to solve many of humanity’s current problems. This requires building infrastructure which is invulnerable to AI attacks (as outlined in the Tegmark and Omohundro paper). It would enable a future in which powerful AIs can be tasked with solving humanity’s biggest challenges (eg. curing cancer, solving the climate crisis, eliminating social dilemmas, etc.) while provably preventing runaway self-improvement.

Some Useful Projects

Strategically, the above considerations suggest that we emphasize preparation for the use of powerful future AI tools rather than putting large efforts into building systems with today’s tools. For example, today a software team might spend a year designing, building, and testing a system with a good user interface to use AI to safely control an industrial process. In two years, an AI synthesis tool may be able to generate a formally verified implementation in an afternoon from a specification of the requirements. As AI software synthesis and theorem proving systems advance, they should be able to rapidly impact human infrastructure.

The best developed of the needed tools are currently software formal methods systems like the Dafny Programming and Verification Language, which can prove that code satisfies a formal specification. But they require accurate specifications of both the semantics of the programming language and the desired software behavior.
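
For concreteness, here is a minimal sketch of the kind of contract such a tool checks; the verifier accepts the method only if it can prove the ensures clauses from the code and loop invariants.

```dafny
// Minimal Dafny sketch: find an index of a maximum element.
// Dafny accepts this method only after proving the postconditions
// from the loop invariants and the code itself.
method MaxIndex(a: array<int>) returns (idx: int)
  requires a.Length > 0
  ensures 0 <= idx < a.Length
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= a[idx]
{
  idx := 0;
  var j := 1;
  while j < a.Length
    invariant 0 <= idx < j <= a.Length
    invariant forall i :: 0 <= i < j ==> a[i] <= a[idx]
  {
    if a[j] > a[idx] { idx := j; }
    j := j + 1;
  }
}
```

Here the ensures clauses are the formal specification; the verifier, rather than testing, is what enforces them.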

Below are ideas for projects that can advance the Provably Safe agenda, either by developing the principles and methods for building Provably Safe systems or by demonstrating the potential of provably safe systems using current tools or tools likely to be created soon.

1) Improve methods for creating formal specifications: of systems, desired behaviors, and safety requirements. For example, there should be databases of existing specification elements and semantic search to find and verify desired behaviors.

2) Formally translate between formal languages: There are many theorem proving languages (eg. Lean, Coq, Isabelle, MetaMath, etc.) and many formal methods languages (eg. Dafny, Alloy, PRISM, etc.) but almost no interconvertibility between these. All the powerful mathematical theorems in Lean should be available to a programmer using Dafny. Modern AI tools should be able to translate between any of these and to represent the semantics of each.
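
As a small illustration of the gap: the commutativity of addition is available in Lean 4 as Nat.add_comm, but a Dafny user cannot currently import it; a translation tool would need to map such statements (and ideally their proofs) across systems. A sketch of the Dafny-side statement, where Dafny’s SMT backend simply re-proves the fact rather than importing Lean’s proof:

```dafny
// Commutativity of natural-number addition, restated in Dafny.
// In Lean 4 the corresponding library lemma is Nat.add_comm.
lemma AddComm(m: nat, n: nat)
  ensures m + n == n + m
{
  // Discharged automatically by Dafny's arithmetic reasoning.
}
```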

3) A semantic library for probabilistic programming and machine learning: All of the popular statistical and machine learning methods should be formally specified with a range of provable bounds on behavior.
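
As a sketch of what entries in such a library might look like, here are machine-checkable (and deliberately trivial) range facts for two common primitives; genuine statistical guarantees, such as concentration bounds for estimators, would require substantially more formal machinery. The names below are illustrative.

```dafny
// Two toy "library entries": formal definitions plus provable bounds.
function ReLU(x: real): real {
  if x < 0.0 then 0.0 else x
}

function Clip(x: real, lo: real, hi: real): real
  requires lo <= hi
{
  if x < lo then lo else if x > hi then hi else x
}

lemma ReLUNonNegative(x: real)
  ensures ReLU(x) >= 0.0
{
}

lemma ClipInRange(x: real, lo: real, hi: real)
  requires lo <= hi
  ensures lo <= Clip(x, lo, hi) <= hi
{
}
```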

4) Precise formal specifications for physical systems: Each of the engineering disciplines (eg. Mechanical Engineering, Electrical Engineering, Chemical Engineering, etc.) needs to be formally codified and grounded in fundamental physics. Engineers should be able to formally verify properties of their designs. AI systems should be able to generate designs which provably meet desired specifications.
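
As a toy illustration of codified engineering knowledge, the sketch below assumes a constant-deceleration stopping-distance model, d = v²/(2a); a real formalization would have to ground such rules in validated physical models and units.

```dafny
// Toy formalized engineering rule: a following-distance requirement
// derived from the constant-deceleration stopping-distance formula.
predicate SafeGap(gap: real, speed: real, maxDecel: real)
  requires maxDecel > 0.0
{
  gap > (speed * speed) / (2.0 * maxDecel)
}

method CheckGap(gap: real, speed: real, maxDecel: real) returns (ok: bool)
  requires maxDecel > 0.0
  ensures ok <==> SafeGap(gap, speed, maxDecel)
{
  ok := gap > (speed * speed) / (2.0 * maxDecel);
}
```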

5) Formal models of security and cryptographic properties: Provable verification of human designs, automatic generation of verifiable designs, and autoformalization of existing designs with resynthesis to eliminate flaws and security holes.
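
One hedged sketch of what such a formal model could look like is an abstract specification that any concrete design is expected to refine, here a simple functional round-trip property for a symmetric cipher; genuine cryptographic security notions (eg. indistinguishability against bounded adversaries) would need a much richer probabilistic formalization.

```dafny
// Abstract specification: a concrete cipher refining this module is
// expected to supply implementations and a proof of the round-trip lemma.
abstract module CipherSpec {
  type Key
  type Plaintext
  type Ciphertext

  function Encrypt(k: Key, m: Plaintext): Ciphertext
  function Decrypt(k: Key, c: Ciphertext): Plaintext

  // Intended proof obligation: decryption inverts encryption.
  lemma RoundTrip(k: Key, m: Plaintext)
    ensures Decrypt(k, Encrypt(k, m)) == m
}
```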

6) Demonstrate provable safety in a domain: by showing that an adversary with a specified set of capabilities is not able to push a designed system into a specified unsafe state.
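
The usual shape of such a demonstration: model the system state, enumerate the adversary’s assumed capabilities as transitions, and prove an invariant that excludes the unsafe states. A minimal sketch, under the assumed capability model that the adversary can send arbitrary messages but cannot produce valid signatures:

```dafny
datatype Cmd = Trip | Close
datatype Msg = Msg(cmd: Cmd, validlySigned: bool)
datatype Breaker = Breaker(tripped: bool)

// Assumed capability model: the adversary can send any unsigned message.
predicate AdversaryCanSend(m: Msg) { !m.validlySigned }

// The relay acts only on validly signed commands.
function Step(b: Breaker, m: Msg): Breaker {
  if m.validlySigned then Breaker(m.cmd == Trip) else b
}

predicate Unsafe(b: Breaker) { b.tripped }

// No message the adversary can send moves the system into the unsafe state.
lemma AdversaryCannotTrip(b: Breaker, m: Msg)
  requires !Unsafe(b) && AdversaryCanSend(m)
  ensures !Unsafe(Step(b, m))
{
}
```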

7) Build a formal model of mechanical engineering: capturing all the elements that today’s engineers use in mechanical designs. While much of mathematics has been formalized, with great success, much of mechanical engineering practice is not yet represented.

8) Provably unpickable mechanical lock: A concrete example project combining security with mechanical engineering would be to design a “provably unpickable mechanical lock”. This would involve simulating designs of mechanical locks and the processes for mechanically picking them, verifying human designs, and automatically generating AI designs which provably cannot be opened by mechanical picking.
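
A first step could look something like the deliberately tiny abstraction below; the pin model, the “security pin” behavior, and the picking capability are all illustrative assumptions, and a serious project would have to model the mechanics of real locks and picking tools in far greater detail.

```dafny
// Toy lock model: the cylinder turns only when all pin stacks are set at
// once; the modeled picking move sets one pin but drops all the others
// (a crude stand-in for a security-pin mechanism).
datatype Lock = Lock(totalPins: nat, setPins: nat, isOpen: bool)

predicate Valid(l: Lock) { l.setPins <= l.totalPins }

function PickOnePin(l: Lock): Lock
  requires Valid(l) && l.totalPins >= 1
{
  Lock(l.totalPins, 1, l.isOpen || l.totalPins == 1)
}

predicate Inv(l: Lock) { Valid(l) && !l.isOpen && l.setPins <= 1 }

// Against this picking capability, a lock with two or more pins never opens.
lemma PickingNeverOpens(l: Lock)
  requires l.totalPins >= 2 && Inv(l)
  ensures Inv(PickOnePin(l))
{
}
```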

9) Composable Specifications Library: Develop a comprehensive library of safety specifications that can be composed and reused across different GS AI projects. This initiative would include:

a) Cataloging and standardizing existing specifications from various domains.

b) Creating tools and methodologies for combining and adapting specifications for new use cases.

c) Implementing adversarial processes to continually generate and refine specifications, potentially leveraging automated systems (example)

d) Researching methods to provide assurance of specification completeness and coverage.
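
In a verification language such as Dafny, “specification elements” could be small named predicates that are catalogued once and then composed into new contracts. A minimal sketch:

```dafny
// Reusable specification elements from a hypothetical catalog.
predicate Sorted(s: seq<int>) {
  forall i, j :: 0 <= i < j < |s| ==> s[i] <= s[j]
}

predicate NoDuplicates(s: seq<int>) {
  forall i, j :: 0 <= i < j < |s| ==> s[i] != s[j]
}

// A new specification built by composing existing elements.
predicate StrictlySorted(s: seq<int>) {
  Sorted(s) && NoDuplicates(s)
}
```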

10) Guaranteed Safe by Design Development Process: At the heart of the agenda is using a world model, safety specifications, and a verifier (described in the recent paper Towards Guaranteed Safe AI) to verify the plans and actions of AIs. This is a framework and methodology for developing AI systems, and it needs to be operationalized for real-world systems. Examples of projects here include:

a) Defining the adversarial exploration process for generating specs.

b) Creating the process for defining the ‘relevant’ world model for a domain.

c) Creating processes for updating world models and safety specifications over time.

d) Creating a prize and incentive system for generating the pieces of a system. This could be meta: for example, can we create the safety specs needed for operationalizing prize generation to resist adversary exploitation?
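
One schematic way to express the gatekeeper shape, as a sketch rather than the formalism of the paper, is an abstract interface whose implementations may only execute actions that have been certified against the world model and the safety specification:

```dafny
// Schematic gatekeeper interface: Execute may only be called on actions
// whose every modeled outcome satisfies the safety specification.
abstract module GatekeeperSketch {
  type WorldState
  type Action

  // World model: which successor states an action may lead to.
  ghost predicate CanLeadTo(s: WorldState, a: Action, s': WorldState)

  // Safety specification over world states.
  ghost predicate Safe(s: WorldState)

  // What a verifier must establish before an action is released.
  ghost predicate Certified(s: WorldState, a: Action) {
    forall s' :: CanLeadTo(s, a, s') ==> Safe(s')
  }

  method Execute(s: WorldState, a: Action)
    requires Certified(s, a)
}
```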

11) Provable hardware projects: Create foundational building blocks for guaranteed safe AI systems, focusing on components like tamper-evident sensors, secure memory, and verifiable computation units. An initial goal is to demonstrate that it’s possible to construct hardware with mathematically provable security properties, providing a robust foundation for safe AI development.
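
As a software-level stand-in for the idea (real hardware guarantees would require modeling the physical device and its tamper-evidence), here is a toy “secure memory cell” whose verified contract permits access only by the owning principal; the names are illustrative.

```dafny
datatype Principal = Principal(id: nat)

// Toy model of an access-controlled memory cell; the contracts make
// any read or write by a non-owner a verification error at the call site.
class SecureCell {
  var owner: Principal
  var value: int

  constructor (p: Principal, v: int)
    ensures owner == p && value == v
  {
    owner := p;
    value := v;
  }

  method Read(caller: Principal) returns (v: int)
    requires caller == owner
    ensures v == value
  {
    v := value;
  }

  method Write(caller: Principal, v: int)
    requires caller == owner
    modifies this
    ensures value == v && owner == old(owner)
  {
    value := v;
  }
}
```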

12) Digital Twins for Guaranteed Safe AI: Create accurate digital representations of both cyber-cyber systems and cyber-physical systems following GS AI principles; these projects could be pushed forward by non-profit or for-profit companies. Ideally these would be interoperable simulations, perhaps building on existing initiatives like NVIDIA’s Omniverse. Examples include building digital twins for existing AI evaluation workflows, using digital twins for cybersecurity, and building provably secure buildings which include specs that represent a wide range of human preferences.

13) Demo Networks: Create small-scale, fully provably secure systems that serve as evidence of the feasibility of Guaranteed Safe AI approaches. These demo networks would showcase end-to-end security, from hardware to software, would provide tangible examples for stakeholders and policymakers, and could be adopted during “warning shot” scenarios to illustrate the practicality and effectiveness of GS AI principles. For example: a provably secure camera system, or a provably secure communication network.

14) Coordination and Governance: Industry coordination projects (example) that develop the protocols and standards for formally verifying software and hardware, or work on policy ideas and initiatives for using strong hardware-enforced mechanisms to verify treaties for executing constrained models.