But in any case, advocates of GS approaches are not, for the most part, talking about estimates, but instead believe we can obtain strong proofs that can effectively guarantee failure rates of 0% for complex AI software systems deployed in the physical world
I don’t think this paragraph’s description of the Guaranteed Safe AI approach is accurate or fair. Different individuals may place varying emphasis on the claims involved. If we examine the Guaranteed Safe AI position paper that you mentioned (https://arxiv.org/abs/2405.06624), we’ll notice a more nuanced presentation in two key aspects:
1. The safety specification itself may involve probabilities of harmful outcomes. The approach does not rely on guaranteeing a 0% failure rate, but rather on ensuring a quantifiable bound on the probability of failure. This becomes clear in Davidad’s Safeguarded AI program thesis: https://www.aria.org.uk/wp-content/uploads/2024/01/ARIA-Safeguarded-AI-Programme-Thesis-V1.pdf
2. The verifier itself can fall within a spectrum and still be considered consistent with the Guaranteed Safe AI approach. While having a verifier that can produce a formal proof of the specified probability bound, which can be checked in a proof checker, would be very powerful, it’s worth noting that a procedure capable of computing the probability bound, for which we have quantifiable convergence rates, would also be regarded as a form of guaranteed quantitative safety verification. (See the levels in Section 3.4 of https://arxiv.org/abs/2405.06624.)
With that being said, I believe that setting an ambitious goal like “Provable/Guaranteed Safe AI” and clearly defining what it would mean to achieve such a goal, along with conceptual tools for systematically evaluating our progress, is extremely valuable. Given the high stakes involved, I think that even if it turns out that the most advanced version of the Guaranteed Safe AI approach is not possible (which we cannot ascertain at this point), it is still both useful and necessary to frame the conversation and assess current approaches through this lens.
Agustin—thanks for your thoughtful comment.
The concern you raise is something I thought about quite a bit while writing this post/paper. I address it briefly in several parts of the post, and I considered addressing it explicitly in greater detail but ultimately decided not to because the post was already getting quite long. The first part where I mention it is quoted below, along with the quote from [1] that appears in that part of the post, since it adds helpful context.
At the same time, obtaining an estimate that a DNA synthesis machine will only produce a dangerous pathogen 30% (or even 1%) of the time is much less interesting than a guarantee that it will not do so at all. But in any case, advocates of GS approaches are not, for the most part, talking about estimates, but instead believe we can obtain strong proofs that can effectively guarantee failure rates of 0% for complex AI software systems deployed in the physical world, as we can see from the following quote (emphasis mine again):
Proof-carrying AGI running on PCH appears to be the only hope for a guaranteed solution to the control problem: no matter how superintelligent an AI is, it can’t do what’s provably impossible. So, if a person or organization wants to be sure that their AGI never lies, never escapes and never invents bioweapons, they need to impose those requirements and never run versions that don’t provably obey them.
Proof-carrying AGI and PCH can also eliminate misuse. No malicious user can coax an AGI controlled via an API to do something harmful that it provably cannot do. And malicious users can’t use an open-sourced AGI to do something harmful that violates the PCH specifications of the hardware it must run on. There must be global industry standards that check proofs to constrain what code powerful hardware and operating systems will run. [1]
Now, jumping into your specific points in greater detail:
First, I want to note that there is a spectrum of ways in which “GS researchers” talk about these techniques and how strong they expect the guarantees to be. For example, in [1] (Tegmark/Omohundro), the words “probability” or “probabilistic” are never mentioned until two of the sub-sections of the “challenge problems” section towards the end of the paper. I also think the section quoted from [1] above gives a good sense of how these authors expect that we will need proofs offering 100% certainty for most real safety use cases (e.g. “So, if a person or organization wants to be sure that their AGI never lies, never escapes and never invents bioweapons, they need to impose those requirements and never run versions that don’t provably obey them.”). Because, as the authors discuss, we can expect adversarial activity from both human actors and from AIs themselves, they essentially require proofs that are 100% certain to support their claims. I discuss this later in my post as well:
And second, as the same researchers point out, the degree of adversarial activity that may be encountered in the AI safety context may in fact require that this [Limitation 1] is overcome in order for formal verification-based approaches to succeed. Quoting again from Dalrymple et al.,
Moreover, it is also important to note that AI systems often will be deployed in adversarial settings, where human actors (or other AIs) actively try to break their safety measures. In such settings empirical evaluations are likely to be inadequate; there is always a risk that an adversary could be more competent at finding dangerous inputs, unless you have a strong guarantee to the contrary…. This makes it challenging for an empirical approach to rule out instances of deceptive alignment, where a system is acting to subvert the evaluation procedure by detecting features of the input distribution that are unique to the test environment (Hubinger et al., 2021). [3]
This brings us to the authors of [3], which is the paper you mention. While I agree that Dalrymple et al. discuss probabilistic approaches quite a bit in their paper, I think that by their own account (as articulated in the quote above), essentially all of the really interesting results they hope for would require proofs with certainty of 100%, or very close to it. This is true for longer-term threats like loss of control, where strong AI adversaries will exploit any weakness, but also for situations like the DNA synthesis machine, where a guarantee that it will only produce a bioweapon 10%, or even 1%, of the time is not very interesting, both because of the significance of even a single failure and because human adversaries will actively seek to misuse the machines. With all of this in mind, my sense is that even the authors of [3] would acknowledge that, when we zoom in on the AI threat scenarios of greatest interest, guarantees at or very near 100% would be required to achieve interesting results. That said, I would love for someone to articulate a counterexample: a probabilistic guarantee about an AI threat that is well below 100% certainty but would still be interesting.
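To make the adversarial point a bit more concrete, here is a rough back-of-the-envelope sketch (my own illustrative numbers and code, not anything taken from [1] or [3]): even a small per-query failure bound compounds quickly once an adversary can make many independent attempts.
```python
# Rough illustration (my own assumptions, not from the papers discussed):
# if the best guarantee we have is that any single adversarial query defeats
# the safety measure with probability at most p, then over n independent
# attempts the chance of at least one success is 1 - (1 - p)^n.

def prob_at_least_one_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds,
    given a per-attempt success probability of p."""
    return 1.0 - (1.0 - p) ** n

for p in (0.10, 0.01, 0.001):
    for n in (10, 100, 1000):
        print(f"p={p:<5}  attempts={n:<4}  "
              f"P(at least one misuse) ~ {prob_at_least_one_success(p, n):.3f}")

# With p = 0.01 and 1,000 attempts the probability is already ~1.000, which is
# why a 1% per-query guarantee offers little comfort against persistent misuse.
```
And treating the attempts as independent is itself generous to the defender, since a capable adversary can adapt across attempts.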
There is one more note related to the above, which I won’t elaborate in great detail here but will sketch. Namely, I believe that for examples like proving that a DNA synthesis machine will not cause harm, or that an AGI will not escape our control, the level of GS modeling/specification/verification required to obtain a version of such a proof/guarantee with, say, 50% confidence is probably quite close to the level required to obtain close to 100% confidence, because it is hard to see how one could obtain such a result without still creating extremely detailed simulations of large parts of the AI system and the physical world. Again, I am open to being proven wrong here if someone can sketch what an alternative approach might look like.
Now to address your second point:
The verifier itself can fall within a spectrum and still be considered consistent with the Guaranteed Safe AI approach. While having a verifier that can produce a formal proof of the specified probability bound, which can be checked in a proof checker, would be very powerful, it’s worth noting that a procedure capable of computing the probability bound, for which we have quantifiable convergence rates, would also be regarded as a form of guaranteed quantitative safety verification. (See the levels in Section 3.4 of https://arxiv.org/abs/2405.06624.)
It is technically true that certain approaches fall under “Guaranteed Safety” according to the definition given in [3] and would not be subject to the limitations I describe in my post (for example, “Level 0: No quantitative guarantee is produced”, which technically counts as a GS guarantee according to [3]). However, my post is really only responding to the authors’ discussion of the levels of GS that would require formal verification, which is to say the levels that would be needed to get the interesting results against major AI threats that everyone here is really excited about. For results of that sort, at least based on the discussion in the paper and my initial understanding, formal verification with something like 100% certainty appears to be required (although, again, I would love for someone to provide counterexamples to this claim).
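For readers wondering what a verifier at one of the lower, non-proof levels of the spectrum might look like in practice, here is a minimal sketch (my own illustrative construction, not an example taken from [3]; the function names are hypothetical): a Monte Carlo check of a safety specification over a fixed input distribution, where Hoeffding’s inequality supplies a quantifiable bound on how far the estimated failure rate can be from the true one. Notably, a bound of this kind holds only for the sampled distribution, which is exactly why the adversarial argument above suggests it will fall short for the most interesting threat models.
```python
import math
import random

def monte_carlo_failure_bound(violates_spec, sample_input,
                              n_samples=10_000, confidence=0.99):
    """Estimate the probability that a system violates a safety specification
    on inputs drawn from a fixed distribution, and return a Hoeffding-style
    upper bound on the true failure rate that holds with the given confidence.

    violates_spec: callable returning True if the sampled behavior is unsafe.
    sample_input:  callable drawing one input from the deployment distribution.
    """
    failures = sum(bool(violates_spec(sample_input())) for _ in range(n_samples))
    estimate = failures / n_samples
    # Hoeffding: P(|estimate - true_rate| >= eps) <= 2 * exp(-2 * n * eps^2),
    # so choosing eps as below makes the bound hold with the given confidence.
    eps = math.sqrt(math.log(2.0 / (1.0 - confidence)) / (2.0 * n_samples))
    return estimate, min(1.0, estimate + eps)

# Toy stand-in for a real system: treat inputs above a threshold as unsafe.
est, upper = monte_carlo_failure_bound(
    violates_spec=lambda x: x > 0.995,
    sample_input=random.random,
)
print(f"estimated failure rate ~ {est:.4f}, 99%-confidence upper bound ~ {upper:.4f}")
```
Nothing here involves a proof checker; the “guarantee” is statistical and distribution-dependent, which is the gap between these lower levels and the formal-verification levels my post focuses on.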
Responding to your final thought:
Given the high stakes involved, I think that even if it turns out that the most advanced version of the Guaranteed Safe AI approach is not possible (which we cannot ascertain at this point), it is still both useful and necessary to frame the conversation and assess current approaches through this lens.
I would love to see this as well, and I believe this is exactly how the conversation about GS techniques should rationally evolve from here. That is, if we are not likely to obtain strong formal proofs and guarantees against AI threats in the near term, given the limitations discussed in this post, what are the best results we can hope to get out of the lower-numbered levels of the GS spectrums of modeling, specification, and verification? My guess is that these results will be much less exciting and will look a lot more like existing safety solutions such as RLHF, heuristics-based guardrails, and empirical studies of bad model behavior, but I think it is possible that some unexpected, workable new solutions may also come out of the exercise of exploring the lower, non-formal-verification levels of the GS spectrum(s).