I am a big fan of verification in principle, but don’t think it’s feasible in such broad contexts in practice. Most of these proposals are far too vague for me to have an opinion on, but I’d be happy to make bets with anyone interested that
(3) won’t happen (with nontrivial bounds) due to the difficulty of modeling nondeterministic GPUs, varying floating-point formats, etc.
(8) won’t be attempted, or will fail at some combination of design, manufacture, or just-being-pickable. This is a great proposal and a beautifully compact crux for the overall approach.
with some mutually agreed operationalization against the success of other listed ideas with a physical component
for up to 10k of my dollars at up to 10:1 odds (i.e. min your $1k), placed by end of 2024, to resolve by end of 2026 or as agreed. (I’m somewhat dubious about very long-term bets, for implementation reasons.) I’m also happy to discuss conditioning on a registered attempt.
For what it’s worth, I was offered and chose to decline co-authorship of Towards Guaranteed Safe AI. Despite my respect for the authors and strong support for this research agenda (as part of a diverse portfolio of safety research), I disagree with many of the central claims of the paper, and see it as uncritically enthusiastic about formal verification and overly dismissive of other approaches.
I do not believe that any of the proposed approaches actually guarantee safety. Safety is a system property,[1] which includes the surrounding social and technical context; and I see serious conceptual and in-practice-fatal challenges in defining both a world-model and a safety spec. (exercise: what could these look like for a biology lab? Try to block unsafe outcomes without blocking Kevin Esvelt’s research!)
I do think it’s important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification—whatever actually works in practice.
My standard recommendation on this is ch. 1-3 of Engineering a Safer World; I expect her recent book is excellent but haven’t yet read enough to recommend it. ↩︎
I agree with you that this feels like a ‘compact crux’ for many parts of the agenda. I’d like to take your bet; let me reflect on whether there’s any additional operationalization or conditioning I’d want.
FWIW, in Towards Guaranteed Safe AI we endorse this (the point about extending and complementing current techniques): “Moreover, while we have argued for the need for verifiable quantitative safety guarantees, it is important to note that GS AI may not be the only route to achieving such guarantees. An alternative approach might be to extract interpretable policies from black-box algorithms via automated mechanistic interpretability… it is ultimately an empirical question whether it is easier to create interpretable world models or interpretable policies in a given domain of operation.”
quick proposals:
I win at the end of 2026 if there has not been a formally-verified design for a mechanical lock, OR the verification does not establish that it cannot be mechanically picked, OR fewer than three consistent physical instances have been manufactured. (e.g. a total of three that includes prototypes or other designs doesn’t count)
You win if at the end of 2027, there have been credible and failed expert attempts to pick such a lock (e.g. an open challenge at Defcon). I win if there is a successful attempt.
Bet resolves neutral, and we each donate half our stakes to a mutually-agreed charity, if it’s unclear whether production actually happened, or if there were no credible attempts to pick a verified lock.
Any disputes resolved by the best judgement of an agreed-in-advance arbiter; I’d be happy with the LessWrong team if you and they also agree.
Unsolicited suggestion: it is probably useful for y’all to define further what “pick a lock” means—e.g., if someone builds a custom defeat device of some sort, one that does something non-destructive but engages in a mechanical operation very surprising to someone thinking of traditional lock-picking methods—does that count?
(I think you’d probably say yes, so long as the device isn’t, e.g., a robot arm that nondestructively grabs the master key for the lock out of Zac’s pocket and inserts it into the lock, but some sort of defining-in-advance would likely help.)
Nonetheless, I think this would be awesome as an open challenge at Defcon (I suspect you can convince them to Black Badge the challenge...)
Well, a lot of the difficulty of any kind of formal proof is indeed that any specification you come up with will have holes you didn’t anticipate. Coming up with a solid definition of what “pick a lock” means is a large part of the difficulty of this formal verification endeavor, so I don’t think it makes sense to try to get that out of the way before the bet is even made. I think deferring to a trusted third-party arbiter the question of whether the definition chosen for the proof is adequate would be a better choice.
I mean, I think it’s worth doing an initial loose and qualitative discussion to make sure that you’re thinking about overlapping spaces conceptually. Otherwise, not worth the more detailed effort.
This seems mostly good to me, thank you for the proposals (and sorry for my delayed response, this slipped my mind).
Why the condition that at least three consistent physical instances be manufactured? It doesn’t seem relevant to the core contention, and if someone prototyped a single lock using a GS AI approach but didn’t figure out how to manufacture it at scale, I’d still consider it to have been an important experiment.
Besides that, I’d agree to the above conditions!
I don’t think that a thing you can only manufacture once is a practically usable lock; having multiple is also practically useful to facilitate picking attempts and in case of damage—imagine that a few hours into an open pick-this-lock challenge, someone bent a part such that the key no longer opens the lock. I’d suggest resolving neutral in this case, as we only saw a partial attempt.
Other conditions:
I think it’s important that the design could have at least a thousand distinct keys which are non-pickable. It’s fine if the theoretical keyspace is larger so long as the verified-secure keyspace is large enough to be useful, and distinct keys/locks need not be manufactured so long as they’re clearly possible.
I expect the design to be available in advance to people attempting to pick the lock, just as the design principles and detailed schematics of current mechanical locks are widely known—security through obscurity would not demonstrate that the design is better, only that as-yet-secret designs are harder to exploit.
I nominate @raemon as our arbiter, if both he and you are willing, and the majority vote or nominee of the Lightcone team if Raemon is unavailable for some reason (and @habryka approves that).
(Note: for @mentioning to work, you need to be in the LessWrong Docs editor, or in markdown you actually type out [@Raemon](https://www.lesswrong.com/users/raemon?mention=user) (the “@” in “@Raemon” doesn’t actually do anything; the important part is that the url has mention=user). We should probably try to make this work more intuitively in markdown, but it’s not quite obvious how to do it.)
I think in practice my take here would probably be something like a deferral to the Lightcone Team majority vote, since I’m not very informed on this field, but I’m happy to own the metacognition of making sure that happens and sanity checking the results.
That works for me—thanks very much for helping out!
@Raemon works for me; and I agree with the other conditions.
I think we’re agreed then, if you want to confirm the size? Then we wait for 2027!
Given your rationale, I’m on board with requiring that 3 or more consistent physical instances of the lock have been manufactured.
Let’s ‘lock’ it in.
Nice! I look forward to seeing how this resolves.
Ah, by ‘size’ I meant the stakes, not the number of locks—did you want to bet the maximum $1k against my $10k, or some smaller proportional amount?
Ah, gotcha. Yes, let’s do my $1k against your $10k.
Locked in! Whichever way this goes, I expect to feel pretty good about both the process and the outcome :-)
I’m afraid I’m not into betting. I’m not saying these projects will happen, but rather that they should happen. And I’m not saying they are easy, but rather that they are important and that I don’t see any huge barriers to doing them.
In general, non-deterministic hardware isn’t a problem for formalization. We’re not trying to simulate these systems; we’re trying to put bounds on their possible behaviors. Proofs generally work with enclosures that are guaranteed to contain the actual behavior. As far as I know, the various floating-point formats are all precisely defined, so there’s no problem there. I know that some processor specs stated that their implementations of functions like sin and cos were only guaranteed up to the last bit or so. That’s an annoyance but not a real problem for formalization. A simple approach is to maintain intervals which are guaranteed to contain the actual values and to prove that the output intervals don’t overlap the unsafe region.
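To make the enclosure idea concrete, here is a minimal sketch (plain Python, purely illustrative and not from the thread; a real verified tool would also round every interval endpoint outward): propagate intervals through a computation and check that the output interval cannot reach an unsafe region.

```python
# Illustrative sketch of "maintain intervals and prove the output can't reach
# the unsafe region". A real verified implementation would use outward
# (directed) rounding on every endpoint; this sketch ignores that detail.

from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

def provably_below(output: Interval, unsafe_threshold: float) -> bool:
    """True iff every value in the enclosure is strictly below the unsafe threshold."""
    return output.hi < unsafe_threshold

# Hypothetical example: sensor in [0.9, 1.1], gain in [1.9, 2.1], bias in [-0.05, 0.05];
# the unsafe region is any output >= 3.0.
sensor, gain, bias = Interval(0.9, 1.1), Interval(1.9, 2.1), Interval(-0.05, 0.05)
out = sensor * gain + bias            # enclosure: roughly [1.66, 2.36]
print(out, provably_below(out, 3.0))  # True: the unsafe region is unreachable
```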
Probabilistic programs do provide greater challenges. It’s not a problem to formally model and prove properties of probability distributions. But it’s often no longer possible to provide 100% guarantees of desired safety properties. The approach of precise probabilistic theories like “Probably approximately correct learning” (https://en.wikipedia.org/wiki/Probably_approximately_correct_learning) is to provide a provable bound on the probability of failure.
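For reference, this is the flavor of guarantee PAC learning provides (the standard bound for a finite hypothesis class in the realizable setting; generic notation, not taken from the thread):

```latex
% With probability at least 1 - \delta over the draw of m i.i.d. training samples,
% any hypothesis h from a finite class H that is consistent with the sample has
% true error at most \epsilon, provided the sample is large enough:
\Pr\big[\, \mathrm{err}(h) \le \epsilon \,\big] \;\ge\; 1 - \delta
\qquad \text{whenever} \qquad
m \;\ge\; \frac{1}{\epsilon}\Big(\ln |H| + \ln \tfrac{1}{\delta}\Big).
```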
There is a subtlety in adversarial situations: there is a difference between uncertainty the adversary can’t affect, for which an ordinary probabilistic analysis suffices, and apparently random variables that the adversary can influence.
For example, I believe we need to create formal models of all of the engineering disciplines (e.g. mechanical, electrical, and chemical engineering). The rules that those engineers use are already quite formal, and there are heuristic arguments as to why they are grounded in fundamental physics. But in an adversarial context, an adversary may be able to influence a deeper level of physics in a way that breaks the assumptions of the model. For example, mechanical engineers have models of the elastic properties of steel beams, but if an adversary can alter the manufacture of those beams in a way that makes them vulnerable to certain kinds of failure, then the mechanical engineer’s assumptions can be invalidated. So the formal models also need to model potential vulnerabilities in the assumptions of the higher-level models.
I believe that these developments are not optional. It’s just a question of whether humanity prepares now for what’s coming or waits until after the basic assumptions of our infrastructure and society are upended.
For actual inference stacks in use (e.g. llama-3-405b with 8-bit floats), interval propagation will blow up massively and result in vacuous bounds. So you’ll minimally have to assume that floating-point approximation is non-adversarial, a.k.a. random (because if it actually was adversarial, you would probably die!).
(My supporting argument for this is that you see huge blow ups on interval arguments for even tiny models on toy tasks[1] and that modern inference stacks have large amounts of clearly observable floating point non-determinism, which means that the maximum amount of deviation would actually have strong effects if all errors (including across token positions) were controlled by an adversary.)
Though note that the interval propagation issues in this case aren’t from floating point rounding and are instead from imperfections in the model weights implementing the algorithm.
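As a generic illustration of the floating-point non-determinism being referred to (ordinary Python doubles, not any particular inference stack): reduction order alone changes the rounded result, which is why kernels that sum in different orders across runs can produce bitwise-different outputs.

```python
# Floating-point addition is not associative: the same values summed in a
# different order can round to a different result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)   # 0.6000000000000001
print(a + (b + c))   # 0.6

# With mixed magnitudes the gap can be far larger than one ulp:
print(sum([1e16, 1.0, 1.0, -1e16]))   # 0.0  (the 1.0s are absorbed before the cancellation)
print(sum([1.0, 1.0, 1e16, -1e16]))   # 2.0  (the 1.0s are added together first and survive)
```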
Intervals are often a great simple form of “enclosure” in continuous domains. For simple functions there is also “interval arithmetic”, which cheaply produces a bounding interval on the output of a function given intervals on its inputs: https://en.wikipedia.org/wiki/Interval_arithmetic But, as you say, for complex functions it can blow up. For a simple example of why, consider the function “f(x)=x-x” evaluated on the input interval [0,1]. In the simplest interval arithmetic, the interval for subtraction has to bound the worst possible members of the intervals of its two inputs. In this case that would be a lower bound of “0-1” and an upper bound of “1-0”, producing the resulting interval [-1,1]. But, of course, “x-x” is always 0, so this is a huge over-approximation. People have developed all kinds of techniques for capturing the correlations between variables when evaluating circuits on intervals. And you can always shrink the error by splitting the input intervals and doing “branch and bound”. But all of those are just particular implementation choices in proving bounds on the output of the function. Advanced AI theorem provers (like AlphaProof) can use very sophisticated techniques to accurately get the true bound on the output of a function.
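A runnable sketch of the example above (plain Python, illustrative only): naive interval evaluation of f(x) = x - x on [0, 1] returns [-1, 1], and splitting the input interval (“branch and bound”) shrinks, but never quite eliminates, the over-approximation.

```python
# Naive interval arithmetic loses the correlation between the two occurrences
# of x in f(x) = x - x; branch-and-bound splitting shrinks the resulting
# over-approximation in proportion to the number of pieces.

def sub(a, b):
    """Interval subtraction: [a0, a1] - [b0, b1] = [a0 - b1, a1 - b0]."""
    return (a[0] - b[1], a[1] - b[0])

def f(x):
    """f(x) = x - x, evaluated naively so the two copies of x are uncorrelated."""
    return sub(x, x)

print(f((0.0, 1.0)))   # (-1.0, 1.0), although the true range is exactly {0}

def branch_and_bound(lo, hi, pieces):
    """Split [lo, hi] into equal sub-intervals, evaluate f on each, take the union."""
    bounds = []
    for i in range(pieces):
        a = lo + (hi - lo) * i / pieces
        b = lo + (hi - lo) * (i + 1) / pieces
        bounds.append(f((a, b)))
    return (min(lo_ for lo_, _ in bounds), max(hi_ for _, hi_ in bounds))

print(branch_and_bound(0.0, 1.0, 10))    # roughly (-0.1, 0.1)
print(branch_and_bound(0.0, 1.0, 1000))  # roughly (-0.001, 0.001)
```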
But, it may be that it’s not a fruitful approach to try to bound the behavior of complex neural nets such as transformers. In our approach, we don’t need to understand or constrain a complex AI generating a solution or a control policy. Rather, we require the AI to generate a program, control policy, or simple network for taking actions in the situation of interest. And we force it to generate a proof that it satisfies given safety requirements. If it can’t do that, then it has no business taking actions in a dangerous setting.
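As a toy illustration of the “program plus proof” shape being proposed (a hypothetical example in Lean 4 with Mathlib, not from the paper or the thread): a trivially simple “control policy” that clamps its output, together with a machine-checked proof that the output always lies in the safe region.

```lean
import Mathlib

-- Hypothetical toy "control policy": clamp the requested actuation to [-100, 100].
def control (x : ℤ) : ℤ := max (-100) (min 100 x)

-- The "safety spec": the emitted actuation never leaves the safe region.
theorem control_in_safe_region (x : ℤ) : -100 ≤ control x ∧ control x ≤ 100 := by
  unfold control
  exact ⟨le_max_left _ _, max_le (by norm_num) (min_le_left _ _)⟩
```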
This seems near certain to be cripplingly uncompetitive[1] even with massive effort on improving verification.
If applied to all potentially dangerous applications.
I agree you can do better than naive interval propagation by taking into account correlations. However, it will be tricky to get a much better bound without the time complexity ballooning (tracking all possible correlations requires exponential time).
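For what it’s worth, one standard middle ground is affine arithmetic, which tracks linear correlations through shared noise symbols at polynomial cost (it still over-approximates non-linear operations); a minimal sketch, purely illustrative:

```python
# Minimal affine-arithmetic sketch: a value is c0 + sum_i ci * eps_i with each
# eps_i ranging over [-1, 1]. Shared noise symbols preserve linear correlations,
# so x - x collapses to exactly 0 instead of the naive interval result [-1, 1].

class Affine:
    def __init__(self, center, terms=None):
        self.center = center
        self.terms = dict(terms or {})   # noise symbol -> coefficient

    def __sub__(self, other):
        terms = dict(self.terms)
        for sym, coef in other.terms.items():
            terms[sym] = terms.get(sym, 0.0) - coef
        return Affine(self.center - other.center, terms)

    def to_interval(self):
        radius = sum(abs(c) for c in self.terms.values())
        return (self.center - radius, self.center + radius)

# x ranges over [0, 1]: center 0.5 plus one noise symbol with coefficient 0.5.
x = Affine(0.5, {"eps_x": 0.5})
print((x - x).to_interval())   # (0.0, 0.0): the self-correlation is tracked exactly
```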
More strongly, I think that if an adversary controlled the non-determinism (e.g. summation order) in current efficient inference setups, they would actually be able to influence the AI to an actually dangerous extent—we are likely to depend on this non-determinism being non-adversarial (which is a reasonable assumption to make).
(And you can’t prove a false statement...)
See also heuristic arguments which try to resolve this sort of issue by assuming a lack of structure.
For what it’s worth, I do see GSAI as largely a swiss-cheesey (defence-in-depth) worldview. Though I can see how you might read some of the authors involved to be implying otherwise! I should knock out a post on this.
Mod note: this comment was originally written in Markdown in the rich text editor. I converted it. Let me know if that’s not what you want, Zac.
discussion of the bet in Aug 2024 Progress in GSAI newsletter