Thanks! I also thought o1 will be plenty of reason for other groups to pursue more chain-of-thought research. Agents are more than that, but dependent on CoT actually being useful. o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse—which curiously are the same ones I think orgs will actually employ (this is in part because I don’t spend time thinking about what we “should” do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we’re unlikely to get any proofs. I don’t think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get it down to a .1% but I think we’ll probably take much larger risks than that. Humans are impulsive and cognitively limited and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to “searching under the light”.
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
To nitpick a little (though I believe it’s an important nitpick):
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
Agree with most of this, but I see one potential scenario where it may matter, and that is essentially the case where certain AIs are essentially superhumanly reliable and superhumanly capable at both coding and mathematics like Lean 4, but otherwise aren’t that good in many domains, being able to formally prove large codebases unhackable at the software level, and it only doing what it’s supposed to be doing like a full behavioral specification, where an important assumption is that the hardware does correct operations, and we only prove the software layer correct.
This is a domain that I think is both reasonably tractable to automate, given the ability to make arbitrary training data with similar techniques to self-play, mostly because you can almost fully simulate software and mathematics like in Lean, as well as being able to ensure that you can easily verify a solution is correct, and also plausibly important in enough worlds to justify strategies that rely on computer security reducing AI risk, as well as AI control agendas.
This is still very, very hard and labor intensive, which is why AIs mostly have to automate it, but with enough control/alignment strategies stacked on each other, I think this could actually work.
A few worked examples of formal proofs in software:
I agree that software is a potential use-case for closed form proofs.
l thought their example of making a protein-creating system (or maybe it was a RNA creator) fully safe was interesting, because it seems like knowing what’s “toxic” would require a complete understanding of not only biology, but a complete understanding of which changes to the body humans do and don’t want. If even their chosen example seems utterly impossible, it doesn’t speak well for how thoroughly they’ve thought out the general proposal.
But yes, in the software domain it might actually work—conditions like “only entities with access to these keys should be allowed access to this system” seem simple enough to actually define to make closed form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we should’ve forget that physical attacks are also quite possible).
The problem with their chosen domain mostly boils down to them either misestimating how hard quantifying all possible higher order behaviors the program doesn’t have, or they somehow have a solution and aren’t telling us that.
I like this comment as an articulation of the problem, and also some points about what formal proof systems can and can’t do:
If they knew of a path to being able to quantify all possible higher order behaviors in the proof system, I’d be more optimistic that their example would actually work IRL, but I don’t think this will happen, and be far more optimistic on software and hardware security overall.
I also like some of the hardware security discussion here, as this might well be used with formal proofs to make things even more secure and encrypted. (though formal proofs aren’t involved):
I agree that physical attacks means that it’s probably not possible to get formal proofs alone to state-level security, but I do think that it might allow them to jump up several levels in security from non-state actors, from being essentially able to control the AI through exfiltration to being unable to penetrate a code-base at all, at least until the world is entirely transformed.
I am of course assuming heavy use of AI labor here.
Thanks! I also thought o1 will be plenty of reason for other groups to pursue more chain-of-thought research. Agents are more than that, but dependent on CoT actually being useful. o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse—which curiously are the same ones I think orgs will actually employ (this is in part because I don’t spend time thinking about what we “should” do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we’re unlikely to get any proofs. I don’t think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get it down to a .1% but I think we’ll probably take much larger risks than that. Humans are impulsive and cognitively limited and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to “searching under the light”.
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
Agree with this.
To nitpick a little (though I believe it’s an important nitpick):
Agree with most of this, but I see one potential scenario where it may matter, and that is essentially the case where certain AIs are essentially superhumanly reliable and superhumanly capable at both coding and mathematics like Lean 4, but otherwise aren’t that good in many domains, being able to formally prove large codebases unhackable at the software level, and it only doing what it’s supposed to be doing like a full behavioral specification, where an important assumption is that the hardware does correct operations, and we only prove the software layer correct.
This is a domain that I think is both reasonably tractable to automate, given the ability to make arbitrary training data with similar techniques to self-play, mostly because you can almost fully simulate software and mathematics like in Lean, as well as being able to ensure that you can easily verify a solution is correct, and also plausibly important in enough worlds to justify strategies that rely on computer security reducing AI risk, as well as AI control agendas.
This is still very, very hard and labor intensive, which is why AIs mostly have to automate it, but with enough control/alignment strategies stacked on each other, I think this could actually work.
A few worked examples of formal proofs in software:
https://www.quantamagazine.org/formal-verification-creates-hacker-proof-code-20160920/
https://www.quantamagazine.org/how-the-evercrypt-library-creates-hacker-proof-cryptography-20190402/
I agree that software is a potential use-case for closed form proofs.
l thought their example of making a protein-creating system (or maybe it was a RNA creator) fully safe was interesting, because it seems like knowing what’s “toxic” would require a complete understanding of not only biology, but a complete understanding of which changes to the body humans do and don’t want. If even their chosen example seems utterly impossible, it doesn’t speak well for how thoroughly they’ve thought out the general proposal.
But yes, in the software domain it might actually work—conditions like “only entities with access to these keys should be allowed access to this system” seem simple enough to actually define to make closed form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we should’ve forget that physical attacks are also quite possible).
The problem with their chosen domain mostly boils down to them either misestimating how hard quantifying all possible higher order behaviors the program doesn’t have, or they somehow have a solution and aren’t telling us that.
I like this comment as an articulation of the problem, and also some points about what formal proof systems can and can’t do:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
If they knew of a path to being able to quantify all possible higher order behaviors in the proof system, I’d be more optimistic that their example would actually work IRL, but I don’t think this will happen, and be far more optimistic on software and hardware security overall.
I also like some of the hardware security discussion here, as this might well be used with formal proofs to make things even more secure and encrypted. (though formal proofs aren’t involved):
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#e5uSAJYmbcgpa9sAv
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#TFmNy5MfkrvKTZGiA
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#3Jnpurgrdip7rrK8v
I agree that physical attacks means that it’s probably not possible to get formal proofs alone to state-level security, but I do think that it might allow them to jump up several levels in security from non-state actors, from being essentially able to control the AI through exfiltration to being unable to penetrate a code-base at all, at least until the world is entirely transformed.
I am of course assuming heavy use of AI labor here.