I definitely want to see more detailed write-ups of your alignment agenda, and I agree with a lot of this comment.
To respond to some things:
having such a plan does not constitute a safety case, it is just the start of one.
Agree with this, primarily because there would be a lot more detail necessary to make it anywhere close to a safety case after accounting for empirical reality. If this was much more fleshed out and people gave good arguments for why this would avoid AI harming us through misalignment, which I only partially did in my post arguing against a list of lethalities, and is not enough to bring the chance of AI killing us all down to say, 0.1%, so others will need to provide much more detail on how they would know the AI was safe, or make a safety case, because my post alone is not enough for a safety case.
I apologize that it’s not a better writeup. I haven’t yet iterated on it or promoted it, in part because talking about how to align LLM agents in sufficient detail includes talking about how to build LLM agents, and why they’re likely to get all the way to real AGI. I haven’t wanted to speed up the race. I think this is increasingly irrelevant since many people and teams have the same basic ideas, so I’ll be publishing more detailed and clearer writeups soon.
Unfortunately, I believe that the time for your ideas having sped up the race counterfactually is already past, and I believe o1 is a sign that the LLM agent direction will soon be worked on by companies, so it’s worth writing up your complete thoughts on how to align LLM agents now.
I also responded to Max Tegmark, and my view is that I find the formal proof direction to be very intractable and also unnecessary, and I think the existential quantifier both applies to doom and alignment/control plans.
One thing I’ve changed my mind about on a little is I think that to the extent that formal proofs make us honest about which assumptions we use, rather than trying to prove non-trivial things, I’d be happy, and formal proof at this stage is best used to document our assumptions, not to prove anything.
Thanks! I also thought o1 will be plenty of reason for other groups to pursue more chain-of-thought research. Agents are more than that, but dependent on CoT actually being useful. o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse—which curiously are the same ones I think orgs will actually employ (this is in part because I don’t spend time thinking about what we “should” do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we’re unlikely to get any proofs. I don’t think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get it down to a .1% but I think we’ll probably take much larger risks than that. Humans are impulsive and cognitively limited and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to “searching under the light”.
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
To nitpick a little (though I believe it’s an important nitpick):
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
Agree with most of this, but I see one potential scenario where it may matter, and that is essentially the case where certain AIs are essentially superhumanly reliable and superhumanly capable at both coding and mathematics like Lean 4, but otherwise aren’t that good in many domains, being able to formally prove large codebases unhackable at the software level, and it only doing what it’s supposed to be doing like a full behavioral specification, where an important assumption is that the hardware does correct operations, and we only prove the software layer correct.
This is a domain that I think is both reasonably tractable to automate, given the ability to make arbitrary training data with similar techniques to self-play, mostly because you can almost fully simulate software and mathematics like in Lean, as well as being able to ensure that you can easily verify a solution is correct, and also plausibly important in enough worlds to justify strategies that rely on computer security reducing AI risk, as well as AI control agendas.
This is still very, very hard and labor intensive, which is why AIs mostly have to automate it, but with enough control/alignment strategies stacked on each other, I think this could actually work.
A few worked examples of formal proofs in software:
I agree that software is a potential use-case for closed form proofs.
l thought their example of making a protein-creating system (or maybe it was a RNA creator) fully safe was interesting, because it seems like knowing what’s “toxic” would require a complete understanding of not only biology, but a complete understanding of which changes to the body humans do and don’t want. If even their chosen example seems utterly impossible, it doesn’t speak well for how thoroughly they’ve thought out the general proposal.
But yes, in the software domain it might actually work—conditions like “only entities with access to these keys should be allowed access to this system” seem simple enough to actually define to make closed form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we should’ve forget that physical attacks are also quite possible).
The problem with their chosen domain mostly boils down to them either misestimating how hard quantifying all possible higher order behaviors the program doesn’t have, or they somehow have a solution and aren’t telling us that.
I like this comment as an articulation of the problem, and also some points about what formal proof systems can and can’t do:
If they knew of a path to being able to quantify all possible higher order behaviors in the proof system, I’d be more optimistic that their example would actually work IRL, but I don’t think this will happen, and be far more optimistic on software and hardware security overall.
I also like some of the hardware security discussion here, as this might well be used with formal proofs to make things even more secure and encrypted. (though formal proofs aren’t involved):
I agree that physical attacks means that it’s probably not possible to get formal proofs alone to state-level security, but I do think that it might allow them to jump up several levels in security from non-state actors, from being essentially able to control the AI through exfiltration to being unable to penetrate a code-base at all, at least until the world is entirely transformed.
I am of course assuming heavy use of AI labor here.
I definitely want to see more detailed write-ups of your alignment agenda, and I agree with a lot of this comment.
To respond to some things:
Agree with this, primarily because there would be a lot more detail necessary to make it anywhere close to a safety case after accounting for empirical reality. If this was much more fleshed out and people gave good arguments for why this would avoid AI harming us through misalignment, which I only partially did in my post arguing against a list of lethalities, and is not enough to bring the chance of AI killing us all down to say, 0.1%, so others will need to provide much more detail on how they would know the AI was safe, or make a safety case, because my post alone is not enough for a safety case.
Link below for completeness:
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities
On this:
Unfortunately, I believe that the time for your ideas having sped up the race counterfactually is already past, and I believe o1 is a sign that the LLM agent direction will soon be worked on by companies, so it’s worth writing up your complete thoughts on how to align LLM agents now.
I also responded to Max Tegmark, and my view is that I find the formal proof direction to be very intractable and also unnecessary, and I think the existential quantifier both applies to doom and alignment/control plans.
One thing I’ve changed my mind about on a little is I think that to the extent that formal proofs make us honest about which assumptions we use, rather than trying to prove non-trivial things, I’d be happy, and formal proof at this stage is best used to document our assumptions, not to prove anything.
More below:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-agi-entente-delusion#ST53bdgKERz6asrsi
Thanks! I also thought o1 will be plenty of reason for other groups to pursue more chain-of-thought research. Agents are more than that, but dependent on CoT actually being useful. o1 shows that it can be made quite useful.
I am currently working on clearer and more comprehensive summaries of the suite of alignment plans I endorse—which curiously are the same ones I think orgs will actually employ (this is in part because I don’t spend time thinking about what we “should” do if it has large alignment taxes or is otherwise unlikely to actually be employed).
I agree that we’re unlikely to get any proofs. I don’t think any industry making safety cases has ever claimed a proof of safety, only sufficient arguments, calculations, and evidence. I do hope we get it down to a .1% but I think we’ll probably take much larger risks than that. Humans are impulsive and cognitively limited and typically neither utilitarian nor longtermist.
Good point about formal proofs forcing us to be explicit about assumptions. I am highly suspicious of formal proofs after noticing that every single time I dug into the background assumptions, they violated some pretty obvious conditions of the problem they were purportedly addressing. They seem to really draw people to “searching under the light”.
It is worth noting that Omohundro & Tegmark’s recent high-profile paper really only endorsed provably safe systems that were not AGI, but to be used by AGIs (their example was a gene sequencer that would refuse to produce harmful compounds). I think even that is probably unworkable. And applying closed-form proofs to the behavior of an AGI seems impossible to me. But I’m not an expert, and I’d like to see someone at least try — as you say, it would at least clarify assumptions.
Agree with this.
To nitpick a little (though I believe it’s an important nitpick):
Agree with most of this, but I see one potential scenario where it may matter, and that is essentially the case where certain AIs are essentially superhumanly reliable and superhumanly capable at both coding and mathematics like Lean 4, but otherwise aren’t that good in many domains, being able to formally prove large codebases unhackable at the software level, and it only doing what it’s supposed to be doing like a full behavioral specification, where an important assumption is that the hardware does correct operations, and we only prove the software layer correct.
This is a domain that I think is both reasonably tractable to automate, given the ability to make arbitrary training data with similar techniques to self-play, mostly because you can almost fully simulate software and mathematics like in Lean, as well as being able to ensure that you can easily verify a solution is correct, and also plausibly important in enough worlds to justify strategies that rely on computer security reducing AI risk, as well as AI control agendas.
This is still very, very hard and labor intensive, which is why AIs mostly have to automate it, but with enough control/alignment strategies stacked on each other, I think this could actually work.
A few worked examples of formal proofs in software:
https://www.quantamagazine.org/formal-verification-creates-hacker-proof-code-20160920/
https://www.quantamagazine.org/how-the-evercrypt-library-creates-hacker-proof-cryptography-20190402/
I agree that software is a potential use-case for closed form proofs.
l thought their example of making a protein-creating system (or maybe it was a RNA creator) fully safe was interesting, because it seems like knowing what’s “toxic” would require a complete understanding of not only biology, but a complete understanding of which changes to the body humans do and don’t want. If even their chosen example seems utterly impossible, it doesn’t speak well for how thoroughly they’ve thought out the general proposal.
But yes, in the software domain it might actually work—conditions like “only entities with access to these keys should be allowed access to this system” seem simple enough to actually define to make closed form proofs relevant. And software security might make the world substantially safer in a multipolar scenario (although we should’ve forget that physical attacks are also quite possible).
The problem with their chosen domain mostly boils down to them either misestimating how hard quantifying all possible higher order behaviors the program doesn’t have, or they somehow have a solution and aren’t telling us that.
I like this comment as an articulation of the problem, and also some points about what formal proof systems can and can’t do:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
If they knew of a path to being able to quantify all possible higher order behaviors in the proof system, I’d be more optimistic that their example would actually work IRL, but I don’t think this will happen, and be far more optimistic on software and hardware security overall.
I also like some of the hardware security discussion here, as this might well be used with formal proofs to make things even more secure and encrypted. (though formal proofs aren’t involved):
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#e5uSAJYmbcgpa9sAv
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#TFmNy5MfkrvKTZGiA
https://www.lesswrong.com/posts/nP5FFYFjtY8LgWymt/#3Jnpurgrdip7rrK8v
I agree that physical attacks means that it’s probably not possible to get formal proofs alone to state-level security, but I do think that it might allow them to jump up several levels in security from non-state actors, from being essentially able to control the AI through exfiltration to being unable to penetrate a code-base at all, at least until the world is entirely transformed.
I am of course assuming heavy use of AI labor here.