Thanks Noosphere89 for your long and thoughtful comment! I don’t have time to respond to everything before putting my 1-year-old to bed, but here are some brief comments.
1) Although I appreciate that you wrote out a proposed AGI alignment plan, I think you’ll agree that it contains no theorems or proofs, or even quantitative risk bounds. Since we insist on quantitative risk bounds before allowing much less dangerous technologies such as airplanes and nuclear reactors, my view is that it would be crazy to launch AGI without quantitative risk bounds, especially when you’re dealing with a super-human mind that might actively optimize against vulnerabilities of the alignment system. As you know, rigorously ensuring retained alignment under recursive self-improvement is extremely difficult. For example, MIRI had highly talented researchers work on this for many years without completing the task.
2) I take your point about fear of 1984 vs. fear of extinction. However, if someone assigns P(1984) >> P(extinction) and there’s no convincing plan for preventing AGI loss of control, then I’d argue that it’s still crazy for them (or for China) to build AGI. So they’d both forge ahead with increasingly powerful yet controllable tool AGI, presumably remaining in today’s mutually assured destruction paradigm, where neither has an incentive to try to conquer the other.
I have yet to hear a version of the “but China!” argument that makes any sense if you believe that the AGI race is a suicide race rather than a traditional arms race. Those I hear making it are usually people who also dismiss the AGI extinction risk. If anything, the current Chinese leadership seems more concerned about AI x-risk than Western leaders.
I can wait for your response, so don’t take this as meaning you need to reply immediately, but I do have some comments.
Once you’re done with everything else, I invite you to respond to this comment.
In response to 1): I definitely didn’t give quantitative risk bounds for my proposal, for a couple of reasons:
1. My alignment proposal would require a lot more work and concreteness than I was able to put in; my goal was to make a proposal concrete enough that other people could fill in the details of how it could actually be done.
Then again, that’s why they’re paid the big bucks and I’m not.
2. I am both much more skeptical of formal proof/verification for AGI safety than you are, and I also believe that formal proofs are unnecessary for getting high confidence in an alignment plan working (though I do think that formal proof may, emphasis on may, be useful for AI control metastrategies).
For example, I currently consider the Provably Safe AI agenda by Steve Omohundro and Ben Goldhaber to be far too ambitious. The biggest issue, IMO, is that the things they are promising rely on being able to quantify over all the higher-order behaviors a system doesn’t have, which is beyond the range of currently extrapolated formalization techniques. Relatedly, Zach Hatfield-Dodds and Ben Goldhaber have a bet on whether three locks that can’t be illegitimately unlocked can be designed via formal proof: Zach Hatfield-Dodds bet no, Ben Goldhaber bet yes, and the bet resolves in 2027.
See these links for more:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#Ku3X4QDBSyZhrtxkM
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#jjFsFmLbKNtMRyttK
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#Ght9hffumLkjxxNaw
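To make that contrast concrete, here is a minimal sketch (mine, not anything from the Provably Safe AI agenda or the lock bet) of the kind of positive safety property that is tractable: a tiny, fully specified lock whose invariant can be checked exhaustively, model-checking style. The state machine, actions, and code value are all made up; note that “the system has no unwanted higher-order behaviors of any kind” is not a property of this shape.

```python
# Toy sketch: exhaustively checking a safety invariant of a tiny, fully
# specified lock state machine. Everything here is invented for illustration.
from itertools import product

CORRECT_CODE = 42


def step(state, action):
    """state = (unlocked, last_code_entered); action = (kind, value)."""
    unlocked, last_code = state
    kind, value = action
    if kind == "enter_code":
        return (unlocked, value)
    if kind == "turn_handle":
        # The handle opens the lock only if the correct code was entered last.
        return (unlocked or last_code == CORRECT_CODE, last_code)
    return state


def violates_safety(trace):
    """Safety property: the lock never unlocks unless the correct code was entered."""
    state = (False, None)
    entered_correct = False
    for action in trace:
        if action == ("enter_code", CORRECT_CODE):
            entered_correct = True
        state = step(state, action)
        if state[0] and not entered_correct:
            return True
    return False


# Exhaustive check over every action sequence up to length 4.
actions = [("enter_code", 41), ("enter_code", 42), ("turn_handle", None)]
assert not any(
    violates_safety(trace)
    for n in range(1, 5)
    for trace in product(actions, repeat=n)
)
print("Safety invariant holds on all traces up to length 4.")
```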
I support more use of quantitative risk estimation in general, and would plausibly support a policy forcing AI developers to estimate that their AI has, say, less than a 1% chance of ending the world. But I don’t think it’s crazy to forgo quantitative formal proofs of AI alignment/control at this stage and instead argue for more swiss-cheese-style safety.
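To spell out the arithmetic behind the swiss-cheese picture, here is a minimal sketch with made-up layer names and failure probabilities; the multiplication step assumes the layers fail independently, which is itself a strong and load-bearing assumption.

```python
# Swiss-cheese-style risk estimate (illustrative numbers only).
# Each layer is a safety measure with its own chance of missing a dangerous
# action; catastrophe requires every layer to fail at once.
layers = {
    "training-time alignment": 0.10,
    "output filtering": 0.20,
    "human oversight": 0.30,
    "incident response": 0.50,
}

p_all_fail = 1.0
for name, p_fail in layers.items():
    p_all_fail *= p_fail  # valid only if failures are independent

print(f"P(every layer fails) = {p_all_fail:.4f}")  # 0.0030, i.e. 0.3%
# If failures are correlated (e.g. a capable model defeats several layers for
# the same underlying reason), the true risk can be far higher than this product.
```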
Another thing that influences me: I make basically zero update from MIRI’s failure to solve the AI alignment problem toward other groups failing, mostly because I think they made far less progress than basically every other group, to the point where I think Pretraining from Human Feedback made more progress on the alignment problem than basically all of MIRI’s work. Their plans were IMO fairly doomed even in a hypothetical world where alignment is easy, since they constrained their techniques too much and didn’t touch reality at all.
So I disagree with this being a substantial update:
For example, MIRI had highly talented researchers work on this for many years without completing the task.
On this claim specifically:
2) I take your point about fear of 1984 vs. fear of extinction. However, if someone assigns P(1984) >> P(extinction) and there’s no convincing plan for preventing AGI loss of control, then I’d argue that it’s still crazy for them (or for China) to build AGI. So they’d both forge ahead with increasingly powerful yet controllable tool AGI, presumably remaining in today’s mutually assured destruction paradigm, where neither has an incentive to try to conquer the other.
I have yet to hear a version of the “but China!” argument that makes any sense if you believe that the AGI race is a suicide race rather than a traditional arms race. Those I hear making it are usually people who also dismiss the AGI extinction risk. If anything, the current Chinese leadership seems more concerned about AI x-risk than Western leaders.
I agree that this is not an argument for AI companies to race to AGI, but I consider your article’s evidence that China is more concerned than the West to be reasonably weak, and for that reason I think it could plausibly fail to convince someone whose P(1984) is >> P(extinction).
When it comes to formal verification, I’m curious what you think about the heuristic arguments line of research that ARC is pursuing:
https://www.lesswrong.com/posts/QA3cmgNtNriMpxQgo/research-update-towards-a-law-of-iterated-expectations-for
It isn’t formal verification in the same sense of the word, but rather probabilistic verification, if that makes sense.
You could then apply something like control-theory methods to ensure that the expected divergence from the heuristic is less than a certain percentage in different places. In the limit it seems to me that this could converge toward formal verification proofs; it’s almost like swiss-cheese style at the model level?
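I’m not sure this is exactly what ARC means by heuristic arguments, but here is a rough sketch of the “bound the expected divergence” idea as I picture it, assuming i.i.d. sampling and a bounded per-sample divergence (both of which are assumptions); the model and heuristic are toy stand-ins.

```python
# Rough sketch of "probabilistic verification": bound the expected divergence
# between a model and a cheap heuristic surrogate from i.i.d. samples.
# Hoeffding's inequality needs the per-sample divergence to be bounded
# (clipped to [0, 1] here) and the samples to be i.i.d. -- both assumptions.
import math
import random

def model(x: float) -> float:          # stand-in for the real network
    return math.tanh(3 * x)

def heuristic(x: float) -> float:      # cheap surrogate / heuristic argument
    return max(-1.0, min(1.0, 3 * x))

def divergence(x: float) -> float:
    return min(abs(model(x) - heuristic(x)), 1.0)  # clipped to [0, 1]

n = 10_000
samples = [divergence(random.uniform(-1, 1)) for _ in range(n)]
mean_div = sum(samples) / n

delta = 0.01  # allow a 1% chance that the bound itself is wrong
hoeffding_slack = math.sqrt(math.log(1 / delta) / (2 * n))
print(f"With prob >= 99%: E[divergence] <= {mean_div + hoeffding_slack:.3f}")
```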
(Yes, this comment is a bit random with respect to the rest of the context, but I find it an interesting question for control in terms of formal verification, and it seemed like you might have some interesting takes here.)
You know what, I’ve identified a scenario where formal verification is both tractable and helps reduce AI risk: making our codebases far more secure and controllable, assuming heavy AI automation of mathematics and coding is achieved (which I expect to happen before AI can do everything else, since it’s a domain with strong feedback loops, where verification against ground truth is easy and you can get very large amounts of data to continually improve).
Here is the link:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#sgH9iCyon55yQyDqF
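As a small illustration of the “easy verification against ground truth” point, here is a sketch using property-based testing with the hypothesis library; the candidate function and the properties are made up for the example, and a real pipeline would check much richer specifications.

```python
# Property-based check: a candidate implementation is verified against a
# slow-but-obviously-correct reference on randomly generated inputs. In code
# and math this kind of ground-truth signal can be generated cheaply and at
# scale, which is part of what makes the domain so automatable.
from hypothesis import given, strategies as st

def candidate_sort(xs: list[int]) -> list[int]:
    # stand-in for AI-written code we want to trust
    return sorted(xs)

@given(st.lists(st.integers()))
def test_matches_reference(xs):
    result = candidate_sort(xs)
    assert result == sorted(xs)      # agrees with the reference implementation
    assert sorted(result) == result  # output is actually sorted
    assert len(result) == len(xs)    # nothing dropped or invented

if __name__ == "__main__":
    test_matches_reference()  # hypothesis generates and runs many random cases
    print("All generated cases passed.")
```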
I’ve also discussed the value and limits of formal proofs in another comment below. Short answer: it’s probably really helpful in an infrastructure sense, but not so much as a means to make anything other than software and mathematics safe and formally specified (which would be a huge win in itself if we could do it). We will not be able to prove that, say, a piece of software isn’t a threat to something else in the world entirely, and the same applies to biology, e.g. determining whether a gene or virus will harm humans, mostly because we don’t have a path to quantifying all the possible higher-order behaviors a system doesn’t have:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#2GDjfZTJ8AZrh9i7Y
My take on this is that I’d be interested to see how the research goes. There may be value in this approach, and I think it may be a useful way to get a quantitative estimate/bound in the future, because it relaxes its goals.
I’d like to see what eventually happens with this research direction:
Could we reliably give heuristic arguments for neural networks when proofs failed, or is it too hard to provide relevant arguments?
On formal verification/proof itself, I do want to say that I think the most useful application is not proving non-trivial things, but rather keeping ourselves honest about the assumptions we are using.
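A toy example of what I mean, entirely my own: even for a routine everyone understands, writing the spec down is what forces the hidden assumptions into the open.

```python
# The interesting output of "verifying" this function isn't the proof that
# binary search works; it's the list of assumptions the guarantee depends on,
# which the code alone never states.
def binary_search(xs: list[int], target: int) -> int:
    """Return an index i with xs[i] == target, or -1 if target is absent.

    Assumptions the correctness argument silently relies on:
      * xs is sorted in non-decreasing order
      * element comparisons are consistent (no exotic __lt__ behavior)
    Writing the spec is what surfaces these; the implementation doesn't.
    """
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

assert binary_search([1, 3, 5, 7], 5) == 2
assert binary_search([1, 3, 5, 7], 4) == -1
```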