Some thoughts on the post:
I think one very important crux that seems worth unearthing further is whether loss-of-control risk is higher/more dangerous than ~eternal authoritarianism.
While Dario Amodei puts p(Doom) at 5-25%, I suspect his p(Eternal Authoritarianism) is closer to 50-75% conditional on no intervention to change that, and I think people who are very concerned about AI risk tend to view loss-of-control risk as very high while viewing eternal-authoritarianism risk as much lower.
On the narrower subclass of the alignment problem you mentioned:
There has been a major research effort on “alignment” redefined in a much narrower way: ensuring that a large language model (LLM) does not produce outputs deemed harmful, such as offensive slurs or bioweapon instructions. But most work on this has involved only training LLMs not to say bad things rather than not to want bad things. This is like training a hard-core Nazi never to say anything revealing his Nazi views – does this really solve the problem, or simply produce deceptive AI? Many AI systems have already been found to be deceptive, and current LLM blackbox evaluation techniques are likely to be inadequate. Even if alignment can be achieved in this strangely narrowly defined sense, it is clearly a far cry from what is needed: aligning the goals of AGI.
I think the solutions that companies use, like RLHF, are mostly bad, but I’m not going to say they’re maximally bad: while RLHF violates the principle that alignment data should come during or before capabilities data (since it’s a post-training operation), I do think that in the easiest worlds we come out fine despite not having much dignity.
Also, on the paper on AI systems being deceptive: while I like the paper for what it is, it doesn’t give nearly as much evidence on AI risk as it might appear. Looking at the examples, two were cases where the AI companies didn’t even try to prevent deceptive strategies; the GPT-4 examples are capability evaluations, not alignment evaluations (capability evaluations are very good, I just think a lot of the press coverage will predictably mislead people); one example comes from what I suspect is simulated evolution, which is of no relevance to AGI; one example could plausibly be solved by capabilities people alone; and only two examples give any evidence at all for the claim that AI will be deceptive.
Overall, this doesn’t support the claim that you made here.
There is other work, like Pretraining from Human Feedback, which IMO is both conceptually and empirically better than RLHF, mostly because it integrates feedback into training and prevents the failure mode of a very capable AI gaming/reward-hacking the training process, as described here (a rough sketch of the idea follows the links):
https://www.lesswrong.com/posts/NG6FrXgmqPd5Wn3mh/trying-to-disambiguate-different-questions-about-whether#pWBwJ3ysxnwn9Czyd
https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1
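To make concrete what I mean by feedback being integrated into training rather than bolted on afterwards, here is a minimal sketch of the conditional-training idea as I understand it. This is my own illustration, not code from the PHF paper; `preference_score`, the control tokens, and the 0.5 threshold are all placeholder assumptions.

```python
# Minimal sketch (my own illustration, not code from the PHF paper) of
# feedback being integrated into pretraining via conditional training.
# `preference_score` is a stand-in for whatever reward model or classifier
# the developer trusts; the 0.5 threshold is arbitrary.

GOOD, BAD = "<|good|>", "<|bad|>"

def tag_document(text: str, preference_score) -> str:
    """Prepend a control token so the LM learns p(text | judged-good)."""
    token = GOOD if preference_score(text) >= 0.5 else BAD
    return f"{token}{text}"

def build_pretraining_stream(corpus, preference_score):
    # Every document carries its feedback signal from the first gradient
    # step, instead of feedback arriving only in a post-training phase.
    for doc in corpus:
        yield tag_document(doc, preference_score)

# At deployment, generation is conditioned on the GOOD token, e.g.
#   model.generate(GOOD + user_prompt)
# so the desired behaviour was baked in during training rather than
# patched onto an already-capable model.
```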
I’ll respond to the challenge described here:
If you disagree with my assertion, I challenge you to cite or openly publish an actual plan for aligning or controlling a hybrid AGI system. If companies claim to have a plan that they do not want their competitors to see, I will argue that they are lying: if they lose the AGI race, they are clearly better off if their competitors align/control their AGI instead of Earth getting taken over by unaligned machines.
As someone not affiliated with big tech companies, my plan for alignment looks like the comments/links below: I’ve sketched a vision for how we’d align an AGI, essentially an anytime plan in the spirit of @ryan_greenblatt’s post on what an alignment plan would look like if transformative AI (also called AGI) were developed in, say, 1 year:
https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform#W4kdqMERtbL7Nfvfo
The summary of the alignment plan is this: first gather up (or generate on demand) large synthetic datasets that demonstrate human values, and then either feed them as direct training data into the hybrid LLM AGI before it gains enough capability to hack the human evaluator, or use the synthetic datasets to define an explicit reward function in the style of model-based RL.
The key principle is to train it to be aligned before or while it becomes more capable, not after it’s superhumanly capable, which is why post-training methods like RLHF are not a good alignment strategy.
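As a rough sketch of the second route, here is what defining an explicit reward function from the synthetic value datasets could look like. The names `embed`, `good_examples`, and `bad_examples` are hypothetical stand-ins, and a real reward model would be far more sophisticated than a linear fit; the point is only that the reward is frozen and defined up front, rather than depending on an online evaluator a capable system could learn to game.

```python
# Rough sketch (my framing, with hypothetical names) of turning synthetic
# human-values datasets into an explicit reward function for model-based RL.

import numpy as np

def fit_reward_function(good_examples, bad_examples, embed):
    """Fit a simple linear reward on labelled value-demonstration data."""
    X = np.array([embed(t) for t in good_examples + bad_examples])
    y = np.array([1.0] * len(good_examples) + [0.0] * len(bad_examples))
    # Least-squares stand-in for whatever reward-model training is used.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    def reward(text: str) -> float:
        # Frozen reward defined from the value datasets, fixed before the
        # system becomes capable enough to manipulate its evaluator.
        return float(embed(text) @ w)

    return reward
```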
The more complicated story is told in the series of links below:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/#BxNLNXhpGhxzm7heg
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#7bvmdfhzfdThZ6qck
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=DgLC43S7PgMuC878j
One important point often left out of alignment discussions, but which I consider necessary to raise, is that as we progress to superintelligence, an AI’s motivations should be based more around fictional characters like angels, devas, superheroic AIs, totemic spirits, matron deities, etc., primarily because human control of AI can get very scary very fast, and I think the problem of technical control is now less important than the problem of particular humans controlling AIs and creating terrible equilibria for everyone else while they profit.
More links incoming:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
https://www.lesswrong.com/posts/RCoBrvWfBMzGWNZ4t/motivating-alignment-of-llm-powered-agents-easy-for-agi-hard#Better_Role_Models_for_Aligned_LLM_Agents
It shares a lot with the bitter lesson approach to AI alignment advocated by @RogerDearnaley, but with more of a focus on data and compute, and less focus on the complexities of how humans mechanistically instantiate values:
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1#A__Bitter_Lesson__Motivated_Approach_to_Alignment
Now that I’m done presenting my own plan, I want to address one more issue.
I have issues with this:
The two traditional approaches are either figuring out how to control something much smarter than us via formal verification or other techniques, or to make control unnecessary by “aligning” it: ensuring that it has goals aligned with humanity’s best interests, and that it will retain these goals even if it recursively self-improves its intelligence from roughly human level to astronomically higher levels allowed by the laws of physics.
I feel like this surveys alignment/control techniques at too high a level of abstraction to be useful, and just as there is an existential quantifier on AI being dangerous, there is also an existential quantifier over alignment/control methods: if even one technique works, we’d be way, way safer from AI risk.
More generally, my response to this section is that just as there is a garden of forking paths for AI capabilities, where only one approach out of many needs to work, a similar garden of forking paths exists for AI alignment: only a few approaches need to work for AI safety to be achieved, no matter how many other methods fail:
https://gwern.net/forking-path
And that’s the end of the comment.
Thanks Noosphere89 for your long and thoughtful comment! I don’t have time to respond to everything before putting my 1-year-old to bed, but here are some brief comments.
1) Although I appreciate that you wrote out a proposed AGI alignment plan, I think you’ll agree that it contains no theorems or proofs, or even quantitative risk bounds. Since we insist on quantitative risk bounds before allowing much less dangerous technology such as airplanes and nuclear reactors, my view is that it would be crazy to launch AGI without quantitative risk bounds—especially when you’re dealing with a super-human mind that might actively optimize against vulnerabilities of the alignment system. As you know, rigorously ensuring retained alignment under recursive self-improvement is extremely difficult. For example, MIRI had highly talented researchers work on this for many years without completing the task.
2) On the point you make about fear of 1984 vs. fear of extinction: if someone assigns P(1984) >> P(extinction) and there’s no convincing plan for preventing AGI loss of control, then I’d argue that it’s still crazy for them (or for China) to build AGI. So they’d both forge ahead with increasingly powerful yet controllable tool AGI, presumably remaining in today’s mutually-assured-destruction paradigm where neither has an incentive to try to conquer the other.
I have yet to hear a version of the “but China!” argument that makes any sense if you believe that the AGI race is a suicide race rather than a traditional arms race. Those I hear making it are usually people who also dismiss the AGI extinction risk. If anything, the current Chinese leadership seems more concerned about AI x-risk than Western leaders.
I can wait for your response, so don’t take this as meaning you need to respond immediately, but I do have some comments.
After you are done with everything, I invite you to respond to this comment.
In response to 1): I definitely didn’t give quantitative risk bounds for my proposal, for a couple of reasons:
1. My alignment proposal would require a lot more work and concreteness than I was able to give it; my goal was only to make an alignment proposal concrete enough that other people could fill in the details of how it would actually be done.
Then again, that’s why they are paid the big bucks and not me.
2. I am both much more skeptical of formal proof/verification for AGI safety than you are, and I also believe formal proofs are unnecessary for getting high confidence that an alignment plan works (though I do think formal proof may, emphasis on may, be useful for AI control meta-strategies).
For example, I currently consider the Provably Safe AI agenda of Steve Omohundro and Ben Goldhaber to be far too ambitious. The biggest issue, IMO, is that the things they are promising rely on being able to quantify over all higher-order behaviors that a system doesn’t have, which is beyond the reach of currently foreseeable formalization techniques. Relatedly, Zach Hatfield-Dodds and Ben Goldhaber have a bet on whether 3 locks that can’t be illegitimately unlocked can be designed via formal proof: Zach Hatfield-Dodds bet no, Ben Goldhaber bet yes, and the bet will resolve in 2027.
See these links for more:
https://www.lesswrong.com/posts/B2bg677TaS4cmDPzL/limitations-on-formal-verification-for-ai-safety#kPRnieFrEEifZjksa
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#Ku3X4QDBSyZhrtxkM
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#jjFsFmLbKNtMRyttK
https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects#Ght9hffumLkjxxNaw
I support more use of quantitative risk estimation in general, and would plausibly support a policy forcing AI developers to estimate that their AI has, say, less than a 1% chance of ending the world, but I don’t think it’s crazy to forgo quantitative formal proofs of AI alignment/control at this stage and instead argue for more swiss-cheese-style safety.
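As a toy illustration of the kind of quantitative-but-proof-free argument I have in mind for swiss-cheese safety, with entirely made-up numbers:

```python
# Toy illustration (hypothetical numbers, purely for intuition) of the
# swiss-cheese argument: if several roughly independent safety layers each
# catch most failures, the residual risk that every layer fails at once
# can be driven low without any formal proof.

layer_failure_probs = [0.2, 0.3, 0.25, 0.15]  # assumed per-layer miss rates

p_all_layers_fail = 1.0
for p in layer_failure_probs:
    p_all_layers_fail *= p

print(f"P(every layer fails) = {p_all_layers_fail:.4f}")
# 0.2 * 0.3 * 0.25 * 0.15 is roughly 0.002.
# Caveat: this assumes the layers fail roughly independently, which a
# capable adversarial optimizer is exactly the thing that could break.
```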
Another thing that influences me: I make basically zero update from MIRI failing to solve the AI alignment problem toward other groups failing, mostly because I think they made far less progress than basically every other group, to the point where I think Pretraining from Human Feedback made more progress on the alignment problem than basically all of MIRI’s work, and their plans were IMO fairly doomed even in a hypothetical world where alignment is easy, since they restricted their techniques too much and didn’t touch reality at all.
So I disagree with this being a substantial update:
For example, MIRI had highly talented researchers work on this for many years without completing the task.
On this claim specifically:
I have yet to hear a version of the “but China!” argument that makes any sense if you believe that the AGI race is a suicide race rather than a traditional arms race. Those I hear making it are usually people who also dismiss the AGI extinction risk. If anything, the current Chinese leadership seems more concerned about AI x-risk than Western leaders.
I agree that this is not an argument for AI companies to race to AGI, but I consider the evidence from your article that China is more concerned than the West to be fairly weak, and I think it could plausibly fail to convince someone whose P(1984) >> P(extinction) for that reason.
When it comes to formal verification, I’m curious what you think about the heuristic-arguments line of research that ARC is pursuing:
https://www.lesswrong.com/posts/QA3cmgNtNriMpxQgo/research-update-towards-a-law-of-iterated-expectations-for
It isn’t formal verification in the usual sense of the word, but rather probabilistic verification, if that makes sense?
You could then apply something like control-theory methods to ensure that the expected divergence from the heuristic is less than a certain percentage in different places. In the limit it seems to me that this could converge towards formal verification proofs; it’s almost like swiss-cheese safety at the model level?
(Yes, this comment is a bit random with respect to the rest of the context, but I find it an interesting question for control in terms of formal verification, and it seemed like you might have some interesting takes here.)
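To gesture at what I mean by bounding the expected divergence, here is a rough sketch under my own simplifying assumptions: `model`, `heuristic`, and `sample_input` are placeholders, the per-sample divergence is assumed to be normalised into [0, 1], and a real version would need a much more careful treatment of the input distribution.

```python
# Sketch of "probabilistic verification": estimate the expected divergence
# between the model and the heuristic argument's prediction on sampled
# inputs, and attach a concentration bound instead of a formal proof.

import math

def divergence_bound(model, heuristic, sample_input, n=10_000, delta=1e-3):
    """Empirical mean divergence plus a Hoeffding-style error term.

    Assumes each per-sample divergence lies in [0, 1]; with probability
    at least 1 - delta, the true expected divergence is below the
    returned value.
    """
    divs = []
    for _ in range(n):
        x = sample_input()
        divs.append(min(1.0, abs(model(x) - heuristic(x))))
    mean = sum(divs) / n
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # Hoeffding term
    return mean + eps

# The result is a quantitative (though much weaker-than-proof) statement of
# the form "expected divergence from the heuristic is below X", the kind of
# bound a control-theory-style layering could then consume.
```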
You know what, I’ve identified a scenario where formal verification is both tractable and helps reduce AI risk: the broad answer is making our codebases far more secure and controllable, assuming heavy AI automation of mathematics and coding is achieved (which I expect to happen before AI can do everything else, as it’s a domain with strong feedback loops, where easy verification against ground truth is possible, and where you can get very large amounts of data to continually improve).
Here are the links below:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#sgH9iCyon55yQyDqF
I’ve also discussed the value and limits of formal proofs in another comment below. The short answer: it’s probably really helpful in an infrastructure sense, but not so much as a means to make anything other than software and mathematics safe and formally specified (which would itself be a huge win if we could do it). We will not be able to prove that a piece of software isn’t a threat to something else in the world entirely, and the same applies to biology, e.g. determining whether a gene or virus will harm humans, mostly because we don’t have a path to quantifying all possible higher-order behaviors a system doesn’t have:
https://www.lesswrong.com/posts/oJQnRDbgSS8i6DwNu/the-hopium-wars-the-agi-entente-delusion#2GDjfZTJ8AZrh9i7Y
My take on this is that I’d be interested to see how the research goes; there may be value in this approach, and it may be a useful way to get a quantitative estimate/bound in the future, because it relaxes its goals relative to full formal proof.
I’d like to see what eventually happens for this research direction:
Could we reliably give heuristic arguments for neural networks when proofs failed, or is it too hard to provide relevant arguments?
I do want to say that, on formal verification/proof itself, I think the most useful application is not proving non-trivial things, but rather keeping ourselves honest about the assumptions we are using.
I’m not sure how many people see the risk of eternal authoritarianism as much lower and how many people see it as being suppressed by the higher probability of loss of control[1]. Or in Bayesian terms:
P(eternal authoritarianism) = P(eternal authoritarianism | control is maintained) ⋅ P(control is maintained)
Both sides may agree that P(eternal authoritarianism | control is maintained) is high, only disagreeing on P(control is maintained).
[1] Here, ‘control’ is short for all forms of ensuring AI alignment to humans, whether to all humans, some of them, or just one.
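To make the decomposition above concrete with purely hypothetical numbers:

```python
# Worked example with made-up numbers: both sides can agree that
# authoritarian lock-in is likely *if* control is maintained, while
# disagreeing sharply on how likely control is to be maintained at all.

p_ea_given_control = 0.8  # shared assumption in this toy example

for label, p_control in [("loss-of-control skeptic", 0.9),
                         ("loss-of-control concerned", 0.2)]:
    p_eternal_authoritarianism = p_ea_given_control * p_control
    print(label, round(p_eternal_authoritarianism, 2))
# -> 0.72 vs 0.16: the same conditional view yields very different
#    headline risks of eternal authoritarianism.
```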
Yeah, from a more technical perspective, I forgot to include whether control is maintained or lost in the short/long run as an important variable to track.