Inspired by Mark Xu’s Quick Take on control.
Some thoughts on the prevalence of alignment over control approaches in AI Safety.
“Alignment research” has become loosely synonymous with “AI Safety research”. I don’t know of any researcher who would say they’re identical, but alignment seems to be considered the default AI Safety strategy. This seems problematic: it may be causing some mild group-think and discouraging people from pursuing non-alignment AI Safety agendas.
Prosaic alignment research in the short term results in a better product and makes investment in AI more lucrative. This shortens timelines and, I claim, increases X-risk. Christiano’s work on RLHF[1] was clearly motivated by AI Safety concerns, but the resulting technique was then used to improve the performance of today’s models.[2] Meanwhile, there are strong arguments that RLHF is insufficient to solve (existential) AI Safety.[3][4]
Better theoretical understanding of how to control an unaligned (or partially aligned) AI facilitates better governance strategies. There is no physical law implying that the compute threshold mentioned in SB 1047[5] would have been appropriate. “Scaling Laws” are purely empirical and may not hold in the long term.
In addition, we should anticipate rapid technological change as capabilities companies leverage AI-assisted research (and ultimately fully autonomous researchers). If we want laws which can reliably control AI, we need stronger theoretical foundations for our control approaches.
Solving alignment is only sufficient for mitigating the majority of harm caused by AI if the AI is aligned to appropriate goals. To me, it appears the community is incredibly trusting in the goodwill of the small group of private individuals who will determine the alignment of an ASI.[6]
My understanding of the position held by both Yudkowsky and Christiano is that AI alignment is a very difficult problem and the probability is high that humanity is completely paperclipped. Outcomes in which a small portion of humanity survives are difficult to achieve[7] and human-controlled dystopias are better than annihilation[8].
I do not believe this is representative of the views of the majority of humans sharing this planet. Human history shows that many people are willing to risk death to preserve the autonomy of themselves, their nation, or their cultural group[9]. Dystopias also carry substantial S-risk for those unlucky enough to live through them, with modern history containing multiple examples of widespread, industrialised torture.
Control approaches present an alternative in which we do not have to get it right the first time, may avert a dystopia and do not need to trust a small group of humans to behave morally.
If control approaches aid governance, alignment research makes contemporary systems more profitable, and the majority of AI Safety research is being done at capabilities companies, it shouldn’t be surprising that the focus remains on alignment.
https://openai.com/index/learning-from-human-preferences/
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
An argument for RLHF being problematic is made formally by Ngo, Chan and Mindermann (2022).
See also discussion in this comment chain on Cotra’s Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (2022)
models trained using 10^26 integer or floating-point operations.
https://www.lesswrong.com/posts/DJRe5obJd7kqCkvRr/don-t-leave-your-fingerprints-on-the-future
“all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’”—Yudkowsky, AGI Ruin: A List of Lethalities (2022)
“Sure, maybe. But that’s still better than a paperclip maximizer killing us all.”—Christiano on the prospect of dystopian outcomes.
There is no global polling on this exact question. Consider that people across cultures have proclaimed that they’d prefer death to a life without freedom. See also men who committed suicide rather than face a lifetime of slavery.
My views on your bullet points:
I agree with number 1 pretty totally, and think the conflation of AI safety and AI alignment is a pretty large problem in the AI safety field, driven IMO mostly by LessWrong, which birthed the AI safety community and still has significant influence over it.
I disagree with this important claim on bullet point 2:
“…I claim, increases X-risk”
primarily because I believe the evidential weight of “negative-to-low-tax alignment strategies are possible” outweighs the timeline-shortening effect; cf. Pretraining from Human Feedback, which applies human feedback during training and is IMO conceptually better than RLHF, primarily because it avoids the dangerous possibility of the model persuading humans to give it high reward for dangerous actions.
Also, while there are reasonable arguments that RLHF isn’t a full solution to the alignment problem, I do think it’s worth pointing out that where it breaks down matters, and it also gives us a proof of concept of how a solution to the alignment problem may have low or negative taxes.
Link below:
https://www.lesswrong.com/posts/xhLopzaJHtdkz9siQ/the-case-for-a-negative-alignment-tax#The_elephant_in_the_lab__RLHF
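For readers who want the contrast made concrete, here is a minimal sketch. It assumes a HuggingFace-style causal language model and tokenizer, and the <|good|>/<|bad|> tags and function names are illustrative rather than the paper’s exact recipe. It shows how PHF-style conditional pretraining folds feedback into ordinary next-token prediction, whereas RLHF optimises against a learned reward supplied after the fact:

```python
# Minimal sketch contrasting PHF-style conditional pretraining with an
# RLHF-style objective. The model/tokenizer interface is assumed to be
# HuggingFace-like, and the tags below are illustrative, not the exact
# tokens used in the Pretraining from Human Feedback paper.
import torch.nn.functional as F

GOOD_TAG, BAD_TAG = "<|good|>", "<|bad|>"  # hypothetical control tokens

def phf_conditional_loss(model, tokenizer, text, is_acceptable):
    """PHF-style: prepend a feedback tag and use ordinary next-token
    prediction, so human feedback shapes the pretraining distribution
    rather than becoming a signal the policy is optimised against."""
    tagged = (GOOD_TAG if is_acceptable else BAD_TAG) + text
    ids = tokenizer(tagged, return_tensors="pt").input_ids
    logits = model(ids[:, :-1]).logits  # predict token t+1 from tokens <= t
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1)
    )

def rlhf_policy_loss(policy_logprob, ref_logprob, learned_reward, beta=0.1):
    """RLHF-style (simplified): maximise a learned reward minus a KL
    penalty to a reference model. Because the reward model is trained
    from human judgements, the policy can be rewarded for persuading
    the raters, which is the failure mode described above."""
    kl_penalty = beta * (policy_logprob - ref_logprob)
    return -(learned_reward - kl_penalty)
```

The relevant design difference is just that the conditional objective does not directly optimise the model against the raters’ judgements, which is the failure mode the comment above attributes to RLHF.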
Agree with bullet point 3, though more aligned AI models also help with governance.
Also, Chinchilla scaling laws have been proven to be optimal in this paper, though I’m not sure what assumptions they are making, since I haven’t read the paper.
https://arxiv.org/abs/2410.01243
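For concreteness, the scaling law at issue is an empirical curve fit rather than a derived result. The sketch below is my own illustration: the constants are the approximate fits reported by Hoffmann et al. (2022), and C ≈ 6·N·D with D ≈ 20·N are common rules of thumb, not guarantees. It shows how such fits translate a compute threshold like SB 1047’s 10^26 operations into a rough model and data scale:

```python
# Rough sketch of why compute thresholds are empirical artefacts rather
# than physical laws. Constants are the approximate Chinchilla fits
# (Hoffmann et al., 2022) plus common rules of thumb; none of them are
# guaranteed to hold for future training methods.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # fitted, not derived

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Fitted parametric form L(N, D) ~ E + A/N^alpha + B/D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

def scale_at_threshold(compute_flops: float, tokens_per_param: float = 20.0):
    """Invert C ~ 6*N*D using the compute-optimal heuristic D ~ 20*N."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    n, d = scale_at_threshold(1e26)  # an SB 1047-style 10^26 threshold
    print(f"~{n:.2e} params, ~{d:.2e} tokens, "
          f"predicted loss ~{chinchilla_loss(n, d):.2f}")
    # Prints roughly 9e11 params and 2e13 tokens; a different fit or a
    # different training recipe would place the frontier elsewhere.
```

Nothing in the fit is a law of nature, which is the original post’s point: change the training recipe and both the curve and any threshold derived from it can move.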
For bullet point 4, I think this is basically correct:
“Solving alignment is only sufficient for mitigating the majority of harm caused by AI if the AI is aligned to appropriate goals. To me, it appears the community is incredibly trusting in the goodwill of the small group of private individuals who will determine the alignment of an ASI.”
I think part of the issue is that people often imagined an AI that would be, for good or ill, uncontrollable by everyone and would near-instantaneously rule the world, such that what values it has matters more than anything else, including politics. I remember Eliezer stating that he didn’t want useless fighting over politics because the AI aimability/control problem was far more important, which morphed into the alignment problem.
Here’s the quote:
“Why doesn’t the Less Wrong / AI risk community have good terminology for the right hand side of the diagram? Well, this (I think) goes back to a decision by Eliezer from the SL4 mailing list days that one should not discuss what the world would be like after the singularity, because a lot of time would be wasted arguing about politics, instead of the then more urgent problem of solving the AI Aimability problem (which was then called the control problem). At the time this decision was probably correct, but times have changed. There are now quite a few people working on Aimability, and far more are surely to come, and it also seems quite likely (though not certain) that Eliezer was wrong about how hard Aimability/Control actually is.”
Link is below:
https://www.lesswrong.com/posts/sy4whuaczvLsn9PNc/ai-alignment-is-a-dangerously-overloaded-term
“My understanding of the position held by both Yudkowsky and Christiano is that AI alignment is a very difficult problem and the probability is high that humanity is completely paperclipped. Outcomes in which a small portion of humanity survives are difficult to achieve and human-controlled dystopias are better than annihilation.
I do not believe this is representative of the views of the majority of humans sharing this planet. Human history shows that many people are willing to risk death to preserve the autonomy of themselves, their nation, or their cultural group. Dystopias also carry substantial S-risk for those unlucky enough to live through them, with modern history containing multiple examples of widespread, industrialised torture. Control approaches present an alternative in which we do not have to get it right the first time, may avert a dystopia and do not need to trust a small group of humans to behave morally.”
I basically agree with this. I also think the probability of S-risk from AIs aligned to their owners causing severe dystopias that engage in extraordinarily horrific things like torture is higher than the probability of X-risk from AI misalignment causing a disaster, due to both my optimism that AI alignment is likely to be achieved by default and my somewhat pessimistic view of governance in the AGI transition. And in my value system, it is way better to risk a small chance of death than a small chance of torture.
However, I disagree with the assumption that AI control differentially averts dystopia or removes the need to trust a small group of humans to behave morally, because I agree with Charlie Steiner on the dual-use nature of safety research in general, and think that control approaches that let you control superhuman AI safely with high probability also let you create a dystopia.
This is an instance of a more general trade-off between reducing AI misalignment risk vs reducing AI misuse/structural risks.
I disagree with bullet point 5. I believe the focus on alignment exists basically because of LW, where a lot of LWers believe the alignment problem is harder to solve and more urgent than the misuse problem, and I believe both control and alignment make AI systems more profitable to build and aid governance. This focus existed even in the very early days of LW.
Control also makes AI more profitable, and more attractive to human tyrants, in worlds where control is useful. People want to know they can extract useful work from the AIs they build, and if problems with deceptiveness (or whatever control-focused people think the main problem is) are predictable, having control measures ready to hand will make AI more profitable and lead to more powerful AI getting used.
This isn’t a knock-down argument against anything, it’s just pointing out that the inherent dual use of safety research is pretty broad—I suspect it’s less obvious for AI control simply because AI control hasn’t been useful for safety yet.