I touched on this idea indirectly in the original post when discussing alignment-related High Impact Tasks (HITs), but I didn’t explicitly connect it to the potential for reducing implementation costs, and you’re right to point that out.
Let me clarify how the framework handles this aspect and elaborate on its implications.
Key points:
Alignment-related HITs, such as automating oversight or interpretability research, add a layer of complication. We need to ask: what’s the difficulty of aligning a system capable of automating the alignment of systems capable of achieving HITs?
The HIT framing is flexible enough to accommodate the use of AI for accelerating alignment research, not just for directly reducing existential risk. If full alignment automation of systems capable of performing (non-alignment-related) HITs is construed as an HIT, the actual alignment difficulty corresponds to the level required to align the AI system performing the automation, not the automated task itself.
In practice, a combination of AI systems at various alignment difficulty levels will likely be employed to reduce costs and risks for both alignment-related tasks and other applications. Partial automation and acceleration by AI systems can significantly impact the cost curve for implementing advanced alignment techniques, even if full automation is not possible.
The cost curve presented in the original post assumes no AI assistance, but in reality, AI involvement in alignment research could substantially alter its shape. This is because the cost curve covers the cost of performing research “to achieve the given HITs”, and since substantially automating alignment research is a possible HIT, by definition the cost graph is not supposed to include substantial assistance on alignment research.
However, that makes it unrealistic in practice, especially because (as indicated by the haziness on the graph) there will be many HITs, both accelerating alignment research and incrementally reducing overall risk.
To illustrate, consider a scenario where scalable oversight at level 4 of the alignment difficulty scale is used to fully automate mechanistic interpretability at level 6. This level 6 system can then go on to, say, research a method of impregnable cyber-defense, develop rapid counter-bioweapon vaccines, give superhuman geopolitical strategic advice, and unmask any unaligned AI present on the internet.
In this case, the actual difficulty level would be 4, with the HIT being the automation of the level 6 technique that’s then used to reduce risk substantially.
On the graph of alignment difficulty and cost, I think the shape depends on the inherent increase in alignment cost and the degree of automation we can expect, which is similar to the idea of the offence-defence balance.
In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty, and automation might then lower it to a linear increase.
In the best case, automation covers all of the costs associated with increasing alignment difficulty, so the graph is flat in terms of human effort: more advanced alignment solutions aren’t any harder to implement than earlier, simpler ones.
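As a rough illustration of those cases, here is a toy sketch of how human effort might scale with difficulty level. Everything in it is an assumption made up for illustration: the function name `human_cost`, the `base` parameter, the automation labels, and the specific functional forms. None of it comes from the original post's graph, and the numbers carry no empirical weight.

```python
def human_cost(level, base=2.0, automation="none"):
    """Toy human-effort cost at a given alignment difficulty level.

    All functional forms here are illustrative assumptions, not estimates.
    """
    if automation == "none":
        # Worst case with no AI assistance: cost grows exponentially.
        return base ** level
    if automation == "partial":
        # Automation absorbs much of the growth, leaving a linear increase.
        return base * level
    if automation == "full":
        # Best case: automation covers all added cost, so human effort is flat.
        return base
    raise ValueError(f"unknown automation setting: {automation}")


if __name__ == "__main__":
    for level in (2, 4, 6, 8):
        row = [human_cost(level, automation=a) for a in ("none", "partial", "full")]
        print(level, row)
```

The only point of the sketch is that the same underlying difficulty scale can produce very different human-effort curves depending on how much of the added cost automation absorbs.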
Thank you for the insightful comment.