An intuition for safety cases for automated safety research over time
Safety cases: we want to be able to make a (conservative) argument for why a certain AI system won't e.g. pose x-risk with probability > p per year. Safety cases rely on composing safety arguments / techniques into a 'holistic case'.
Safety arguments are rated on three measures:
Practicality: ‘Could the argument be made soon or does it require substantial research progress?’
Maximum strength: ‘How much confidence could the argument give evaluators that the AI systems are safe?’
Scalability: ‘To what extent can the argument be applied to very powerful AI systems?’
Examples of safety arguments (fig. from Safety Cases: How to Justify the Safety of Advanced AI Systems):
Here’s an intuition for how to iteratively build safety cases for automated safety research, as AI systems become increasingly capable (a toy sketch of this process follows the list):
we want to ‘pile on’ / combine methods which are already Practical enough and not too alignment-tax-costly (e.g. assuming a limited budget of [training and / or inference] compute, money, and / or researchers), even though they might not be very Scalable (in the long run);
as time goes by and AIs become more capable, we should expect some of our methods’ Scalability to be ‘outpaced’ by the capabilities of new systems (e.g. maybe they’ll pass [some] dangerous capability (DC) evals → we won’t be able to rely on [some] inability arguments anymore);
we want to be able to use auto safety research to raise the Practicality of methods which are Scalable enough, but not Practical enough yet
e.g. unlearning, interpretability (see figure above)
we need our auto safety researchers to be capable of this [with reasonable likelihood], though.
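To make the iteration above concrete, here is a minimal toy sketch of the selection loop, assuming we can (very roughly) score each safety argument by Practicality, Scalability, and alignment-tax cost. All names, scores, and thresholds below are hypothetical placeholders, not real assessments: the point is only the shape of the process (keep what is still Practical and not yet outpaced, under a budget; spend auto safety research raising the Practicality of what still Scales).

```python
# Toy model of the iteration above (illustrative only; all numbers are made up).
from dataclasses import dataclass


@dataclass
class SafetyArgument:
    name: str
    practicality: float  # how ready the argument is today (0-1)
    scalability: float   # capability level up to which it still applies
    cost: float          # alignment tax (compute / money / researcher budget units)


def build_case(arguments, capability, budget):
    """Pick arguments that still apply at this capability level and are Practical enough, under a budget."""
    usable = [a for a in arguments if a.scalability >= capability and a.practicality >= 0.5]
    usable.sort(key=lambda a: a.practicality, reverse=True)
    case, spent = [], 0.0
    for a in usable:
        if spent + a.cost <= budget:
            case.append(a)
            spent += a.cost
    return case


def auto_safety_research(arguments, capability, effort=0.1):
    """Spend auto-safety-research effort raising the Practicality of Scalable-but-not-yet-Practical methods."""
    for a in arguments:
        if a.scalability >= capability and a.practicality < 0.5:
            a.practicality = min(1.0, a.practicality + effort)


# Hypothetical portfolio; scores are placeholders.
portfolio = [
    SafetyArgument("inability (DC evals)", practicality=0.9, scalability=2.0, cost=1.0),
    SafetyArgument("control / monitoring", practicality=0.7, scalability=4.0, cost=2.0),
    SafetyArgument("unlearning",           practicality=0.3, scalability=6.0, cost=2.0),
    SafetyArgument("interpretability",     practicality=0.2, scalability=8.0, cost=3.0),
]

for capability in [1.0, 3.0, 5.0, 7.0]:  # AI systems becoming increasingly capable
    case = build_case(portfolio, capability, budget=5.0)
    print(capability, [a.name for a in case])
    auto_safety_research(portfolio, capability)
```

Running the sketch, inability-style arguments drop out of the case once capability exceeds their (toy) Scalability, while unlearning and interpretability only enter the case after enough auto safety research has raised their Practicality.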
Acknowledgments: early versions of this idea were developed during my participation in the Astra Fellowship Winter ’24 under @evhub’s mentorship and benefited from many conversations in Constellation.