An intuition for safety cases for automated safety research over time
Safety cases: we want to be able to make a (conservative) argument for why a certain AI system won't e.g. pose x-risk with probability > p per year. Safety cases rely on composing safety arguments / techniques into a 'holistic case'.
Safety arguments are rated on three measures:
Practicality: ‘Could the argument be made soon or does it require substantial research progress?’
Maximum strength: ‘How much confidence could the argument give evaluators that the AI systems are safe?’
Scalability: ‘To what extent can the argument be applied to very powerful AI systems?’
Examples of safety arguments (fig. from Safety Cases: How to Justify the Safety of Advanced AI Systems):
Here’s an intuition for how to iteratively build safety cases for automated safety research, as AI systems become increasingly capable (a toy sketch of this process follows the list):
we want to ‘pile on’ / combine methods which are already Practical enough and not too alignment-tax-costly (e.g. assuming a limited budget of [training and / or inference] compute, money, and / or researchers), even though they might not be very Scalable (in the long run);
as time goes by and AIs become more capable, we should expect some of our methods’ Scalability to be ‘outpaced’ by the capabilities of new systems (e.g. maybe they’ll pass [some] dangerous capability (DC) evals → we won’t be able to rely on [some] inability arguments anymore);
we want to be able to use auto safety research to raise the Practicality of methods which are Scalable enough, but not Practical enough yet
e.g. unlearning, interpretability (see figure above)
we need our auto safety researchers to be capable of this [with reasonable likelihood], though.
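To make the iteration above concrete, here is a minimal toy sketch of the selection loop, assuming we can (very roughly) score each safety argument by Practicality, Scalability, and alignment-tax cost. All names, scores, and thresholds below are hypothetical placeholders, not real assessments: the point is only the shape of the process (keep what is still Practical and not yet outpaced, under a budget; spend auto safety research raising the Practicality of what still Scales).

```python
# Toy model of the iteration above (illustrative only; all numbers are made up).
from dataclasses import dataclass


@dataclass
class SafetyArgument:
    name: str
    practicality: float  # how ready the argument is today (0-1)
    scalability: float   # capability level up to which it still applies
    cost: float          # alignment tax (compute / money / researcher budget units)


def build_case(arguments, capability, budget):
    """Pick arguments that still apply at this capability level and are Practical enough, under a budget."""
    usable = [a for a in arguments if a.scalability >= capability and a.practicality >= 0.5]
    usable.sort(key=lambda a: a.practicality, reverse=True)
    case, spent = [], 0.0
    for a in usable:
        if spent + a.cost <= budget:
            case.append(a)
            spent += a.cost
    return case


def auto_safety_research(arguments, capability, effort=0.1):
    """Spend auto-safety-research effort raising the Practicality of Scalable-but-not-yet-Practical methods."""
    for a in arguments:
        if a.scalability >= capability and a.practicality < 0.5:
            a.practicality = min(1.0, a.practicality + effort)


# Hypothetical portfolio; scores are placeholders.
portfolio = [
    SafetyArgument("inability (DC evals)", practicality=0.9, scalability=2.0, cost=1.0),
    SafetyArgument("control / monitoring", practicality=0.7, scalability=4.0, cost=2.0),
    SafetyArgument("unlearning",           practicality=0.3, scalability=6.0, cost=2.0),
    SafetyArgument("interpretability",     practicality=0.2, scalability=8.0, cost=3.0),
]

for capability in [1.0, 3.0, 5.0, 7.0]:  # AI systems becoming increasingly capable
    case = build_case(portfolio, capability, budget=5.0)
    print(capability, [a.name for a in case])
    auto_safety_research(portfolio, capability)
```

Running the sketch, inability-style arguments drop out of the case once capability exceeds their (toy) Scalability, while unlearning and interpretability only enter the case after enough auto safety research has raised their Practicality.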
Acknowledgments: early versions of this idea were developed during my participation in the Astra Fellowship Winter ’24 under @evhub’s mentorship and benefited from many conversations in Constellation.