I think this report is overall pretty good! I do feel overall confused about whether the kind of thing this report is trying to do works. Here is a long comment I had left on a Google Doc version of this post that captures some of that:
I keep coming back to this document and keep feeling confused whether I expect this to work or provide value.
I feel like the framing here tries to shove a huge amount of complexity and science into a “safety case”, and then the structure of a “safety case” doesn’t feel to me like it is the kind of thing that would help me think about the very difficult and messy questions at hand.
Like, I understand that we would like to somehow force labs to make some proactive case for why their systems are safe, and we want to set things up in a way that they can’t just punt on the difficult questions, or avoid making a coherent argument, or say a bunch of unfalsifiable things, but I somehow don’t think that the way you try to carve the structure of a “safety case” is sufficient here, or that it actually helps someone think through this.
To be clear, the factorization here seems reasonable and like the kind of thing that could help people think about alignment, but it also feels more like it just captures “the state of fashionable AI safety thinking in 2024” more than it is the kind of thing that makes sense to enshrine into a whole methodology (or maybe even regulation).
The distinction between capability-evals and alignment-evals, for example, is one that has gotten lots of emphasis in the last year, and I expect it won’t be as emphasized in a few years, and that we’ll have new distinctions we care a lot about.
Like, I feel like, I don’t have much of any traction on making an air-tight case that any system is safe, or is unsafe, and I feel like this document is framed in a kind of “oh, it’s easy, just draw some boxes and arrows between them and convince the evaluators you are right” way, when like, of course labs will fail to make a comprehensive safety case here.
But I feel like that’s mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.
And I am worried the reason why this document is structured this way is because we, as people who are concerned about x-risk, like a playing field that is skewed towards AI development being hard to do, but this feels slightly underhanded.
But also, I don’t know, I like a lot of the distinctions in this document and found a bunch of the abstractions useful. But I feel like I would much prefer a document of the type of “here are some ways I found helpful for thinking about AI X-risk” or something like that.
Thanks for leaving this comment on the doc and posting it.
But I feel like that’s mostly just a feature of the methodology, not a feature of the territory. Like, if you applied the same methodology to computer security, or financial fraud, or any other highly complex domain you would end up with the same situation where making any airtight case is really hard.
You are right that safety cases are not typically applied to security. Some of the reasons for this are explained in this paper, but I think the main reason is this:
“The obvious difference between safety and security is the presence of an intelligent adversary; as Anderson (Anderson 2008) puts it, safety deals with Murphy’s Law while security deals with Satan’s Law… Security has to deal with agents whose goal is to compromise systems.”
My guess is that most safety evidence will come down to claims like “smart people tried really hard to find a way things could go wrong and couldn’t.” This is part of why I think ‘risk cases’ are very important.
I share the intuitions behind some of your other reactions.
I feel like the framing here tries to shove a huge amount of complexity and science into a “safety case”, and then the structure of a “safety case” doesn’t feel to me like it is the kind of thing that would help me think about the very difficult and messy questions at hand.
Making safety cases is probably hard, but I expect explicitly enumerating their claims and assumptions would be quite clarifying. To be clear, I’m not claiming that decision-makers should only communicate via safety and risk cases. But I think that relying on less formal discussion would be significantly worse.
Part of why I think this is that my intuitions have been wrong over and over again. I’ve often figured this out after eventually asking myself “What claims and assumptions am I making? How confident am I that these claims are correct?”
it also feels more like it just captures “the state of fashionable AI safety thinking in 2024” more than it is the kind of thing that makes sense to enshrine into a whole methodology.
To be clear, the methodology is separate from the enumeration of arguments (which I probably could have done a better job signaling in the paper). The safety cases + risk cases methodology shouldn’t depend too much on what arguments are fashionable at the moment.
I agree that the arguments will evolve to some extent in the coming years. I’m more optimistic about the robustness of the categorization, but that’s maybe minor.