Thanks, John. I’m going to hold off here on in-depth debate about how to choose between different ontologies in this vicinity, as I do think it’s often a complicated and not-obviously-very-useful thing to debate in the abstract, and that lots of taste is involved. I’ll flag, though, that the previous essay on paths and waystations (where I introduce this ontology in more detail) does explicitly name various of the factors you mention (along with a bunch of other not-included subtleties). E.g., re the importance of multiple actors:
Now: so far I’ve only been talking about one actor. But AI safety, famously, implicates many actors at once – actors that can have different safety ranges and capability frontiers, and that can make different development/deployment decisions. This means that even if one actor is adequately cautious, and adequately good at risk evaluation, another might not be...
And re: e.g. multidimensionality, and the difference between “can deploy safely” and “would in practice” -- from footnote 14:
Complexities I’m leaving out (or not making super salient) include: the multi-dimensionality of both the capability frontier and the safety range; the distinction between safety and elicitation; the distinction between development and deployment; the fact that even once an actor “can” develop a given type of AI capability safely, they can still choose an unsafe mode of development regardless; differing probabilities of risk (as opposed to just a single safety range); differing severities of rogue behavior (as opposed to just a single threshold for loss of control); the potential interactions between the risks created by different actors; the specific standards at stake in being “able” to do something safely; etc.
I played around with more complicated ontologies that included more of these complexities, but ended up deciding against them. As ever, there are trade-offs between simplicity and subtlety; I chose a particular way of making those trade-offs, and so far I don't regret it.
Re: who is doing the risk evaluation, how they're getting the information, and the specific decision-making processes: yep, the ontology doesn't say, and I endorse that; I think trying to specify would be too much detail.
Re: why factor apart the capability frontier and the safety range—sure, they’re not independent, but it seems pretty natural to me to think of risk as increasing as frontier capabilities increase, and of our ability to make AIs safe as needing to keep up with that. Not sure I understand your alternative proposals re: “looking at their average and difference as the two degrees of freedom, or their average and difference in log space, or the danger line level and the difference, or...”, though, or how they would improve matters.
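In case it helps pin down where I'm losing the thread, here is my literal reading of those reparameterizations, writing C for the capability frontier, S for the safety range, and D for the danger line (symbols are mine, purely for illustration, and I may well be misreading the proposal):

```latex
% A sketch of one literal reading of the proposed coordinate changes
% (C = capability frontier, S = safety range, D = danger line; labels are mine, not John's):
\[
(C, S)\ \longmapsto\ \Bigl(\tfrac{C+S}{2},\; C-S\Bigr)
\quad\text{or}\quad
\Bigl(\tfrac{\log C+\log S}{2},\; \log C-\log S\Bigr)
\quad\text{or}\quad
\bigl(D,\; C-S\bigr).
\]
```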
As I say, people have different tastes re: ontologies, simplifications, etc. My own taste finds this one fairly natural and useful, and I'm hoping that the use I give it in the rest of the series (e.g., in classifying different waystations and strategies, in thinking about these different feedback loops, etc.) can illustrate why (see also the slime analogy from the previous post for another intuition pump). But I welcome specific proposals for better overall ways of thinking about the issues in play.