I’m mostly not going to comment on Anthropic’s RSP right now, since I don’t really want this post to become about Anthropic’s RSP in particular. I’m happy to talk in more detail about Anthropic’s RSP, perhaps in a separate top-level post dedicated to it, but I’d prefer to keep the discussion here focused on RSPs in general.
One of my main worries with RSPs is that they’ll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That’s much worse than if they were clearly inadequate.
I definitely share this worry. But that’s part of why I’m writing this post! I think it is possible for us to get good RSPs from all the labs and governments, but it’ll take good policy and advocacy work to make that happen.
My main worry here isn’t that we’ll miss catastrophic capabilities in the near term (though that’s possible). Rather, it’s the lack of emphasis on the point that tests will predictably fail to catch some problems, and that there’s a decent chance some tests fail before we expect them to.
I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. Though it’ll require those capabilities evaluations to actually be done effectively, I think we at least do know how to do effective capabilities evaluations—it’s mostly a solved problem in theory and just requires good implementation.
We need governments to make understanding-based evals mandatory before they’re necessary, not only once we have them (NB, not [before it’s clear they’re necessary]; it might not be clear). I don’t expect us to have sufficiently accurate understanding-based evals before they’re necessary (though it’d be lovely).
Pushing to require state-of-the-art safety techniques is the wrong emphasis.
We need to push for adequate safety techniques. If state-of-the-art techniques aren’t yet adequate, then labs need to stop.
The distinction between an alignment technique and an alignment evaluation is very important here: I very much am trying to push for adequate safety techniques rather than simply state-of-the-art safety techniques, and the way I’m proposing we do that is via evaluations that check whether we understand our models. What I think probably needs to happen before you can put understanding-based evals in an RSP is not that we have to solve mechanistic interpretability—it’s that we have to solve understanding-based evals. That is, we need to know how to evaluate whether mechanistic interpretability has been solved or not. My concern with trying to put something like that into an RSP right now is that it’ll end up evaluating the wrong thing: since we don’t yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
I think we at least do know how to do effective capabilities evaluations
This seems an overstatement to me: where the main risk is misuse, we’d need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later (including the most artful AutoGPT 3.0 setups, etc.).
It seems reasonable to me to claim that “we know how to do effective [capabilities given sota elicitation methods] evaluations”, but that doesn’t answer the right question.
Once the main risk isn’t misuse, we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn’t realize we were relying upon]). Obviously we don’t expect these to break yet, but I’d guess that we’ll be surprised the first time they do break. I expect your guess on when they will break to be more accurate than mine—but that [I don’t have much of a clue, so I’m advocating extreme caution] may be the more reasonable policy.
My concern with trying to put something like [understanding-based evals] into an RSP right now is that it’ll end up evaluating the wrong thing: since we don’t yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
We don’t know how to put the concrete eval in the RSP, but we can certainly require that an eval for understanding passes. We can write in the RSP what the test would be intended to achieve, and the conditions for approving the eval. E.g., [if at least two of David Krueger, Wei Dai and Abram Demski agree that this meets the bar for this category of understanding eval, then it does] (or whatever other criteria you might want).
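To make the shape of that kind of condition concrete, here’s a minimal sketch of a “k-of-n designated reviewers sign off” approval gate. It’s purely illustrative: the names and threshold just mirror the example above, and nothing like this appears in any actual RSP.

```python
# Toy sketch of an approval condition like the one above: an understanding
# eval only counts as approved once at least two designated reviewers have
# signed off. Names and threshold are illustrative, not from any real policy.

DESIGNATED_REVIEWERS = {"David Krueger", "Wei Dai", "Abram Demski"}
APPROVAL_THRESHOLD = 2  # "at least two of ..."


def understanding_eval_approved(signoffs: set) -> bool:
    """Return True iff enough designated reviewers have signed off."""
    return len(signoffs & DESIGNATED_REVIEWERS) >= APPROVAL_THRESHOLD


# Two of the three designated reviewers have approved, so the condition holds.
assert understanding_eval_approved({"Wei Dai", "Abram Demski"})
# One sign-off, or sign-offs from people outside the panel, is not enough.
assert not understanding_eval_approved({"Wei Dai", "Someone Else"})
```

The point isn’t the code, of course; it’s that an RSP can pin down the approval procedure for an eval even when the eval itself doesn’t exist yet.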
Again, only putting concretely well-understood targets in the RSP seems like a predictable way to fail to address poorly understood problems. Either the RSP needs to cover the poorly understood problems too (perhaps with a [you can’t pass this check without first coming up with a test and getting it approved] condition), or it needs a “THIS RSP IS INADEQUATE TO ENSURE SAFETY” warning in huge red letters on every page. (If the Anthropic RSP communicates this at all, it’s not emphasized nearly enough.)