Thanks for writing this up. I agree that the issue is important, though I’m skeptical of RSPs so far, since we have one example and it seems inadequate—to the extent that I’m positively disposed, it’s almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP. (of course I may be missing something in some cases)
Going only by the language in the blog post and the policy, I’d conclude that they’re an excuse to continue scaling while being respectably cautious (though not adequately cautious). Granted, I’m not the main target audience—but I worry about the impression the current wording creates.
I hope that RSPs can be beneficial—but I think much more emphasis should be on the need for positive demonstration of safety properties, that this is not currently possible, and that it may take many years for that to change. (mentioned, but not emphasized in the Anthropic policy—and without any “many years” or similar)
It’s hard to summarize my concerns, so apologies if the following ends up somewhat redundant. I’ll focus on your post first, and the RSP blog/policy doc after that.
Governments are very good at solving “80% of the players have committed to safety standards but the remaining 20% are charging ahead recklessly” because the solution in that case is obvious and straightforward.
There’s an obvious thing to do here. It’s far from obvious that it’s a solution. One of my main worries with RSPs is that they’ll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That’s much worse than if they were clearly inadequate.
RSPs are clearly and legibly risk-based: they specifically kick in only when models have capabilities that are relevant to downstream risks.
They kick in when we detect that models have capabilities that we realize are relevant to downstream risks. Both detection and realization can fail.
My main worry here isn’t that we’ll miss catastrophic capabilities in the near term (though it’s possible). Rather it’s the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there’s a decent chance some of them fail before we expect them to.
Using understanding of models as the final hard gate is a condition that—if implemented correctly—is intuitively compelling and actually the thing we need to ensure safety.
This could use greater emphasis in the RSP blog/doc.
Ideally, we should get the governmental RSPs to be even stronger!
Yes!
We need to make sure that, once we have solid understanding-based evals, governments make them mandatory.
We need governments to make them mandatory before they’re necessary, not once we have them (NB, not [before it’s clear they’re necessary] - it might not be clear). I don’t expect us to have sufficiently accurate understanding-based evals before they’re necessary. (though it’d be lovely)
Pushing to require state-of-the-art safety techniques is the wrong emphasis. We need to push for adequate safety techniques. If state-of-the-art techniques aren’t yet adequate, then labs need to stop.
Thoughts on the blog/doc themselves. Something of a laundry list, but hopefully makes clear where I’m coming from:
My top-level concern is overconfidence: to the extent that we understand what’s going on, and things are going as expected, I think RSPs similar to Anthropic’s should be pretty good. This gives me very little comfort, since I expect catastrophes to occur when there’s something unexpected that we’ve failed to understand.
Both the blog post and the policy document fail to make this sufficiently clear.
Examples:
From the blog: “On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling...”.
This is not true: the incentive is to satisfy the conditions in the RSP. That’s likely to mean that the lab believes they’ve solved the necessary safety issues. They may not be correct about that.
To the extent that they triple-check even after they think all is well, that’s based on morality/self-preservation. The RSP incentives do not push that way. Incorrectly believing they push that way doesn’t give me confidence.
No consideration of the possibility of jumping from ASL-(n) to ASL-(n+2).
No consideration of a model being ASL-(n+1), but showing no detectable warning signs beyond ASL-n. (there’s a bunch on bumping into ASL-n before expecting to—but not on this going undetected)
I expect the preceding to be unusual; conditional on catastrophe, I expect something unusual has happened.
On evals:
Demanding capabilities may be strongly correlated, so that it doesn’t matter too much if we fail to test for everything important. Alternatively, it could be the case that we do need to cover all the bases, since correlations aren’t as strong as we expect. In that case, [covering all the bases that we happen to think of] may not be sufficient. (this seems unlikely, but possible, to me)
More serious is the possibility that there are methods of capability elicitation/amplification that the red-teamers don’t find. For example, if no red-teamer had thought to try chain-of-thought approaches, capabilities might have been missed. Where is the guarantee that nothing like this is missed? I don’t see any correlation-based defense here—it seems quite possible that some ways to extract capabilities are just much better than others. What confidence level should we have that testing finds the best ways?
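As a toy illustration of why this doesn’t reassure me (the numbers, the i.i.d. assumption, and the heavy-tailed distribution are all made up for the sake of the example, not claims about real elicitation methods):

```python
# Toy model: the red team tries K elicitation methods; the world eventually tries N.
# With i.i.d. effectiveness, P(the single best method is among the red team's K) is
# K / (N + K); with a heavy-tailed distribution, missing the best method can mean a
# large capability gap. K, N and the Pareto tail are illustrative assumptions only.
import random

random.seed(0)
K, N = 50, 2000        # hypothetical: methods tried pre-release vs. post-release
TRIALS = 20_000

misses, gaps = 0, []
for _ in range(TRIALS):
    best_red = max(random.paretovariate(2.0) for _ in range(K))
    best_world = max(random.paretovariate(2.0) for _ in range(N))
    if best_world > best_red:
        misses += 1
        gaps.append(best_world / best_red)

print(f"P(someone later finds a stronger elicitation method) ~ {misses / TRIALS:.2f}")
print(f"median gap when that happens ~ {sorted(gaps)[len(gaps) // 2]:.2f}x")
```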
Why isn’t it emphasized that red-teaming can show that something is dangerous, but not that it’s safe? Where’s the discussion around how often we should expect tests to fail to catch important problems? Where’s the discussion around p(model is dangerous | model looks safe to us)? Is this low? Why? When? When will this change? How will we know?...
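To make the quantity I’m asking about concrete, here’s a minimal sketch. Every number is an invented placeholder; the point is that the doc never argues what these values are, or that the resulting posterior is acceptably low:

```python
# Toy Bayes calculation of p(model is dangerous | model looks safe to us).
# All inputs below are assumptions for illustration, not estimates.

prior_dangerous = 0.05    # assumed prior that the model has a catastrophic capability
miss_rate = 0.2           # assumed P(looks safe | dangerous): the evals fail to catch it
pass_rate_if_safe = 0.95  # assumed P(looks safe | not dangerous)

p_looks_safe = prior_dangerous * miss_rate + (1 - prior_dangerous) * pass_rate_if_safe
posterior = prior_dangerous * miss_rate / p_looks_safe

print(f"p(dangerous | looks safe) ~ {posterior:.3f}")
# ~0.011 with these made-up inputs; the answer scales roughly linearly with both
# the prior and the miss rate, neither of which the doc estimates.
```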
In general the doc seems to focus on [we’re using the best techniques currently available], and fails to make a case that [the best techniques currently available are sufficient].
E.g. page 16: “Evaluations should be based on the best capabilities elicitation techniques we are aware of at the time”
This worries me because governments/regulators are used to situations where state-of-the-art tests are always adequate (since building x tends to imply understanding x, outside ML). Therefore, I’d want to see this made explicit and clear.
This is the closest I can find, but it’s rather vague:
“Complying with higher ASLs is not just a procedural matter, but may sometimes require research or technical breakthroughs to give affirmative evidence of a model’s safety (which is generally not possible today)...”
It’d be nice if the reader couldn’t assume throughout that the kind of research/breakthrough being talked about is the kind that’s routinely doable within a few months, rather than the kind that may take a decade.
Miscellaneous:
From the policy document, page 2:
“As AI systems continue to scale, they may become capable of increased autonomy that enables them to proliferate and, due to imperfections in current methods for steering such systems, potentially behave in ways contrary to the intent of their designers or users.”
To me “...imperfections in current methods...” seems misleading—it gives the impression that labs basically know what they’re doing on alignment, but need to add a few tweaks here and there. I don’t believe this is true, and I’d be surprised to learn that many at Anthropic believe this.
Policy doc, page 3:
“Rather than try to define all future ASLs and their safety measures now (which would almost certainly not stand the test of time)...”
This seems misleading since it’s not hard to define ASLs and safety measures which would stand the test of time: the difficult thing is to define measures that stand the test of time, but allow scaling to continue.
There’s an implicit assumption here that the correct course is to allow as much scaling as we can get away with, rather than to define strict measures that would stop things for the foreseeable future—given that we may be overconfident. I don’t think it’s crazy to believe the iterative approach is best, but I do think it deserves explicit argument. If the argument is “yes, stricter measures would be nice, but aren’t realistic right now”, then please say this (not just here in your post, I mean—somewhere clear to government people).
In particular, I think it’s principled to make clear that a lab would accept more strict conditions if they were universally enforced than those it would unilaterally adopt. Conversely, I find it worrying for a lab to say “we’re unilaterally doing x, and we think [everyone doing x] is the thing to aim for”, since I expect the x that makes unilateral sense to be inadequate as a global coordination target.
Page 10:
“We will manage our plans and finances to support a pause in model training if one proves necessary”
This seems nice, but gives the impression more of [we might need to pause for six months] than [we might need to pause for ten years]. Given that the latter seems possible, it seems important to acknowledge that radical contingency plans would be necessary for this—and to have such plans (potentially with government assistance, and/or [stuff that hasn’t occurred to me]).
Without that, there’ll be an unhelpful incentive to cut corners or to define inadequate ASLs on the basis that they seem more achievable.
I’m mostly not going to comment on Anthropic’s RSP right now, since I don’t really want this post to become about Anthropic’s RSP in particular. I’m happy to talk in more detail about Anthropic’s RSP maybe in a separate top-level post dedicated to it, but I’d prefer to keep the discussion here focused on RSPs in general.
One of my main worries with RSPs is that they’ll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That’s much worse than if they were clearly inadequate.
I definitely share this worry. But that’s part of why I’m writing this post! Because I think it is possible for us to get good RSPs from all the labs and governments, but it’ll take good policy and advocacy work to make that happen.
My main worry here isn’t that we’ll miss catastrophic capabilities in the near term (though it’s possible). Rather it’s the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there’s a decent chance some of them fail before we expect them to.
I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. It’ll require those capabilities evaluations to actually be done effectively, but I think we at least do know how to do effective capabilities evaluations—it’s mostly a solved problem in theory and just requires good implementation.
We need governments to make them mandatory before they’re necessary, not once we have them (NB, not [before it’s clear they’re necessary] - it might not be clear). I don’t expect us to have sufficiently accurate understanding-based evals before they’re necessary. (though it’d be lovely)
Pushing to require state-of-the-art safety techniques is the wrong emphasis.
We need to push for adequate safety techniques. If state-of-the-art techniques aren’t yet adequate, then labs need to stop.
The distinction between an alignment technique and an alignment evaluation is very important here: I very much am trying to push for adequate safety techniques rather than simply state-of-the-art safety techniques, and the way I’m proposing we do that is via evaluations that check whether we understand our models. What I think probably needs to happen before you can put understanding-based evals in an RSP is not that we have to solve mechanistic interpretability—it’s that we have to solve understanding-based evals. That is, we need to know how to evaluate whether mechanistic interpretability has been solved or not. My concern with trying to put something like that into an RSP right now is that it’ll end up evaluating the wrong thing: since we don’t yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
I think we at least do know how to do effective capabilities evaluations
This seems an overstatement to me: Where the main risk is misuse, we’d need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later. (including the most artful AutoGPT 3.0 setups etc)
It seems reasonable to me to claim that “we know how to do effective [capabilities given sota elicitation methods] evaluations”, but that doesn’t answer the right question.
Once the main risk isn’t misuse, then we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn’t realize we were relying upon]). Obviously we don’t expect these to break yet, but I’d guess that we’ll be surprised the first time they do break. I expect your guess on when they will break to be more accurate than mine—but that [I don’t have much of a clue, so I’m advocating extreme caution] may be the more reasonable policy.
My concern with trying to put something like [understanding-based evals] into an RSP right now is that it’ll end up evaluating the wrong thing: since we don’t yet know how to effectively evaluate understanding, any evaluation we set up right now would probably be too game-able to actually be workable here.
We don’t know how to put the concrete eval in the RSP, but we can certainly require that an eval for understanding passes. We can write in the RSP what the test would be intended to achieve, and conditions for the approval of the eval. E.g. [if at least two of David Krueger, Wei Dai and Abram Demski agree that this meets the bar for this category of understanding eval, then it does] (or whatever other criteria you might want).
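To be clear, I mean something as simple as writing the approval condition itself into the policy. A toy sketch, with the reviewer names taken from my example above and everything else (threshold, function names) hypothetical:

```python
# Toy sketch of the kind of clause I mean: the RSP names the approval criteria for a
# future understanding eval, even though the concrete eval doesn't exist yet.

APPROVED_REVIEWERS = {"David Krueger", "Wei Dai", "Abram Demski"}
REQUIRED_APPROVALS = 2

def understanding_eval_is_approved(signoffs: set[str]) -> bool:
    """An eval only counts once enough of the named reviewers have signed off on it."""
    return len(signoffs & APPROVED_REVIEWERS) >= REQUIRED_APPROVALS

def may_continue_scaling(signoffs: set[str], eval_passed: bool) -> bool:
    """No approved eval, or a failed one, defaults to stopping rather than proceeding."""
    return understanding_eval_is_approved(signoffs) and eval_passed
```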
Again, only putting targets that are well understood concretely in the RSP seems like a predictable way to fail to address poorly understood problems. Either the RSP needs to cover the poorly understood problems too—perhaps with a [you can’t pass this check without first coming up with a test and getting it approved] condition, or it needs a “THIS RSP IS INADEQUATE TO ENSURE SAFETY” warning in huge red letters on every page. (if the Anthropic RSP communicates this at all, it’s not emphasized nearly enough)