Strongly agree with almost all of this.
My main disagreement is that I don’t think the “What would a good RSP look like?” description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding—and we shouldn’t expect to understand how and why it’s insufficient before reality punches us in the face.
Therefore, it’s not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended].
We also need strong evidence that there’ll be no catastrophe-inducing problems we didn’t think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can’t be implicit, since it’s a central way that we die.
If it’s hard/impractical to estimate, then we should pause until we can estimate it more accurately.
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn’t make it ok. Blank map is not blank territory.
If we’re thinking of better mechanisms to achieve a pause, I’d add:
Call it something like a “Responsible training and deployment policy” (RTDP), not an RSP. Scaling is the thing in question; if we want to leave open the possibility that it won’t happen, it shouldn’t be in the title. (compare “Responsible farming policy”, “Responsible fishing policy”, “Responsible diving policy”—all strongly imply that responsible x-ing is possible, and that x-ing will continue to happen subject to various constraints)
Don’t look for a ‘practical’ solution. A serious pause/stop will obviously be impractical (yet not impossible). To restrict ourselves to practical approaches is to give up on any meaningful pause. Doing the impractical is not going to get easier later.
Define ASLs or similar now rather than waiting until we’re much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable, so that we can keep scaling], and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan’s own “We need to make sure that, once we have solid understanding-based evals, governments make them mandatory” only reinforces this impression. Whether we have them is irrelevant to the question of whether they’re necessary.
Be clear and explicit about the potential for very long pauses, and the conditions that would lead to them. Where it’s hard to give precise conditions, give high-level conditions and very conservative concrete defaults (not [reasonably conservative]; [unreasonably conservative]). Have a policy where a compelling, externally reviewed argument is necessary before any conservative default can be relaxed.
I think there’s a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway—with government/regulator backing, since they’re doing everything practical, everything reasonable…
Assuming this won’t happen seems dangerously naive.
If labs are going to re-interpret the goalposts and continue running into the minefield, we need to know this as soon as possible. This requires explicit clarity over what is being asked / suggested / eventually-entailed.
The Anthropic RSP fails at this IMO: no understanding-based requirements; no explicit mention that pausing for years may be necessary.
The ARC Evals RSP description similarly fails—if RSPs are intended to be a path to pausing. “Practical middle ground” amounts to never realistically pausing. They entirely overlook overconfidence as a problem. (frankly, I find this confusing coming from Beth/Paul et al)
Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
This is a natural way to communicate “We’d ideally like [very strict measures], though [less strict measures] are all we can commit to unilaterally”.
If a lab’s unilateral RTDP looks identical to their [conditional on international agreement] RTDP, then they have screwed up.
Strongly consider pushing for safety leads to write and sign the RTDP (with help, obviously). I don’t want the people who know most about safety to be “involved in the drafting process”; I want to know that they oversaw the process and stand by the final version.
I’m sure there are other sensible additions, but that’d be a decent start.
Therefore, it’s not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there’ll be no catastrophe-inducing problems we didn’t think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can’t be implicit, since it’s a central way that we die.
If it’s hard/impractical to estimate, then we should pause until we can estimate it more accurately.
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn’t make it ok. Blank map is not blank territory.
Yeah, I agree—that’s why I’m specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you’d know if there were anything wrong that your other evals might miss.
Define ASLs or similar now rather than waiting until we’re much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will be achievable, so that we can keep scaling], and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan’s own “We need to make sure that, once we have solid understanding-based evals, governments make them mandatory” only reinforces this impression. Whether we have them is irrelevant to the question of whether they’re necessary.
See the bottom of this comment: my main objection here is that if we were to try to define it now, we’d end up defining something easily game-able because we don’t yet have metrics for understanding that aren’t easily game-able. So if we want something that will actually be robust, we have to wait until we know what that something might be—and ideally be very explicit that we don’t yet know what we could put there.
I think there’s a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and to press forward anyway—with government/regulator backing, since they’re doing everything practical, everything reasonable…
Assuming this won’t happen seems dangerously naive.
I definitely agree that this is a serious concern! That’s part of why I’m writing this post: I want more public scrutiny and pressure on RSPs and their implementation to try to prevent this sort of thing.
Have separate RTDPs for unilateral adoption, and adoption subject to multi-lab agreement / international agreement etc. (I expect at least three levels would make sense)
IANAL, but I think that this is currently impossible due to anti-trust regulations. The White House would need to enact a safe harbor policy for anti-trust considerations in the context of AI safety to make this possible.
IANAL, but I think that this is currently impossible due to anti-trust regulations.
I don’t know anything about anti-trust enforcement, but it seems to me that this might be a case where labs should do it anyway and delay hypothetical anti-trust enforcement by fighting in court.