Therefore, it’s not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there’ll be no catastrophe-inducing problems we didn’t think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can’t be implicit, since it’s a central way that we die.
If it’s hard/impractical to estimate, then we should pause until we can estimate it more accurately.
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests, etc. That doesn’t make it ok. Blank map is not blank territory.
Yeah, I agree—that’s why I’m specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you’d know if there were anything wrong that your other evals might miss.
Define ASLs or similar now rather than waiting until we’re much closer to achieving them. Waiting to define them later gives the strong impression that the approach is [pick the strongest ASL definitions and measures that will still be achievable, so that we can keep scaling] and not [pick ASL definitions and measures that are clearly sufficient for safety].
Evan’s own “We need to make sure that, once we have solid understanding-based evals, governments make them mandatory” only reinforces this impression. Whether we have them is irrelevant to the question of whether they’re necessary.
See the bottom of this comment: my main objection here is that if we were to try to define it now, we’d end up defining something easily game-able because we don’t yet have metrics for understanding that aren’t easily game-able. So if we want something that will actually be robust, we have to wait until we know what that something might be—and ideally be very explicit that we don’t yet know what we could put there.
I think there’s a big danger in safety people getting something in place that we think/hope will imply a later pause, only to find that when it really counts the labs decide not to interpret things that way and press forward anyway—with government/regulator backing, since they’re doing everything practical, everything reasonable…
Assuming this won’t happen seems dangerously naive.
I definitely agree that this is a serious concern! That’s part of why I’m writing this post: I want more public scrutiny and pressure on RSPs and their implementation to try to prevent this sort of thing.
Have separate RTDPs for unilateral adoption, and for adoption subject to multi-lab agreement, international agreement, etc. (I expect at least three levels would make sense.)
IANAL, but I think that this is currently impossible due to anti-trust regulations. The White House would need to enact a safe harbor policy for anti-trust considerations in the context of AI safety to make this possible.
IANAL, but I think that this is currently impossible due to anti-trust regulations.
I don’t know anything about anti-trust enforcement, but it seems to me that this might be a case where labs should do it anyway and delay hypothetical anti-trust enforcement by fighting it in court.