More broadly, what do you think we should do once we’re starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
We think it’s likely that, before we find that evals are unworkable, we’ll find that control techniques fail to ensure reasonable levels of both safety and usefulness. That is, to ensure safety for typical use cases, we’d have to impose restrictions that heavily limit usefulness.
See this comment for more discussion of what could fail.
Regardless of what fails first, we can discuss what we should do.
First, you’d notify the key stakeholders that control doesn’t suffice for the next AIs (and ideally produce legible demonstrations of this).
If we’ve already reached a level of AI output where we’re operating at pretty crazy, near-maximal speeds (e.g. the Amdahl’s law style argument here), then we’re in pretty good shape from the technical perspective of a control plan. We’d hope that in this position we can ensure some reasonable delay before being forced to build and use AIs in ways that break control. We might still be forced to build more powerful AIs in this situation, for two reasons. First, other actors might be building insanely powerful AIs that we might need to engage in active conflict against (assuming we can’t just stop these AIs from being trained, which might be way less risky). Second, internal forces might want to irrationally rush to making more powerful AIs.
In the case where we’re not operating at sufficiently high output, the situation is considerably worse. If the AI lab is operating in a sensible way, then the hope would be that you carefully trade off likely exogenous risk against the risk you would be incurring from various usages of the AI[1]. So if exogenous risk isn’t there, we’d just stop and work on developing better approaches. If we need to proceed without further progress, then we’d just try to make the best safety-usefulness tradeoffs that we can afford.
[1] This tradeoff would be made given reasonable guesses at P(scheming) for different AIs and more detailed views on: whether various AIs might scheme but not cooperate with each other, and how bad various bad outcomes would be relative to the exogenous threats (e.g. how does an AI escaping from our lab compare in badness to various exogenous risks).