demonstrating its benignness is not the issue, it’s actually making them benign
This also stood out to me.
Demonstrating benignness is roughly equivalent to being transparent enough that the other actor can determine whether you’re benign + actually being benign.
For now, model evals for dangerous capabilities (along with assurance that you’re evaluating the lab’s most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. Once they no longer work, even just getting good evidence for yourself that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, never mind the problem of proving that to other actors (never mind doing so securely and at reasonable cost).
Two specific places I think are misleading:
Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.
. . .
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.
The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you’re there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.
Or: this post mostly sets aside the alignment/control problem, and it should have signposted more clearly that it was doing so, but it’s still a good post on another problem.
On institutional frameworks: it seems that credible transparency is an important necessary (though not sufficient) step toward credible benignness, that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments, and that it would be a very achievable step forward.
So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I’m not sure what will be, but whatever it is, it will surely require transparent testing by independent observers for credible benignness.