demonstrating its benigness is not the issue, it’s actually making them benign
This also stood out to me.
Demonstrate benignness is roughly equivalent to be transparent enough that the other actor can determine whether you’re benign + actually be benign.
For now, model evals for dangerous capabilities (along with assurance that you’re evaluating the lab’s most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques deployment plans + weights/code security + etc.] is benign is an open problem, nevermind the problem of proving that to other actors (nevermind doing so securely and at reasonable cost).
Two specific places I think are misleading:
Unfortunately, it is inherently difficult to credibly signal the benign of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benign of AI development and deployment, without causing other major problems in the process, an urgent priority.
. . .
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.
The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you’re there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.
Or: this post mostly sets aside the alignment/control problem, and it should have signposted that it was doing so better, but it’s still a good post on another problem.
This also stood out to me.
Demonstrate benignness is roughly equivalent to be transparent enough that the other actor can determine whether you’re benign + actually be benign.
For now, model evals for dangerous capabilities (along with assurance that you’re evaluating the lab’s most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques deployment plans + weights/code security + etc.] is benign is an open problem, nevermind the problem of proving that to other actors (nevermind doing so securely and at reasonable cost).
Two specific places I think are misleading:
The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you’re there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.
Or: this post mostly sets aside the alignment/control problem, and it should have signposted that it was doing so better, but it’s still a good post on another problem.