I like a bunch of the post, but I do feel confused about the framing. At present, substantially scaling up AI systems is probably not benign. As such, demonstrating its benignness is not the issue, it’s actually making them benign (or credibly demonstrating their non-benignness).
Miles doesn’t clarify what level of capabilities he is talking about, but various contextual clues make me think he is talking about the next few generations of frontier systems. I would definitely like to know for sure whether those are benign, but I think there is a quite substantial chance they are not, and in that case the issue is not demonstrating benignness, but, like, actually making them benign.
Miles never says this directly, but the headline and the framing of the article really scream to me that it’s assuming these upcoming systems are benign, while implying that it’s understandable that policymakers and decision-makers are at present not convinced of that fact (while the author is). Again, this is not said super explicitly in the post, but I do think it’s a really strong implication of the title and various paragraphs within it. I would be curious to hear from other people whether they reacted similarly.
Ok, coming back to this a few hours later, here is roughly what feels off to me. I think the key thing is that the priority should be “developing measurement tools that tell us with high confidence whether systems are benign”, but Miles somehow frames this as “measurement tools that tell us with high confidence that AI systems are benign”. That sounds confused in the same way that a scientist saying “what we need are experiments that confirm my theory” sounds kind of confused. (I think there are important differences between Bayesian and scientific evidence, and a scientist can have justified confidence before the scientific community will accept their theory. Still, something seems wrong when a scientist says the top priority is to develop experiments that confirm their theory, as opposed to “experiments that tell us whether my theory is right”.)
“Generate evidence of difficulty” as a research purpose
How to handle the problem of AI risk is one of, if not the most important and consequential strategic decisions facing humanity. If we err in the direction of too much caution, in the short run resources are diverted into AI safety projects that could instead go to other x-risk efforts, and in the long run, billions of people could unnecessarily die while we hold off on building “dangerous” AGI and wait for “safe” algorithms to come along. If we err in the opposite direction, well presumably everyone here already knows the downside there.
A crucial input into this decision is the difficulty of AI safety, and the obvious place for decision makers to obtain evidence about the difficulty of AI safety is from technical AI safety researchers (and AI researchers in general), but it seems that not many people have given much thought on how to optimize for the production and communication of such evidence (leading to communication gaps like this one). (As another example, many people do not seem to consider that doing research on a seemingly intractably difficult problem can be valuable because it can at least generate evidence of difficulty of that particular line of research.)
But I do think that section of the post handles the tradeoff a lot better, and gives me a lot less of the “something is off” vibes.
demonstrating its benignness is not the issue, it’s actually making them benign
This also stood out to me.
Demonstrating benignness is roughly equivalent to being transparent enough that the other actor can determine whether you’re benign, plus actually being benign.
For now, model evals for dangerous capabilities (along with assurance that you’re evaluating the lab’s most powerful model + there are no secret frontier AI projects) could suffice to demonstrate benignness. After they no longer work, even just getting good evidence that your AI project [model + risk assessment policy + safety techniques + deployment plans + weights/code security + etc.] is benign is an open problem, never mind the problem of proving that to other actors (never mind doing so securely and at reasonable cost).
Two specific places I think are misleading:
Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level) due to AI’s status as a general purpose technology, and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.
. . .
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.
The bigger problem is (building powerful AI such that your project is benign and) getting strong evidence for yourself that your project is benign. But once you’re there, demonstrating benignness to other actors is another problem, lesser but potentially important and under-considered.
Or: this post mostly sets aside the alignment/control problem, and it should have signposted that it was doing so better, but it’s still a good post on another problem.
It seems to me that, for institutional frameworks, credible transparency is an important necessary (not sufficient) step for credible benignness, that it is currently not implemented within existing frameworks such as RSPs and Summit commitments, but that it would be a very achievable step forward.
So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I’m not sure what will be, but whatever it is, it will for sure require transparent testing by independent observers for credible benignness.
Edit: There is a similar post by Wei Dai a long time ago that I had some similar feelings about: https://www.lesswrong.com/posts/dt4z82hpvvPFTDTfZ/six-ai-risk-strategy-ideas#_Generate_evidence_of_difficulty__as_a_research_purpose