Thanks for writing this.
I’d be interested in your view on the comments made on Evan’s RSP post w.r.t. unknown unknowns. I think aysja put it best in this comment. It seems important to move the burden of proof.
Would you consider “an unknown unknown causes a catastrophe” to be a “concrete way in which they fail to manage risk”? Concrete or not, this seems sufficient grounds to stop, unless there’s a clear argument that a bit more scaling actually helps for safety. (I’d be interested in your take on that—e.g. on what speed boost you might expect with your own research, given AI assistants of some level)
By default, I don’t expect the “affirmative case for safety that will require novel science” to be sufficient if it ends up looking like “We developed state of the art tools that address all known problems, and we don’t expect others”.
On the name, it’s not ‘responsible’ that bothers me, but rather ‘scaling’.
“Responsible x-ing policy” gives the strong impression that x-ing can be done responsibly, and that x-ing will continue. I’d prefer e.g. “Responsible training and deployment policy”. That way scaling isn’t a baked-in presumption, and we’re naming things that we know can be done responsibly.
Unknown unknowns seem like a totally valid basis for concern.
But I don’t think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause catastrophe before a lab will stop. Without further elaboration I don’t think “unknown unknowns could cause a catastrophe” is enough to convince governments (or AI developers) to take significant actions.
I think RSPs make this situation better by pushing developers away from a vague “Yeah we’ll be safe” towards saying “Here’s what we’ll actually do”, which allows us to have a conversation about whether that specific thing is sufficient to prevent risk early enough. I think this is way better, because vagueness and equivocation make scrutiny much harder.
My own take is that there is a small but non-negligible risk before Anthropic’s ASL-3. For my part I’d vote to move to a lower threshold, or to require more stringent protective measures when working with any system bigger than LLaMA. But I’m not the median voter or decision-maker here (nor is Anthropic), and so I’ll say my piece but then move on to trying to convince people or to find a compromise that works.
The specific conversation is much better than nothing—but I do think it ought to be emphasized that solving all the problems we’re aware of isn’t sufficient for safety. We’re training on the test set.[1]
Our confidence levels should reflect that—but I expect overconfidence.
It’s plausible that RSPs could be net positive, but I think that, given successful coordination, [vague and uncertain] beats [significantly more concrete, but overconfident].
My presumption is that without good coordination (a necessary condition being cautious decision-makers), things will go badly.
RSPs seem likely to increase the odds we get some international coordination and regulation. But to get sufficient regulation, we’d need the unknown unknowns issue to be covered at some point. To me this seems simplest to add clearly and explicitly from the beginning. Otherwise I expect regulation to adapt to issues for which we have concrete new evidence, and to fail to adapt beyond that.
Granted that you’re not the median voter/decision-maker, but you’re certainly one of the most influential voices on the issue, if not the most influential. It seems important not to underestimate your capacity to change people’s views before figuring out a compromise to aim for (I’m primarily thinking of government people, who seem more likely to have views that might change radically based on a few conversations). But I’m certainly no expert on this kind of thing.
I do wonder whether it might be helpful not to share all known problems publicly on this basis—I’d have somewhat more confidence in safety measures that succeeded in solving some problems of a type the designers didn’t know about.