If you assume good faith, then the partial, gappy RSPs we’ve seen are still a major step towards a functional internal policy against developing dangerous AI systems, because you can expect the gaps to be filled in due course. However, if we don’t assume a good-faith commitment to implement a functional version of what a preliminary RSP suggests absent some kind of external pressure, then the RSPs might not be worth much more than the paper they’re printed on.
But even if the RSPs aren’t drafted in good faith and the companies don’t have a strong safety culture (which seems to be true of OpenAI, judging by what Jan Leike said), the RSP commitments can still serve as a foundation for actually effective policies down the line.
For comparison, if a lot of dodgy water companies sign on to a ‘voluntary compact’ to develop some sort of plan for assessing the risk of sewage spills, the risk is probably reduced a bit, but it also becomes easier to develop better requirements later, for example by saying “Our new requirement is the same as last year’s, but now you must publish your risk assessment results openly” and daring them to back out. You can encourage them to compete on PR by making their commitments more comprehensive than their competitors’, creating a virtuous cycle, and it probably draws more attention to the plans than there was before.
The RSP commitments can still serve as a foundation for actually effective policies down the line
+1. I do think it’s worth noting, though, that RSPs might not be a sensible foundation for effective policies.
One of my colleagues recently mentioned that the voluntary commitments from labs are much weaker than some of the things that the G7 Hiroshima Process has been working on.
More tangibly, it’s quite plausible to me that policymakers who think about AI risks from first principles would produce things that are better and stronger than “codify RSPs.” Some thoughts:
It’s plausible to me that when the RSP concept was first being developed, it was a meaningful improvement on the status quo, but the Overton window and awareness of AI risk have moved a lot since then.
It’s plausible to me that RSPs set a useful “floor” – like, hey, this is the bare minimum.
It’s plausible to me that RSPs are useful for raising awareness about risk – like, hey, look, OpenAI and Anthropic are acknowledging that models might soon have dangerous CBRN capabilities.
But there are a lot of implicit assumptions in the RSP frame like “we need to have empirical evidence of risk before we do anything” (as opposed to an affirmative safety frame), “we just need to make sure we implement the right safeguards once things get dangerous” (as opposed to a frame that recognizes we might not have time to develop such safeguards once we have clear evidence of danger), and “AI development should roughly continue as planned” (as opposed to a frame that considers alternative models, like public-private partnerships).
More concretely, I would rather see policy based on things like the recent Bengio paper than RSPs. Examples:
Despite evaluations, we cannot consider coming powerful frontier AI systems “safe unless proven unsafe.” With present testing methodologies, issues can easily be missed. Additionally, it is unclear whether governments can quickly build the immense expertise needed for reliable technical evaluations of AI capabilities and societal-scale risks. Given this, developers of frontier AI should carry the burden of proof to demonstrate that their plans keep risks within acceptable limits. By doing so, they would follow best practices for risk management from industries, such as aviation, medical devices, and defense software, in which companies make safety cases
Commensurate mitigations are needed for exceptionally capable future AI systems, such as autonomous systems that could circumvent human control. Governments must be prepared to license their development, restrict their autonomy in key societal roles, halt their development and deployment in response to worrying capabilities, mandate access controls, and require information security measures robust to state-level hackers until adequate protections are ready. Governments should build these capacities now.
Sometimes advocates of RSPs say “these are things that are compatible with RSPs”, but overall I have not seen RSPs/PFs/FSFs (responsible scaling policies, preparedness frameworks, frontier safety frameworks) that are nearly this clear about the risks, this clear about the limitations of model evaluations, or this clear about the need for tangible regulations.
I’ve feared previously (and continue to fear) that there are some motte-and-bailey dynamics at play with RSPs, where proponents of RSPs say privately (and to safety people) that RSPs are meant to have strong commitments and inspire strong regulation, but in practice the RSPs are very weak and end up conveying an overly rosy picture to policymakers.
One of my colleagues recently mentioned that the voluntary commitments from labs are much weaker than some of the things that the G7 Hiroshima Process has been working on.
Are you able to say more about this?