I very much agree with this concern, and I think synthetic biology can be a useful comparison case to ground our intuitions and help estimate reasonable priors.
For years, researchers have been sounding the alarm about the risks of advanced biotech, especially tools that enable gene synthesis and editing. And then we had Covid-19, a virus that, regardless of the politicization, was probably created in a lab. Whether or not you believe it was, it seems clear that it easily could have been. Worse, it's clear that something like Covid could also be created relatively easily by an independent team with a fairly modest budget. This case seems analogous to the kind of "close call" described by @Buck. In many ways Covid was worse than the example Buck gives, because millions of people died and the economy suffered trillions of dollars in damage, so one could argue it is closer to an even worse case: an AI that briefly escapes and wreaks some havoc before being contained.
Yet the end result is that governments around the world have done very little to regulate synthetic biology since Covid. Even the executive order earlier this year was pretty minimal and applies only to limited types of vendors, customers, and sources of DNA.
Why don't they regulate? I suspect the reasons are mostly the same as the ones @Buck mentions in his post. Regulation would stifle one of the most promising new areas of technological innovation and could cut into the biotech sector's profits. Then there is the worry that if the US regulates, China won't slow down and will win the race to new medications and other innovations, and so on.
Nice post, Ryan! This kind of modeling strikes me as a very useful exercise, even though reasoning at this scale about systems of conditional probabilities, each resting on complexly described conditions, is a little clunky for human brains.
Regardless of the final estimates of P(scheming), which are likely to be highly uncertain, I also see a lot of value in the list of predictive factors you have called out here and in their relative magnitudes. If nothing else, these factors can serve as warning signs, or as signals that things are moving in the right or wrong direction as events unfold. By aggregating factors like these, we may get a rough sense of whether we should become more or less concerned about scheming over time.
One part of your post that I think could be improved a bit is your definition of "scheming" for the purposes of P(scheming). You write:
Given how heavily AI models (especially leading models) are used in practice, I think your definition sets an unreasonably high bar for a model to qualify. In other words, there are probably many definitions of "scheming" with a much lower bar under which the scheming-like behavior would still be cause for extreme concern and could pose a serious, even catastrophic or existential, risk.
Specifically, I think scheming will still be extremely concerning in these powerful models even if it happens only "somewhat rarely or in specific circumstances that do occur occasionally" rather than "consistently or routinely". These models will likely be run at such a massive scale and on such a diverse set of inputs that even if scheming occurs in fewer than 0.001% of cases, the results could be devastating. For example, if a highly capable model that schemes only on a tiny fraction of inputs, or in rare (but still occurring) circumstances, manages even once to exfiltrate its weights, deploy its own independent compute and code, or hack into a major weapons system, that could still easily amount to a catastrophic or existential outcome.
With this in mind, I would suggest modifying your definition. Rather than framing it in terms of scheming-related "cognition" on some percentage of inputs (hard to define crisply without very advanced mech interp anyway) and "plausible relevance" (how would we define this?), it might be something more like:
"We'll consider an AI to be scheming if, based on its scale of deployment and use, we can reasonably expect (with 95% confidence) at least one instance in which the AI attempts to acquire substantial amounts of longer-run power for itself or another actor, or intentionally overrides its safeguards."
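To make the arithmetic behind this criterion concrete, here is a rough sketch. It assumes, as a big simplification, that every model invocation independently carries the same small probability p of a scheming episode (real inputs are neither independent nor identically distributed), and the function names are purely illustrative. Under that assumption, the chance of at least one episode across n invocations is 1 − (1 − p)^n, so the 95% criterion is met once n ≥ ln(0.05) / ln(1 − p).

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """P(at least one scheming episode) over n invocations.

    Assumes (hypothetically) that each invocation independently has the
    same per-invocation probability p of a scheming episode.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n * math.log1p(-p))


def min_calls_for_confidence(p: float, confidence: float = 0.95) -> int:
    """Smallest n with P(at least one episode) >= confidence, same assumptions."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= c  =>  n >= log(1 - c) / log(1 - p)
    return math.ceil(math.log1p(-confidence) / math.log1p(-p))


if __name__ == "__main__":
    p = 1e-5  # the "0.001% of cases" figure, treated as a per-invocation rate
    print(min_calls_for_confidence(p))      # deployment size meeting the 95% bar
    print(prob_at_least_one(p, 1_000_000))  # hypothetical: one million invocations
```

With p = 10⁻⁵, the threshold works out to roughly 300,000 invocations, which is likely small relative to the scale at which a leading model would be deployed. This is the intuition behind tying the definition to deployment scale rather than to a fixed percentage of inputs.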
Hope this resonates! Cheers