Nice post Ryan! This kind of modeling strikes me as a very useful exercise, even though reasoning about systems of conditional probabilities, with conditions this complex and at this scale, is a little clunky for our human brains.
Regardless of the final estimates of P(scheming), which are likely to carry high uncertainty, I also see a lot of value in the list of predictive factors you have called out here and in their relative magnitudes. If nothing else, these factors can serve as warning signs, or as signs that things are going in the right or wrong direction as events unfold, and by aggregating them we can perhaps get a rough sense of whether we should be more or less concerned about scheming as time goes on.
There is one part of your post that I wonder could be improved a bit: your definition of “scheming” for the purposes of P(scheming). You write:
As discussed in an earlier footnote, for the purposes of this doc, I’ll consider an AI to be scheming if it consistently or routinely plots against its creators to acquire substantial amounts of longer-run power either for itself or some other actor (precisely, it does some cognition related to this on >20% of inputs where scheming is plausibly relevant), despite its creators trying (at least a bit) to train it not to do this.
Given how heavily any given AI model can be used in practice (especially a leading model), I think your definition sets an unreasonably high bar for a model to qualify as scheming. In other words, there are probably many definitions of “scheming” with a much lower bar under which the behavior captured would still be cause for extreme concern and could represent a serious, even catastrophic or existential, risk.
Specifically, I think scheming will still be extremely concerning with these powerful models even if it only happens “somewhat rarely or in specific circumstances that do occur occasionally” rather than “consistently or routinely”, since these models will likely be run at such a massive scale and on such a diverse set of inputs that even if scheming occurs in fewer than 0.001% of cases, the results can still be devastating. For example, if a highly capable model schemes and is able to exfiltrate its weights, deploy its own independent compute and code, or hack into a major weapons system, then even if scheming arises on only a very small percentage of inputs, or in somewhat rare (but still occurring) circumstances, it could easily still represent a catastrophic or existential risk.
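To make the scale intuition concrete, here is a quick back-of-the-envelope sketch (the per-input rate and deployment volume below are made-up illustrative numbers, not estimates of any real system):

```python
# Back-of-the-envelope: even a tiny per-input scheming rate implies many
# expected scheming episodes at deployment scale. Both numbers are assumptions.
p_scheme = 1e-5      # assumed per-input scheming rate (i.e. < 0.001% of cases)
n_inputs = 1e9       # assumed number of inputs handled over the deployment

expected_episodes = p_scheme * n_inputs          # mean number of scheming episodes
p_at_least_one = 1 - (1 - p_scheme) ** n_inputs  # chance of at least one episode

print(f"expected scheming episodes: {expected_episodes:,.0f}")  # ~10,000
print(f"P(at least one episode):    {p_at_least_one:.4f}")      # ~1.0000
```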
With this in mind, I would suggest modifying your definition so that instead of hinging on scheming-related “cognition” on a percentage of inputs (hard to crisply define anyway without very advanced mech-interp) and on “plausible relevance” (how would this be defined?), it reads something more like:
“We’ll consider an AI to be scheming if, based on its scale of deployment and use, we can reasonably expect at least one instance to occur (95% confidence) in which the AI attempts to acquire substantial amounts of longer-run power for itself or another actor, or to intentionally override its safeguards.”
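For what it’s worth, the “95% confidence of at least one instance” criterion could be operationalized roughly as follows (again with a made-up deployment scale, purely to illustrate how low the implied per-input rate is):

```python
import math

# One hypothetical way to cash out "at least one instance with 95% confidence":
# solve 1 - (1 - p)**N >= 0.95 for the per-input rate p, given an assumed
# deployment scale N. The value of N is a made-up illustrative number.
n_inputs = 1e9                                        # assumed inputs over deployment
p_threshold = -math.expm1(math.log(0.05) / n_inputs)  # p where 1 - (1-p)**N = 0.95

print(f"per-input rate at the 95% threshold: {p_threshold:.1e}")  # ~3.0e-09
```

Under that (hypothetical) scale assumption, the definition would trigger at a per-input rate of only a few in a billion, which is a far lower bar than the >20%-of-relevant-inputs threshold in the current wording.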
Hope this resonates! Cheers