We now think this post underemphasized the extent to which control should focus on the lead-up to the first time an AI gets caught. So I added this paragraph to the introduction:
We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we’d aim to ensure the property “if an AI is scheming, it won’t successfully cause a catastrophe before we catch it”. That is, it’s acceptable if either no catastrophic outcomes occur (even though the AI is never caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught, the situation changes substantially; most importantly, it’s now clear that there are serious misalignment problems, and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed.
I also made a few other tiny edits to make this emphasis more clear.