The idea of dividing failure stories into “failures involving rogue deployments” and “other failures” seems most useful if the following argument goes through: 1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor 2. Either this labor is done by AIs in approved scaffolds, or it is done in “rogue deployments” 3. Hence the only easy-by-default disaster route is through a rogue deployment 4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe
This seems true for most catastrophes which only need outsider affordances (such as “order the proteins etc for a pandemic or a nanofactory”). Catastrophes with outsider affordances can usually be presumed difficult because rogue humans don’t randomly end the world.
However, this argument for #1 doesn’t work for routes which use insider affordances or responsibilities. Nor for routes which go through work done by future generations of AIs. The central case of both is routes that involve using insider privileges to backdoor or intentionally misalign future AIs. For example: - Planting backdoors in the codebase which allow safety training code to be turned off or misconfigured at a later time - Poisoning datasets (idk if insider access helps though) - Intentionally creating and defending flawed alignment plans
These routes get future generations of AIs to do the difficult work required for the final catastrophe. But they aren’t necessarily rogue deployments. Consider a backdoored future AI running only in authorized scaffolds, where the safety case for the future AI was an alignment case (not a control case). The alignment case fails due to the subversive actions of the earlier AI. So the earlier AI has caused a catastrophe without doing the difficult work of directly taking over the world, and also without a rogue deployment.
One could separately argue that these routes are also “fundamentally hard” (even if not as hard as directly causing a catastrophe), but I don’t see a clear blanket reason.
The idea of dividing failure stories into “failures involving rogue deployments” and “other failures” seems most useful if the following argument goes through
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this and this worth noting. (And indeed the post doesn’t make this argument and discusses subtle manipulation.)
I think subtle manipulation is a reasonably plausible threat model.
The idea of dividing failure stories into “failures involving rogue deployments” and “other failures” seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in “rogue deployments”
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe
This seems true for most catastrophes which only need outsider affordances (such as “order the proteins etc for a pandemic or a nanofactory”). Catastrophes with outsider affordances can usually be presumed difficult because rogue humans don’t randomly end the world.
However, this argument for #1 doesn’t work for routes which use insider affordances or responsibilities. Nor for routes which go through work done by future generations of AIs. The central case of both is routes that involve using insider privileges to backdoor or intentionally misalign future AIs. For example:
- Planting backdoors in the codebase which allow safety training code to be turned off or misconfigured at a later time
- Poisoning datasets (idk if insider access helps though)
- Intentionally creating and defending flawed alignment plans
These routes get future generations of AIs to do the difficult work required for the final catastrophe. But they aren’t necessarily rogue deployments. Consider a backdoored future AI running only in authorized scaffolds, where the safety case for the future AI was an alignment case (not a control case). The alignment case fails due to the subversive actions of the earlier AI. So the earlier AI has caused a catastrophe without doing the difficult work of directly taking over the world, and also without a rogue deployment.
One could separately argue that these routes are also “fundamentally hard” (even if not as hard as directly causing a catastrophe), but I don’t see a clear blanket reason.
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this and this worth noting. (And indeed the post doesn’t make this argument and discusses subtle manipulation.)
I think subtle manipulation is a reasonably plausible threat model.