Thanks, I’ll summarize my takeaways in the form of an argument that control is sufficient to safely accelerate alignment research by 30x. This is certainly not the only argument one could make, but I like it because it is clear and concrete.
I’ll specifically focus on justifying that the red team is competitive.
If your AI systems can accelerate AI safety research by 30x, then they are producing effectively superhuman outputs; however, this does not imply they can produce superhuman evasion plans, because the AI systems cannot devote as much of their thinking to evading control measures as they devote to safety research.
Suppose that the AI systems have approximately human capabilities and are sped up by 30x.
Suppose also that the human red team has spent 2 months identifying evasion strategies (call this 'coup research', following your control paper).
I’ll argue that the AI systems will not be able to do more than the equivalent of 2 months of coup research. Since our red team has lots of advantages over the AI systems, this is a conservative argument that the red team is competitive.
Suppose that we only want to justify that the red team is competitive for the next two months (and we’ll continuously search for attack strategies in order to maintain this 1 to 1 ratio).
Two months of deployment is equivalent to 5 years of human thinking since our AI system is sped up by 30x. We need to show that no more than 1 / 30th of this thinking is coup research.
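To make the time-budget arithmetic explicit, here is a minimal sketch; the 30x speedup and 2-month window are just the figures assumed above:

```python
# Time-budget arithmetic for the argument above.
# Assumptions: the AI runs at a 30x speedup, the deployment window is 2 months,
# and the red team has 2 months of coup research to match.

speedup = 30            # AI thinking speed relative to a human
deployment_months = 2   # deployment window we want to cover

# Effective human-equivalent thinking produced during deployment
human_equivalent_months = deployment_months * speedup    # 60 months
human_equivalent_years = human_equivalent_months / 12    # 5 years

# Coup research budget we can tolerate while matching the red team's 2 months
max_coup_fraction = deployment_months / human_equivalent_months  # 1/30

print(human_equivalent_years)  # 5.0
print(max_coup_fraction)       # 0.0333... == 1/30
```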
Training algorithms might pressure AI systems to be at least somewhat efficient about doing useful work (and coup research isn’t useful for the task at hand). Perhaps this implies a 1 / N coup research budget.
Furthermore, suppose the AI systems’ reasoning is externalized. Monitoring setups are very likely to catch coup research if it is present at a frequency greater than 1 / M.
Combining these bounds, the fraction of thinking spent on coup research is at most min(1 / N, 1 / M), so it suffices that min(1 / N, 1 / M) ≤ 1 / 30.
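As a toy check of that condition (the N and M values below are purely illustrative placeholders, not numbers from the argument):

```python
# Toy check of the condition min(1/N, 1/M) <= 1/30.
# N: efficiency-pressure bound -- at most 1/N of thinking is coup research.
# M: monitoring bound -- coup research above a 1/M frequency is very likely caught.

def red_team_is_competitive(N: float, M: float, speedup: float = 30) -> bool:
    """Return True if the implied coup-research budget is within 1/speedup."""
    coup_fraction_bound = min(1 / N, 1 / M)
    return coup_fraction_bound <= 1 / speedup

# Example: if monitoring catches coup research above a 1/100 frequency (M = 100),
# the condition holds even if efficiency pressure alone is weak (N = 10).
print(red_team_is_competitive(N=10, M=100))  # True
print(red_team_is_competitive(N=10, M=20))   # False: neither bound is tight enough
```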
I think I buy this argument or some variation of it.