Ah okay, that’s clarifying. Thanks.
It still seems to me that there’s a core similarity across all cases of [model is deployed in a context without fully functional safety measures], which can arise either via rogue deployment or via any action that subverts some safety measure in a standard deployment.
In either case the model becomes [much more able to take the huge number of sketchy actions that are probably required to cause the catastrophe].
Granted, by default I’d expect [compromised safety measure(s)] → [rogue deployment] → [catastrophe]
rather than [compromised safety measure(s)] → [catastrophe].
However, supposing we had a robust means to avoid rogue deployment, I don’t think it’s clear that we’d be left with good odds of catching sketchy behavior.
We might have good odds if we had an independent chance to notice sketchiness on each action, but I’d expect there to be strategies where [our failure to notice sketchiness] correlates strongly over many actions. A rogue deployment is one such strategy, but I’d be surprised if it’s the only one.
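
To make the independence point concrete, here is a toy calculation. Everything in it is an assumption made for illustration: the function names, the per-action catch probability, the number of actions, and especially the crude model of correlation, in which a single shared failure (with probability `p_blind`) blinds oversight for every action at once.

```python
def p_evade_independent(p_catch: float, n_actions: int) -> float:
    """Chance that all n actions go unnoticed when each one is
    caught independently with probability p_catch."""
    return (1.0 - p_catch) ** n_actions


def p_evade_correlated(p_catch: float, n_actions: int, p_blind: float) -> float:
    """Same question under a toy correlated-failure model (an assumption):
    with probability p_blind one shared failure means no action is ever
    noticed; otherwise detection is independent per action as above."""
    return p_blind + (1.0 - p_blind) * p_evade_independent(p_catch, n_actions)


if __name__ == "__main__":
    # Hypothetical numbers: 1% chance to catch each action, 1000 actions needed.
    print("independent:", p_evade_independent(0.01, 1000))
    # Same setup, plus a 5% chance that oversight is blind across the board.
    print("correlated: ", p_evade_correlated(0.01, 1000, 0.05))
```

With these made-up numbers, the independent case leaves the model well under a 1-in-10,000 chance of getting through all 1000 actions unnoticed, while the correlated case sits at roughly `p_blind` no matter how many actions are needed. The number of required actions only protects us to the extent that detection failures are independent.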