This seems a helpful model—so long as it’s borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn’t a guarantee.
Thoughts:
1. It’s not clear to me whether the following counts as a rogue deployment (I’m assuming so):
[unnoticed failure of one safety measure, in a context where all other safety measures are operational]
For this kind of case:
The name “rogue deployment” doesn’t seem a great fit.
In general, it’s not clear to me how to draw the line between:
Safety measure x didn’t achieve what we wanted, because it wasn’t specified/implemented sufficiently well. (not a rogue deployment)
Safety measure x was subverted. (rogue deployment)
For example, I think it’d be reasonable to think of [Amazing, long-term jailbreaks] as rogue deployments on this basis: the jailbreak is subverting a safety measure, so that “the safety measures are absent” is true in some sense.
It seems important to distinguish things like the following (see the sketch after this list):
This safety measure appears to be in effect.
This safety measure is running as designed.
We’re getting the safety-improving-property we wanted from this safety measure.
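To make that three-way distinction concrete, here’s a minimal sketch for an imagined output-filter safety measure (every name and check below is hypothetical, not anything from the post): a check that the measure appears enabled, a check that it behaves as its spec says on known cases, and an end-to-end check that the property we actually care about holds.

```python
# Hypothetical illustration of the three levels above, using an imagined
# "output filter" safety measure. Every name and check here is made up.
from typing import Callable, Dict, Iterable, List, Tuple

Filter = Callable[[str], str]  # takes a raw model completion, returns a filtered one


def appears_in_effect(config: Dict[str, bool]) -> bool:
    # Level 1: the measure *appears* to be in effect (a config flag is set).
    return config.get("output_filter_enabled", False)


def running_as_designed(filter_fn: Filter, spec_cases: List[Tuple[str, str]]) -> bool:
    # Level 2: the measure is *running as designed*: it maps known test
    # completions to exactly the outputs its spec says it should.
    return all(filter_fn(raw) == expected for raw, expected in spec_cases)


def achieves_safety_property(
    filter_fn: Filter,
    red_team_completions: Iterable[str],
    is_unsafe: Callable[[str], bool],
) -> bool:
    # Level 3: we're *getting the property we wanted*: nothing unsafe gets
    # past the filter, even on adversarially chosen completions.
    return not any(is_unsafe(filter_fn(c)) for c in red_team_completions)
```

A long-lived jailbreak or a quiet implementation gap could leave the first two checks passing while the third silently fails, which is roughly the gap this list is pointing at.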
2. When considering the [Employees of the AI company might run the model in an unauthorized way] case, I think one central example to consider is that of an employee who:
Thinks this is a good idea for the world.
Can make a pretty compelling case to others that it’s a good idea.
The examples in the post seem to focus on [bad intent and/or incompetence], which seems important, but too limited.
Re 1, I don’t count that as a rogue deployment.
The distinction I’m trying to draw is: did the model end up getting deployed in a way that involves a different fundamental system architecture than the one we’d designed?
Agreed re 2.
Ah okay, that’s clarifying. Thanks.
It still seems to me that there’s a core similarity across all cases of [model is deployed in a context without fully functional safety measures], and that this can happen either via a rogue deployment or via an action that subverts some safety measure within a standard deployment.
In either case the model becomes [much more able to take the huge number of sketchy actions that are probably required to cause the catastrophe].
Granted, by default I’d expect [compromised safety measure(s)] → [rogue deployment] → [catastrophe], rather than [compromised safety measure(s)] → [catastrophe].
However, supposing we had a robust means of preventing rogue deployments, I don’t think it’s clear that we’d be left with good odds of catching sketchy behavior.
That may be true if we had an independent chance of noticing sketchiness for each action, but I’d expect there to be strategies where [our failure to notice sketchiness] correlates strongly across many actions (see the toy numbers below). A rogue deployment is one such strategy, but I’d be surprised if it’s the only one.
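As a toy illustration of the independence point (the numbers here are invented, purely for illustration): with even a modest per-action catch probability, independent checks across many actions make catching at least one sketchy action near-certain, whereas a strategy that correlates our misses keeps the overall odds at the single-action level.

```python
# Toy numbers, purely illustrative: p = per-action catch probability,
# n = number of sketchy actions needed for the catastrophe.
p, n = 0.05, 100

# If misses are independent: P(catch at least one) = 1 - (1 - p)^n.
p_catch_independent = 1 - (1 - p) ** n  # ~0.994

# If misses are perfectly correlated (one mechanism blinds us to all of them),
# missing one means missing all, so P(catch at least one) stays at p.
p_catch_correlated = p  # 0.05

print(f"independent: {p_catch_independent:.3f}  correlated: {p_catch_correlated:.2f}")
```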