I’m a researcher on the technical governance team at MIRI.
Views expressed are my own and should not be taken to represent official MIRI positions; views within the technical governance team also vary.
Previously:
Helped with MATS, running the technical side of the London extension (pre-LISA).
Worked for a while on Debate (this kind of thing).
Quick takes on the above:
I think MATS is great-for-what-it-is. My misgivings relate to high-level direction.
Worth noting that PIBBSS exists, and is philosophically closer to my ideal.
The technical AISF course doesn’t have the emphasis I’d choose (which would be closer to Key Phenomena in AI Risk). It’s a decent survey of current activity, but only implicitly gets at fundamentals—mostly through a [notice what current approaches miss, and will continue to miss] mechanism.
I don’t expect research on Debate, or scalable oversight more generally, to help significantly in reducing AI x-risk. (I may be wrong! Some elaboration in this comment thread.)
This seems a helpful model—so long as it’s borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn’t a guarantee.
Thoughts:
It’s not clear to me whether the following counts as a rogue deployment (I’m assuming so):
[un-noticed failure of one safety measure, in a context where all other safety measures are operational]
For this kind of case:
The name “rogue deployment” doesn’t seem a great fit.
In general, it’s not clear to me how to draw the line between:
Safety measure x didn’t achieve what we wanted, because it wasn’t specified/implemented sufficiently well. (not a rogue deployment)
Safety measure x was subverted. (rogue deployment)
For example, I think it’d be reasonable to think of [amazing, long-term jailbreaks] as rogue deployments on this basis: the jailbreak subverts a safety measure, so that “the safety measures are absent” is true in some sense.
It seems important to distinguish things like:
This safety measure appears to be in effect.
This safety measure is running as designed.
We’re getting the safety-improving-property we wanted from this safety measure.
When considering the [Employees of the AI company might run the model in an unauthorized way] case,
I think one central example to consider is of an employee who:
Thinks this is a good idea for the world.
Can make a pretty compelling case to others that it’s a good idea.
The examples in the post seem to focus on [bad intent and/or incompetence]; that focus seems important, but too limited.