Nice, I like this concept of rogue deployment, as it highlights two distinct properties that are both required for a safety method to be considered ‘successful’. As I understand it, catastrophe with rogue deployment means the safety measures were good enough but were bypassed or turned off, whereas catastrophe without rogue deployment means the safety measures were fully operational the whole time but insufficient to prevent a model or human actor from causing a catastrophe.
So for example, we could get really great mech. interp tools, but avoiding catastrophe isn’t guaranteed if all of these tools are running on a single server (making them very easy to disable). To help prevent rogue deployment we’d want multiple servers running these tools, providing redundancy in case one goes down, gets hacked, etc. So there are two concepts here: the raw effectiveness of a safety method, and its reliability. I’m sure others can think of more nuanced examples too.