I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment—for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways), but with interpretability tools that good we could probably just make a plan without Godzilla that has a decent chance of working (or at least fails in detectable-in-advance ways) by doing basically the same things minus Godzilla. It’s the interpretability tools which take that plan from “close to zero chance of working” to “close to 100% chance of working”; the interpretability is where all the robustness comes from. The Godzilla part adds relatively little and is plausibly net negative (due to making the ML components more complex and brittle).
(Another minor point: “adding more safeguards decreases the likelihood they’ll all fail simultaneously, as long as there isn’t a perfect correlation of failure modes” is only true when the “safeguards” are guaranteed to not increase the chance of failure.)
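To put rough numbers on that caveat, here is a minimal toy calculation (all probabilities are invented for illustration, and the parameter names p_base, p_catch, q_added are mine): under independence, stacking safeguards does shrink the chance that every layer fails at once, but a "safeguard" that can itself cause failures can raise the overall failure probability.

```python
# Toy numbers, purely illustrative.
p_base = 0.10    # chance the base plan fails on its own
p_catch = 0.50   # chance a given safeguard catches a base-plan failure
q_added = 0.20   # chance the safeguard itself introduces a new failure

# The quoted claim: stacking safeguards with independent failure modes
# shrinks the chance that a base-plan failure slips past every layer.
for n in range(4):
    p_slips_through = p_base * (1 - p_catch) ** n
    print(f"{n} safeguards: P(failure slips through) = {p_slips_through:.4f}")

# The caveat: if a safeguard can itself cause failures (extra complexity,
# brittle ML components), total failure probability can go *up*.
p_without = p_base
p_with = 1 - (1 - p_base * (1 - p_catch)) * (1 - q_added)
print(f"without safeguard: P(failure) = {p_without:.2f}")  # 0.10
print(f"with safeguard:    P(failure) = {p_with:.2f}")     # ~0.24
```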
“And—as stated, each of these safeguards might be implemented with the assistance of prior AIs in various ways.”
I see two main things you could have in mind here.
First, maybe you imagine training a magic black box to tell us what’s going on inside another magic black box. This is a Godzilla strategy, and fails for the usual Godzilla reasons: errors are not recoverable. If the magic interpretability box fails, we don’t have a built-in way to notice. And no, training lots of magic black boxes to detect lots of things does not really fix the problem; the failure modes are extremely highly correlated. We don’t even need to suppose deception—just a distribution shift in the cognition of the system will cause highly correlated failures, and a distribution shift in cognition is exactly the sort of thing we’d expect from a system just starting to grok consequentialist reasoning.
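As a concrete illustration of the correlation point, here is a toy sketch (the setup, features, and numbers are all invented for illustration): several detectors trained separately on the same in-distribution data end up making nearly the same mistakes once the distribution shifts, so stacking them buys far less than independence would suggest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    # Two features: in-distribution, the label depends on x0 and x1 tracks
    # x0 closely; after the "shift", x1 becomes much noisier.
    x0 = rng.normal(size=n)
    x1 = x0 + rng.normal(scale=0.1, size=n) + shift * rng.normal(scale=3.0, size=n)
    y = (x0 > 0).astype(int)
    return np.column_stack([x0, x1]), y

X_train, y_train = sample(5000)
X_shift, y_shift = sample(5000, shift=1.0)

# An "ensemble" of detectors, each trained on a different random subset
# of the same in-distribution data.
detectors = []
for _ in range(5):
    idx = rng.choice(len(X_train), size=2000, replace=False)
    detectors.append(LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx]))

errors = np.array([clf.predict(X_shift) != y_shift for clf in detectors])
print("per-detector error rate under shift:", errors.mean(axis=1).round(3))
print("fraction of inputs where ALL detectors err:", errors.all(axis=0).mean().round(3))
# If failures were independent, the "all err" rate would be roughly the
# product of the individual rates; because the detectors learned nearly
# the same thing, it instead stays close to the individual rates.
```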
On the other hand, maybe you imagine AIs serving as research assistants, rather than using AIs directly to interpret other AIs. That plan does have problems, but is basically not a Godzilla plan; the human in the loop means that the standard Godzilla brittleness issue doesn’t really apply.