You’re totally right that this is an important difficulty I glossed over, thanks!
TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can’t supervise, and this ingredient could be interpretability. On the other hand, there’s at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).
Just to restate things a bit, I’d distinguish two cases:
“In-distribution anomaly detection:” we are fine with flagging any input as “anomalous” that’s OOD compared to the trusted distribution
“Off-distribution anomaly detection:” there are some inputs that are OOD but that we still want to classify as “normal”
In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we’re fine with flagging everything that’s OOD.
But we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here and in the following subsection. Exlusion finetuning (appendix I in Redwood’s measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training.
I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they’re just “use a measurement tampering detector to bootstrap to a full ELK solution”).
To be clear, I’m far from confident that approaches like this will work, but getting to the point where we could solve measurement tampering via interp also seems speculative in the foreseeable future. These two bets seem at least not perfectly correlated, which is nice.
You’re totally right that this is an important difficulty I glossed over, thanks!
TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can’t supervise, and this ingredient could be interpretability. On the other hand, there’s at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).
Just to restate things a bit, I’d distinguish two cases:
“In-distribution anomaly detection:” we are fine with flagging any input as “anomalous” that’s OOD compared to the trusted distribution
“Off-distribution anomaly detection:” there are some inputs that are OOD but that we still want to classify as “normal”
In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we’re fine with flagging everything that’s OOD.
But we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here and in the following subsection. Exlusion finetuning (appendix I in Redwood’s measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training.
I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they’re just “use a measurement tampering detector to bootstrap to a full ELK solution”).
To be clear, I’m far from confident that approaches like this will work, but getting to the point where we could solve measurement tampering via interp also seems speculative in the foreseeable future. These two bets seem at least not perfectly correlated, which is nice.