Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:
We might want benchmarks where human researchers can be part of the anomaly detector. For example, if a plan proposed by an AI is sufficiently important, we might in the future have a whole team of humans manually study it to see whether it involves any measurement tampering. Right now, we might want to let interpretability researchers manually study what happens on individual test-time inputs and judge whether anything about them is anomalous.
The challenge is that humans might be able to just look at the output and tell whether it's right, which would make many simple MAD (mechanistic anomaly detection) tasks, like detecting backdoors, trivial in this setting.
We could simply demand that MAD algorithms be cheap enough to run that only automated approaches are feasible. This makes sense for some applications of MAD (namely ones we'd want to run on very many deployment inputs, where being cheap actually matters). But you could imagine cases where, in reality, we'd be totally fine with spending many person-hours analyzing whether a single AI decision was made for anomalous reasons. So it might be nice to have benchmarks that capture this setting.
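To make the "cheap enough to be automated" requirement concrete, a benchmark harness could cap the per-input compute a detector gets. Here's a minimal sketch in Python, where `detector` and its `score(activations)` method are hypothetical stand-ins for whatever interface the benchmark actually defines:

```python
import time

def run_detector_with_budget(detector, activations, budget_seconds=1.0):
    """Run a (hypothetical) anomaly detector on one input and check that it
    stays within a per-input time budget, so that only cheap, automated
    methods qualify for the benchmark."""
    start = time.perf_counter()
    score = detector.score(activations)  # hypothetical interface: returns an anomaly score
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise RuntimeError(
            f"detector took {elapsed:.2f}s, over the {budget_seconds:.2f}s budget"
        )
    return score
```

A FLOP or dollar-cost budget would work the same way; the point is just that the budget rules out "have a human stare at it" as a detection strategy.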
We could use some tricks to prevent humans from simply noticing that the output is weird. For example, we could make a backdoor that only shifts the logits slightly, in a very specific pattern. Or we could make a backdoor but then only test on samples where the backdoor label and the correct label happen to coincide. But these tricks seem unfair: the philosophically right explanation of the clean data in fact still explains the output on backdoored data pretty well.
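To illustrate the second trick, here is a minimal sketch of filtering a poisoned test split down to the samples where the backdoor's target label happens to coincide with the ground-truth label (the sample fields `has_trigger`, `backdoor_label`, and `true_label` are hypothetical):

```python
def filter_coinciding_backdoor_samples(test_set):
    """Keep only triggered samples whose backdoor target equals the true label,
    so a human who only checks output correctness sees nothing wrong.

    Each sample is assumed to be a dict with hypothetical keys
    'has_trigger', 'backdoor_label', and 'true_label'.
    """
    return [
        s for s in test_set
        if s["has_trigger"] and s["backdoor_label"] == s["true_label"]
    ]
```

On such a split, output correctness gives a human reviewer no signal, which is the intended effect of the trick.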
We could try to design benchmarks in narrow domains where AIs are in fact superhuman and it's non-trivial to tell whether an output is correct. Juggling this along with the other constraints on MAD or MTD (measurement tampering detection) benchmarks might be tricky, but I haven't thought about it a lot; maybe it's already doable.
For now, I think the main reason to have benchmarks like this would be to let interpretability researchers manually decide whether something is anomalous instead of forcing them to automate the process immediately. But it might be better to just pick the low-hanging fruit first and only allow automated MAD algorithms. (We could still have a labeled validation set where researchers can try things out manually.)