Joe Collman comments on Interpretability Externalities Case Study—Hungry Hungry Hippos

Joe Collman 3 Oct 2023 22:40 UTC
LW: 3 AF: 2
A couple of unconnected points:
> Mostly I think that MI is right to think it can do a lot for alignment, but I suspect that lots of the best things it can do for alignment it will do in a very dual-use way, which skews heavily towards capabilities. Mostly because capabilities advances are easier and there are more people working on those.
This doesn’t clearly follow: one way for x to be easier is [there are many ways to do x, so that it’s not too hard to find one]. If it’s easy to find a few ways to get x, giving me another one may not help me at all. If it’s hard to find any way to do x, giving me a workable approach may be hugely helpful.
(I’m not making a case one way or another on the main point: I don’t know the real-world data on this, and it’s also entirely possible that the bar on alignment is so high that most/all MI isn’t useful for alignment.)
> I mention this as evidence for why I expect targeted approaches are faster and cheaper than ambitious ones
I’m not entirely sure I understand you here, but if I do, my response would be: targeted approaches may be faster and cheaper at solving the problems they target. Ambitious approaches are more likely to help solve problems that you didn’t know existed, and didn’t realize you needed to target.
If targeted approaches are being used for [demonstrate that problems of this kind are possible], I expect they are indeed faster and cheaper. If we’re instead talking about their use as part of an alignment solution, targeted approaches seem likely to be ~irrelevant (of course I’d be happy if I’m wrong on this!).
(again, assuming I understand how you’re using ‘targeted’ / ‘ambitious’)