I get the impression of a certain motte-and-bailey quality in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there’s a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like “I don’t understand how “Looking at random bits of the model and identify circuits/features” will help with deception”). And in general a lot of this is of the form “I don’t see how X”, which is the format I’m objecting to, because of course you won’t see how X until someone invents a technique to X.
This is exacerbated by the meta-level problem that people have very different standards for what’s useful (e.g. to Eliezer, none of this is useful), and also different standards for what types of evidence and argument they accept (e.g. to many ML researchers, approximately all arguments about long-term theories of impact are too speculative to be worth engaging with in depth).
I still think the reason so many people are working on interpretability is mainly that they don’t see alternatives that are as promising; in general I’d welcome writing that clearly lays out solid explanations and intuitions for why those other research directions are worth working on, and I think that would be the best way to recalibrate the field.
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think the ratio of solid explanations to shaky ones is relatively high.
Overall, I think the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people have been offering similar takes, with little change, for 7+ years, and I just don’t think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it's probably not a crux.