Thanks for making that distinction, Steve. I think the reason things might sound muddled is that many people expect that (1) will drive (2).
Why might one expect (1) to cause (2)? One way to think about it is that, right now, most ML experiments optimistically give 1-2 bits of feedback to the researcher, in the form of whether their loss went up or down from a baseline. If we understand the resulting model, however, that could produce orders of magnitude more meaningful feedback about each experiment. As a concrete example, in InceptionV1, there is a cluster of neurons responsible for detecting 3D curvature and geometry that all form together in one very specific place. It’s pretty suggestive that, if you wanted your model to have a better understanding of 3D curvature, you could add neurons there. So that’s an example where richer feedback could, hypothetically, guide you.
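To make the “1-2 bits” point concrete, here’s a minimal sketch of the two feedback regimes; the loss numbers are made up, and `inspect_features` / `widen_layer` are hypothetical stand-ins for tooling that doesn’t exist yet, not a real interpretability API:

```python
# Toy sketch of the two feedback regimes (made-up numbers; the
# interpretability helpers below are hypothetical stand-ins).

baseline_loss = 0.231
experiment_loss = 0.227

# Today's feedback loop: roughly one bit per experiment --
# did the loss improve over the baseline or not?
improved = experiment_loss < baseline_loss
print("keep the change" if improved else "discard the change")

# A richer, interpretability-informed loop might look more like this,
# where inspect_features stands in for actually understanding which
# circuits formed in each layer:
#
#   features = inspect_features(model, layer="mixed3b")
#   if "3d_curvature" not in features:
#       widen_layer(model, "mixed3b", extra_neurons=16)
#
# i.e. the outcome of an experiment would tell you *where* to intervene,
# not just whether the last intervention helped.
```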
Of course, it’s not actually clear how helpful that richer feedback is! We spent a bunch of time thinking about the model and concluded “maybe it would be especially useful on a particular dimension to add neurons here.” Meanwhile, someone else just went ahead, randomly added a bunch of new layers, tried a dozen other architectural tweaks, and produced much better results. This is what I mean when I say it’s actually really hard to outcompete the present ML approach.
There’s another important link between (1) and (2). Last year, I interviewed a number of ML researchers I respect at leading groups about what would make them care about interpretability. Almost uniformly, the answer was that they wanted interpretability to give them actionable steps for improving their model. This has led me to believe that interpretability will accelerate a lot if it can help with (2), but that’s also the point at which it helps capabilities.