We should be careful to separate two levels of understanding: (1) We can understand the weights and activations of a particular trained model, versus (2) We can understand why a particular choice of architecture, learning algorithm, and hyperparameters is a good (effective) choice for a given ML application.
I think that (1) is great for AGI safety, whereas (2) does a lot for capabilities and not much for safety.
So bringing up Neural Architecture Search is not necessarily the most relevant thing, since NAS is about (2), not (1).
For my part, I’m expecting that the community will “by default” make progress on (2), such that researchers using Neural Architecture Search will naturally be outcompeted by researchers who understand why to use a certain architecture and hyperparameters. Whereas I feel like (1) is the very important thing that won’t necessarily happen automatically, unless people like Chris Olah keep doing the hard work to make it a community priority.
Thanks for making that distinction, Steve. I think the reason things might sound muddled is that many people expect that (1) will drive (2).
Why might one expect (1) to cause (2)? One way to think about it is that, right now, most ML experiments optimistically give 1–2 bits of feedback to the researcher, in the form of whether their loss went up or down from a baseline. If we understand the resulting model, however, that could produce orders of magnitude more meaningful feedback about each experiment. As a concrete example, in InceptionV1, there is a cluster of neurons responsible for detecting 3D curvature and geometry that all form together in one very specific place. It's pretty suggestive that, if you wanted your model to have a better understanding of 3D curvature, you could add neurons there. So that's an example where richer feedback could, hypothetically, guide you.
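To make the "1–2 bits" framing concrete, here is a minimal back-of-the-envelope sketch. The numbers (512 neurons, 8 qualitative roles) are purely illustrative assumptions, not figures from the discussion; the point is just the gap in channel width between a binary loss comparison and a per-neuron inspection.

```python
import math

# A train-vs-baseline comparison yields one binary outcome:
# "did the loss go down?" -- at most 1 bit per experiment.
def bits_from_binary_comparison() -> float:
    return math.log2(2)  # two distinguishable outcomes -> 1 bit

# If interpretability instead let us classify each of n inspected neurons
# into one of k qualitative roles (n and k are illustrative assumptions),
# the feedback channel is orders of magnitude wider.
def bits_from_neuron_inspection(n_neurons: int, n_roles: int) -> float:
    return n_neurons * math.log2(n_roles)

print(bits_from_binary_comparison())        # 1.0
print(bits_from_neuron_inspection(512, 8))  # 1536.0
```

This is only a toy information-theoretic framing; real interpretability feedback is not independent bits per neuron, but the contrast in scale is the point.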
Of course, it’s not actually clear how helpful it is! We spent a bunch of time thinking about the model and concluded “maybe it would be especially useful on a particular dimension to add neurons here.” Meanwhile, someone else just went ahead and randomly added a bunch of new layers and tried a dozen other architectural tweaks, producing much better results. This is what I mean about it actually being really hard to outcompete the present ML approach.
There’s another important link between (1) and (2). Last year, I interviewed a number of ML researchers I respect at leading groups about what would make them care about interpretability. Almost uniformly, the answer was that they wanted interpretability to give them actionable steps for improving their model. This has led me to believe that interpretability will accelerate a lot if it can help with (2), but that’s also the point at which it helps capabilities.