The model that Chris has here is something like a reverse compilation process that turns a neural network into human-understandable code. Chris notes that the resulting code might be truly gigantic (e.g., as large as the entire Linux kernel), but that it would be faithful to the model and understandable by humans.
Does “faithful” mean “100% identical in terms of I/O”, or more like “captures all of the important elements of”? My understanding is that neural networks are continuous whereas human-understandable code like the Linux kernel is discrete, so it seemingly just can’t work in the former case, and I’m not sure how it can work in the latter case either.
Do you or Chris think that a test of this might be to take a toy model (say a 100-neuron ANN) that solves some toy problem, and see if it can be reverse compiled? (Or let me know if this has already been done.) If not, what’s the earliest meaningful test that can be done?
I’m also concerned that combining ML, reverse compilation, and “giving feedback on process” essentially equals programming by nudging, which just seems like a really inefficient way of programming. ETA: Is there an explanation of why this kind of ML would be better (in any sense of that word) than someone starting with a random piece of code and trying to end up with an AI by modifying it a little bit at a time?
ETA2: I wonder if Chris is assuming some future ML technique that learns a lot faster (i.e., is much more sample efficient) than what we have today, so that humans wouldn’t have to give a lot of feedback on process, and “programming by nudging” wouldn’t be a good analogy anymore.
Note: I work on Clarity at OpenAI. Chris and I have discussed this response (though I cannot claim to represent him).
Does “faithful” mean “100% identical in terms of I/O”, or more like “captures all of the important elements of”?
I’d say faithfulness lies on a spectrum. Full I/O determinism on a neural network is nearly impossible (given the vagaries of floating-point arithmetic), but what is really of interest to us is “effectively identical I/O”. A working definition could be: an interpretable network is one that can act as a drop-in replacement for the original network with no impact on final accuracy.
This allows some wiggle room in the weights: they can be rounded up or down, and weights that have no impact on final accuracy can be ablated. We are not, however, interested in creating interpretable approximations of the original network.
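As a rough illustration (my own sketch, not our actual tooling), the “drop-in replacement” criterion could be checked along these lines. Here `predict_fn`, the rounding precision, and the ablation threshold are all assumptions for the example:

```python
# Minimal sketch of the "drop-in replacement" criterion: a simplified copy of
# the network counts as faithful if it matches the original's final accuracy,
# even though individual floating-point outputs may differ slightly.
import numpy as np

def simplify_weights(weights, decimals=2, ablate_below=1e-3):
    """Round weights and zero out those assumed too small to affect accuracy."""
    simplified = {}
    for name, w in weights.items():
        w = np.round(w, decimals=decimals)   # wiggle room: round up or down
        w = np.where(np.abs(w) < ablate_below, 0.0, w)  # ablate negligible weights
        simplified[name] = w
    return simplified

def accuracy(predict_fn, weights, inputs, labels):
    preds = predict_fn(weights, inputs)      # predict_fn: the network's forward pass
    return float(np.mean(np.argmax(preds, axis=1) == labels))

def is_faithful(predict_fn, weights, inputs, labels, tol=0.0):
    """Faithful here means no drop in final accuracy, not bitwise-identical outputs."""
    original = accuracy(predict_fn, weights, inputs, labels)
    simplified = accuracy(predict_fn, simplify_weights(weights), inputs, labels)
    return (original - simplified) <= tol
```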
My understanding is that neural networks are continuous whereas human-understandable code like the Linux kernel is discrete, so it seemingly just can’t work in the former case, and I’m not sure how it can work in the latter case either.
We are reasoning explicitly about numerical code, but I would argue this isn’t that alien to human comprehension! Discrete code may be more intuitive (sometimes), but human cognition is certainly capable of understanding numerical algorithms (think of, say, the SIFT algorithm)!
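For a concrete flavor of what “numerical but human-understandable” looks like (my example, not one from the original discussion), here is a classic hand-written vision routine of the kind that algorithms like SIFT build on:

```python
# A purely numerical routine that is nonetheless easy for humans to read and
# reason about: Sobel edge detection, written out explicitly.
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # horizontal-gradient kernel

def gradient_magnitude(image):
    """Edge strength at each interior pixel of a 2D grayscale image."""
    h, w = image.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * SOBEL_X)
            gy[i, j] = np.sum(patch * SOBEL_X.T)  # vertical kernel is the transpose
    return np.sqrt(gx ** 2 + gy ** 2)
```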
Do you or Chris think that a test of this might be to take a toy model (say a 100-neuron ANN) that solves some toy problem, and see if it can be reverse compiled? (Or let me know if this has already been done.) If not, what’s the earliest meaningful test that can be done?
We are indeed working through this on a fairly sophisticated vision model. We’re making progress!
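To give a sense of what the toy-scale version of such a test might look like (my framing, not a description of our actual experiments), one could train a roughly 100-neuron network on a known rule and measure how often a hand-written candidate “decompiled” program agrees with it:

```python
# Toy-scale sketch: train a small MLP on a simple synthetic rule, then check
# how often a hand-written candidate program reproduces the network's outputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)            # toy ground-truth rule

net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000, random_state=0)
net.fit(X[:1500], y[:1500])                    # ~100-neuron toy network

def decompiled(x):
    """Hand-written candidate program we claim the network implements."""
    return int(x[0] > x[1])

X_test = X[1500:]
agreement = np.mean([decompiled(x) for x in X_test] == net.predict(X_test))
print(f"decompiled program agrees with the network on {agreement:.1%} of inputs")
```

The interesting part, of course, is deriving the candidate program from the weights themselves rather than from prior knowledge of the task.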