In programming, it’s often easier to write new code from scratch than to try to understand someone else’s code, especially if the other person’s code is optimized for something other than human-understandability. See here for an example, where I wrote:
Many of the algorithms and tables used here came from the deflate implementation
by Jean-loup Gailly, which was included in Crypto++ 4.0 and earlier. I completely
rewrote it in order to fix a bug that I could not figure out. This code
is less clever, but hopefully more understandable and maintainable.
Since human-understandability is costly to evaluate (and hence to train for), and also costly in terms of causing lower performance on other metrics (note that the code I wrote to be more understandable is significantly slower than the original code), I have strong doubts about this line of research.
My guess is that if you took a human-level AGI that was the result of something like deep learning optimizing only for capability (and not understandability), and tried to interpret it as pseudocode, you’d end up with so many modules with so many interactions between them that no human or team of humans could understand it. In other words, you’d end up with spaghetti code written by a superintelligence (meaning the training process).
If you instead tried to optimize for both capability and understandability at the same time, you’d have a much harder ML problem on your hands, maybe even an impossible one.
Perhaps if an AGI is built out of modules that are separately trained, instead of being trained end-to-end, you could use this idea on some of the smaller modules that are especially important to safety. I’m curious if that’s the kind of plan you have in mind, or if you’re more ambitious about this approach.
This response is rather late, but basically my hope is that it’s possible to optimise for understandability by regularising for some relatively simple quantity that induces understandability.
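For concreteness, here is a minimal sketch of what “regularising for some relatively simple quantity” could look like in PyTorch. The L1 weight penalty is only a stand-in for whatever quantity actually induces understandability (no particular regulariser is being proposed here), and the coefficient is arbitrary.

```python
import torch
import torch.nn as nn

# Toy model and optimiser; the specific architecture is irrelevant to the point.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
lam = 1e-4  # weight on the understandability proxy vs. raw capability

def training_step(x, y):
    optimizer.zero_grad()
    logits = model(x)
    task_loss = task_loss_fn(logits, y)  # the "capability" objective
    # Stand-in proxy: an L1 penalty on the weights, chosen only because it is
    # simple and cheap to compute, not because it is the right quantity.
    proxy = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + lam * proxy
    loss.backward()
    optimizer.step()
    return task_loss.item(), proxy.item()
```

The point of the sketch is just that the extra term is cheap to compute at training time, in contrast to evaluating human-understandability directly.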
Perhaps if an AGI is built out of modules that are separately trained, instead of being trained end-to-end, you could use this idea on some of the smaller modules that are especially important to safety. I’m curious if that’s the kind of plan you have in mind, or if you’re more ambitious about this approach.
I’m more ambitious, and fear that that might not work: either you train a bunch of ‘small’ things that do very concrete tasks and aren’t quite sure how to combine them to create an AGI (or you have to combine a huge number of them and hope that errors don’t cascade), or you train a few large ones that do big, complicated tasks and are themselves hard to interpret. That being said, the first branch would satisfy my desiderata for the approach, and I’d hope some people are working on it.