The paper says: “Across datasets, we find that each tree-regularized deep time-series model has predictions that agree with its corresponding decision tree proxy in about 85-90% of test examples.”
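To make concrete what that agreement figure measures, here is a minimal sketch of one way to compute it: distill a decision tree from a trained network’s predictions and count how often the two models agree on held-out data. The libraries, toy dataset, and hyperparameters below are my own illustrative assumptions, not the paper’s actual setup.

```python
# Illustrative sketch (not the paper's setup): measure how often a network's
# predictions agree with a decision-tree proxy distilled from those predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in for the deep time-series model.
net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# Decision-tree proxy: trained to mimic the network's labels, not the true labels.
proxy = DecisionTreeClassifier(max_depth=5, random_state=0)
proxy.fit(X_train, net.predict(X_train))

# Agreement = fraction of test examples on which network and proxy predict the same class.
agreement = np.mean(net.predict(X_test) == proxy.predict(X_test))
print(f"network/proxy agreement on the test set: {agreement:.1%}")
```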
How do we interpret what the neural network is doing the rest of the time?
Do we need to know that, or is 85-90% agreement with a decision tree proxy good enough? I guess it depends on how interpretability is going to be used to achieve overall safety. Do you have a view on that?
In the case of tree regularization, it’s hard to see how we could use it to benefit alignment. Luckily, I’m not endorsing tree regularization as a specific safety strategy.
Instead, I see it more like this: how can we encourage our models to be easy to inspect? In a way, it’s not clear that we need to actually extract the exact penalty-inducing algorithm (that is, the known algorithm used to generate the penalty term). We just want our model to be doing the kinds of things that make it easier, rather than harder, to peek inside and see what’s going on, so that if something starts going wrong, we know why.
The hope is that including a regularization penalty can allow us to create such a model without sacrificing performance too heavily.
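To make the regularization idea a bit more concrete, here is a rough sketch of the kind of penalty involved: the average decision-path length of a tree distilled from the model’s current predictions, scaled and added to the task loss. The paper makes this penalty differentiable by training a small surrogate network to predict it; that step is omitted here, and all names and parameters below are illustrative assumptions rather than the paper’s implementation.

```python
# Rough sketch of a tree-regularization-style penalty (illustrative only).
# The penalty is the average decision-path length of a tree distilled from
# the model's predictions; the paper approximates this with a learned
# surrogate so it can be backpropagated through, which is omitted here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def average_path_length(X, model_predictions):
    """Distill a tree from the model's predictions and return the mean number
    of nodes traversed per example -- a proxy for how hard the model is to
    explain with a simple decision tree."""
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X, model_predictions)
    node_indicator = tree.decision_path(X)  # which nodes each example visits
    return float(node_indicator.sum(axis=1).mean())


def regularized_loss(task_loss, X, model_predictions, strength=0.01):
    """Task loss plus a penalty that grows with the complexity of the
    distilled-tree explanation of the model's behaviour."""
    return task_loss + strength * average_path_length(X, model_predictions)


# Toy usage: predictions that follow a single threshold yield a short average
# path, so this (easier-to-inspect) behaviour is penalized less.
X = np.random.RandomState(0).randn(500, 4)
simple_predictions = (X[:, 0] > 0).astype(int)
print(regularized_loss(task_loss=0.35, X=X, model_predictions=simple_predictions))
```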
“We just want our model to be doing the kinds of things that make it easier, rather than harder, to peek inside and see what’s going on, so that if something starts going wrong, we know why.”
To generalize my question: what if something goes wrong, we peek inside, and we find that it’s one of the 10-15% of cases where the model doesn’t agree with the known algorithm used to generate the penalty term?
I interpreted your question differently than you probably intended. From my perspective, we are hoping for greater transparency as an end result, rather than treating the model as “similar enough” to some other algorithm and using that other algorithm to interpret it.
If I wanted to answer your generalized question within the context of comparing it to the known algorithm, I’d have to think for much longer. I don’t have a good response on hand.