The paper says: “Across datasets, we find that each tree-regularized deep time-series model has predictions that agree with its corresponding decision tree proxy in about 85-90% of test examples.”
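To make concrete what that agreement figure measures, here is a minimal sketch of one way to compute it: distill a decision tree from a trained network’s predictions and count how often the two models agree on held-out data. The libraries, toy dataset, and hyperparameters below are my own illustrative assumptions, not the paper’s actual setup.

```python
# Illustrative sketch (not the paper's setup): measure how often a network's
# predictions agree with a decision-tree proxy distilled from those predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stand-in for the deep time-series model.
net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# Decision-tree proxy: trained to mimic the network's labels, not the true labels.
proxy = DecisionTreeClassifier(max_depth=5, random_state=0)
proxy.fit(X_train, net.predict(X_train))

# Agreement = fraction of test examples on which network and proxy predict the same class.
agreement = np.mean(net.predict(X_test) == proxy.predict(X_test))
print(f"network/proxy agreement on the test set: {agreement:.1%}")
```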
How do we interpret what the neural network is doing the rest of the time?
Do we need to know that, or is 85-90% agreement with a decision tree proxy good enough? I guess it depends on how interpretability is going to be used to achieve overall safety. Do you have a view on that?
In the case of tree regularization, it’s hard to see how we could use it to benefit alignment. Luckily, I’m not endorsing tree regularization as a specific safety strategy.
Instead, I see it more like this: how can we encourage our models to be easy to inspect? In a way, it’s not clear that we need to actually extract the exact penalty-inducing algorithm (that is, the known algorithm used to generate the penalty term). We just want our model to be doing the kinds of things that make it easier, rather than harder, to peek inside and see what’s going on, so that if something starts going wrong, we know why.
The hope is that including a regularization penalty can allow us to create such a model without sacrificing performance too heavily.
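To make the regularization idea a bit more concrete, here is a rough sketch of the kind of penalty involved: the average decision-path length of a tree distilled from the model’s current predictions, scaled and added to the task loss. The paper makes this penalty differentiable by training a small surrogate network to predict it; that step is omitted here, and all names and parameters below are illustrative assumptions rather than the paper’s implementation.

```python
# Rough sketch of a tree-regularization-style penalty (illustrative only).
# The penalty is the average decision-path length of a tree distilled from
# the model's predictions; the paper approximates this with a learned
# surrogate so it can be backpropagated through, which is omitted here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def average_path_length(X, model_predictions):
    """Distill a tree from the model's predictions and return the mean number
    of nodes traversed per example -- a proxy for how hard the model is to
    explain with a simple decision tree."""
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X, model_predictions)
    node_indicator = tree.decision_path(X)  # which nodes each example visits
    return float(node_indicator.sum(axis=1).mean())


def regularized_loss(task_loss, X, model_predictions, strength=0.01):
    """Task loss plus a penalty that grows with the complexity of the
    distilled-tree explanation of the model's behaviour."""
    return task_loss + strength * average_path_length(X, model_predictions)


# Toy usage: predictions that follow a single threshold yield a short average
# path, so this (easier-to-inspect) behaviour is penalized less.
X = np.random.RandomState(0).randn(500, 4)
simple_predictions = (X[:, 0] > 0).astype(int)
print(regularized_loss(task_loss=0.35, X=X, model_predictions=simple_predictions))
```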
“We just want our model to be doing the kinds of things that make it easier, rather than harder, to peek inside and see what’s going on, so that if something starts going wrong, we know why.”
To generalize my question: what if something goes wrong, we peek inside, and we find that it’s one of the 10-15% of cases where the model doesn’t agree with the known algorithm used to generate the penalty term?
I interpreted your question differently than you probably intended. From my perspective, we are hoping for greater transparency as an end result, rather than treating the model as “similar enough” to some other algorithm and using that other algorithm to interpret it.
If I wanted to answer your generalized question within the context of comparing it to the known algorithm, I’d have to think for much longer. I don’t have a good response on hand.