If you were told that you could study the weight matrices of a neural network for one hour, and then after that hour you’d be given pen and paper and a bunch of testing data, I don’t think you’d be able to classify images well (unless the network was very small and so you could memorize the weights). By contrast, one can study MCTS for one hour and can use it to play games. That’s the relevant distinction, though I perhaps phrased it poorly.
I agree that this is similar to compressibility. I will have to think longer about whether they are similar enough for the distinction to be meaningless.
Ah, gotcha. I think this is a bit different to compressibility: if you formalise it as Kolmogorov complexity, then you can have a very compressible algorithm that in fact you can’t compress given time limits, because it’s too hard to figure out how to compress it. This seems more like ‘de facto compressibility’, which might be formalised using the speed prior or a variant.
Looking back on this, the relevant notion isn’t going to be those distributions, but just plain old Kt complexity: the minimum of len(p) + log(t) over programs p that compute the data in time t.
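For concreteness, here is a sketch of the Kt (Levin-style) definition I have in mind, with U a fixed universal machine; this is my gloss rather than a quotation from any particular source:

```latex
% Kt complexity of data x, relative to a fixed universal machine U:
% minimize program length plus the log of the running time.
K^{t}(x) \;=\; \min_{p} \,\bigl\{\, \mathrm{len}(p) + \log t \;:\; U(p) \text{ outputs } x \text{ within } t \text{ steps} \,\bigr\}
```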
I’m still not sure about the distinction. A human with an arbitrarily large amount of time & paper could “train” a new NN (instead of working to “extract a decision tree”), and then “use” that NN.
Unless the human could memorize the initialization parameters, they would be using a different neural network to classify. In other words, the training procedure might be simulatable, but not the actual neural networks.
Rather than thinking about simulatability in a Turing sense, ask whether it makes sense in an informal sense: would you know what to do in order to operate the algorithm on paper?
Why wouldn’t the “human-trained network” be identical to the “original network”? [EDIT: sorry, missed your point. If the human knows the logic of the random number generator that was used to initialize the parameters of the original network, they can manually run the same logic themselves.]
By the same logic, any decision tree that is too large for a human to memorize does not allow simulatability as defined in the OP.
Presumably the random seed is going to be big and complicated.
It also wouldn’t just be the random seed you’d need to memorize. You’d have to memorize all the training data and labels too. The scenario I gave was intended to be one where you were asked to study an algorithm for a while, and then afterwards operate it on a blank sheet of paper, given testing data. Even if you had the initialization seed, you wouldn’t just be able to spin up neural networks to make predictions...
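To make that concrete, here is a toy sketch (plain numpy; the function names and the stand-in "training loop" are made up for illustration, not any real architecture) of why the "just re-run it yourself" route needs both the seed and the full training data:

```python
# Reproducing the original network requires re-running the same initialization
# *and* the same training data in the same order; the seed alone is not enough.
import numpy as np

def init_weights(seed, shape=(4, 3)):
    # Same RNG logic + same seed => same initial parameters.
    return np.random.default_rng(seed).normal(size=shape)

def train(weights, data, labels, lr=0.1, epochs=10):
    # Stand-in "training loop": any deterministic update that depends on the data.
    w = weights.copy()
    for _ in range(epochs):
        preds = data @ w
        grad = data.T @ (preds - labels) / len(data)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
data, labels = rng.normal(size=(100, 4)), rng.normal(size=(100, 3))

original = train(init_weights(seed=42), data, labels)
retrained_same = train(init_weights(seed=42), data, labels)
retrained_other = train(init_weights(seed=7), data, labels)

print(np.allclose(original, retrained_same))   # True: same seed, same data
print(np.allclose(original, retrained_other))  # False: different initialization
```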
(just noting that the same goes for a decision tree that isn’t small enough for the human to memorize)
Yes, exactly. That’s why the authors of tree regularization put large penalties on large trees.
I have little to say except that I didn’t want people to take this definition too seriously and now it looks like I’ve done a bad job communicating what I meant. In my head I have a simple notion corresponding to “could I operate this algorithm on a piece of paper if I was asked to study it for a while beforehand.” This is different in my mind from asking whether it is possible to do so on a Turing machine.
But like, I could not operate a large decision tree on a piece of paper even if I had studied it for a while beforehand, because I wouldn’t remember all of the yes/no questions and their structure.
I could certainly build a decision tree given data, but I could also build a neural network given data.
(Well, actually I think large decision trees and neural nets are both uninterpretable, so I mostly do agree with your definition. I object to having this definition of interpretability and believing that decision trees are interpretable.)
Yes, I wrote in the main post that the authors of tree regularization penalized large trees. Certainly small decision trees are interpretable, and large trees much less so.
Oh, sure, but if you train on a complex task like image classification, you’re only going to get large decision trees (assuming you get decent accuracy), even with regularization.
(Also, why not just use the decision tree if it’s interpretable? Why bother with a neural net at all?)
The hope is that we can design a regularization scheme that gives us the performance of neural networks while also providing an understandable interpretation. Tree regularization does not achieve this, but it is one approach that uses regularization for interpretability.
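To illustrate the flavour of that idea, here is a minimal sketch of the post-hoc surrogate-tree version (not the actual tree-regularization penalty from the paper, which is added to the network’s training loss); it uses scikit-learn, and the specifics (layer sizes, depth 4, the synthetic dataset) are arbitrary choices for illustration:

```python
# Fit a depth-limited decision tree to a trained network's predictions and
# check how faithfully the tree mimics the network.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "black box" we would like to interpret.
net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# Small surrogate tree trained to imitate the network's labels, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0)
surrogate.fit(X_train, net.predict(X_train))

# "Fidelity": how often the simulatable tree agrees with the network it explains.
fidelity = np.mean(surrogate.predict(X_test) == net.predict(X_test))
print(f"surrogate fidelity to the network: {fidelity:.2f}")
```

The max_depth knob plays the role of the size penalty here: the smaller the surrogate tree, the more simulatable it is, usually at the cost of fidelity to the network.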