Very interesting. Can you say more about what sorts of things can be predicted by this theory of neural networks? What kind of dent do you think this knowledge can make in interpretability research?
The predictions laid out in the book are mostly about how to build a perceptron such that representation learning works well in practice and the generalisation error is minimised. For example:
When you train with (stochastic) gradient descent, you have to scale the learning rate differently for different layers, and differently again for weights and biases. The theory tells you specifically how to scale them, and how this depends on the activation function. If you don't do that, the theory predicts, among other things, that your network's performance will vary more from one instantiation to the next.
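To make the mechanism concrete, here is a minimal PyTorch sketch of giving each layer's weights and biases their own learning rate via optimiser parameter groups. The specific scaling factors are placeholders for illustration, not the book's actual prescription (which depends on width and the activation function):

```python
import torch
import torch.nn as nn

# Build a simple deep perceptron; the width/depth values are arbitrary.
width, depth = 512, 4
layers = [nn.Linear(width, width) for _ in range(depth)]
modules = []
for layer in layers:
    modules += [layer, nn.Tanh()]
model = nn.Sequential(*modules)

# One optimizer parameter group per (layer, parameter type), each with its own
# learning rate. This is the mechanism for layer- and parameter-dependent
# scalings; the 1/width factor on weights is an assumed placeholder, not the
# book's derived rule.
base_lr = 0.1
param_groups = []
for layer in layers:
    param_groups.append({"params": [layer.weight], "lr": base_lr / width})
    param_groups.append({"params": [layer.bias], "lr": base_lr})

optimizer = torch.optim.SGD(param_groups, lr=base_lr)
```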
The theory predicts that representation learning in deep perceptrons depends substantially on the depth-to-width ratio. For example, if the network is overly deep (depth comparable to, or greater than, width), you get strong coupling and chaotic behaviour. The width must also be large but finite for representation learning to work. The theory additionally gives an approximation of the optimal ratio.
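As a toy way to probe this empirically (not the book's derivation), one could compare how much a fixed output fluctuates across random instantiations for a shallow/wide versus a deep/narrow network. The configurations below are arbitrary, and this uses PyTorch's default initialisation rather than the critical initialisation the book prescribes, so the numbers are only indicative:

```python
import torch
import torch.nn as nn

def fresh_mlp(width, depth):
    """A tanh multilayer perceptron, freshly initialised on each call."""
    modules = []
    for _ in range(depth):
        modules += [nn.Linear(width, width), nn.Tanh()]
    return nn.Sequential(*modules)

@torch.no_grad()
def output_spread(width, depth, n_inits=50):
    """Std of one fixed output component across random instantiations."""
    x = torch.randn(1, width)
    outs = torch.stack([fresh_mlp(width, depth)(x)[0, 0] for _ in range(n_inits)])
    return outs.std().item()

# Two arbitrary configurations with very different depth-to-width ratios.
print("shallow/wide:", output_spread(width=256, depth=4))
print("deep/narrow :", output_spread(width=32, depth=64))
```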
Many more concrete, testable, numerical results like this are derived. The idea is that this is just the beginning, and a lot more could potentially be derived. You can use the theory to express any observable you might be interested in (any analytic combination of pre-activations anywhere in the network) and study its statistics.
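As a sketch of what "expressing an observable and studying its statistics" might look like in code, here is one way to pull out a single pre-activation with a forward hook and estimate its variance and excess kurtosis over random instantiations. The particular layer, sizes, and moments are arbitrary choices for illustration; finite-width effects show up in exactly this kind of higher-order statistic:

```python
import torch
import torch.nn as nn

width, depth = 128, 8
x = torch.randn(1, width)   # one fixed input
samples = []                # observable values, one per instantiation

@torch.no_grad()
def collect(n_inits=200, layer_index=4):
    for _ in range(n_inits):
        modules = []
        for _ in range(depth):
            modules += [nn.Linear(width, width), nn.Tanh()]
        model = nn.Sequential(*modules)
        # The observable: the first component of the pre-activation feeding
        # layer `layer_index`'s nonlinearity, i.e. that nn.Linear's output.
        linear = model[2 * layer_index]
        handle = linear.register_forward_hook(
            lambda mod, inp, out: samples.append(out[0, 0].item())
        )
        model(x)
        handle.remove()

collect()
z = torch.tensor(samples)
var = z.var().item()
excess_kurtosis = (((z - z.mean()) ** 4).mean() / var**2).item() - 3.0
print(f"variance={var:.4f}  excess kurtosis={excess_kurtosis:.4f}")
```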