Due to the need to iterate the $v$s until convergence, the predictive coding network had roughly a 100x greater computational cost than the backprop network.
The paper claims that predictive coding takes more compute. I agree that predictive coding ought to be more parallelizable, but if you are using a GPU, backpropagation is already sufficiently parallelizable. However, it may be that neuromorphic hardware could parallelize better than a GPU, producing an increase in effective compute that outstrips the 100x greater computational cost of the algorithm itself.
Kind of. Neuromorphics don’t buy you much for generic feedforward networks, but they dramatically reduce the cost of iterating to convergence. Since the 100x in this paper comes from iterating until the network converges, a neuromorphic implementation (say, on Loihi) would directly eliminate that cost.
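To make the source of that factor concrete, here is a minimal sketch of one predictive coding weight update for a tiny two-layer network, roughly in the Whittington & Bogacz style. The layer sizes, learning rates, and iteration count T are arbitrary illustration values, not anything from the paper. The overhead is the inner relaxation loop: every weight update pays for T settling steps, where backprop pays for one forward and one backward sweep.

```python
# Hypothetical minimal example (not the paper's code): one predictive-coding
# weight update for a toy 2-layer net, x0 (input, clamped) -> x1 (hidden, free)
# -> output (clamped to the target y).
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh
df = lambda a: 1.0 - np.tanh(a) ** 2        # derivative of tanh

d0, d1, d2 = 4, 8, 2                        # arbitrary layer sizes
W1 = rng.normal(scale=0.1, size=(d1, d0))   # predicts the hidden layer from the input
W2 = rng.normal(scale=0.1, size=(d2, d1))   # predicts the output from the hidden layer

def pc_update(x0, y, T=100, lr_x=0.1, lr_w=0.01):
    """One weight update: relax the hidden values for T steps, then local updates."""
    x1 = W1 @ f(x0)                         # start hidden values at the feedforward prediction
    for _ in range(T):                      # <-- the ~100x overhead is this inner loop
        e1 = x1 - W1 @ f(x0)                # prediction error at the hidden layer
        e2 = y - W2 @ f(x1)                 # prediction error at the clamped output layer
        # descend the energy F = (||e1||^2 + ||e2||^2) / 2 with respect to x1
        x1 += lr_x * (-e1 + df(x1) * (W2.T @ e2))
    e1 = x1 - W1 @ f(x0)
    e2 = y - W2 @ f(x1)
    # local, Hebbian-looking weight updates using the settled errors
    return lr_w * np.outer(e1, f(x0)), lr_w * np.outer(e2, f(x1))

x0, y = rng.normal(size=d0), rng.normal(size=d2)
dW1, dW2 = pc_update(x0, y)
W1 += dW1
W2 += dW2
```

On a GPU those T settling steps run serially, which is roughly where the 100x comes from; the hope with neuromorphic hardware is that the settling happens as near-free physical dynamics rather than as T sequential passes.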
Is it known whether predictive coding is easier to train than backprop? Local learning rules seem like they would be more parallelizable.
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine seems relevant to this discussion:

We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel.
Not particularly relevant, I think, but interesting nonetheless.
A first drawback of this paper is that its conclusion assumes that the NN underneath trains with gradient flow (GF), which is the continuous-time version of gradient descent (GD). This is a good assumption only if the learning rate is very small, so that the resulting GD dynamics closely track the GF differential equation.
This does not seem to be the case in practice: larger initial learning rates give better performance (https://arxiv.org/abs/1907.04595), so that is what people actually use. If the learning rates used in practice were well-approximated by GF, then smaller learning rates would give the same results, and they don’t. There is another differential equation that does seem to approximate GD fairly well (http://ai.stanford.edu/blog/neural-mechanics/), but I don’t know whether the math from the paper still works out for it.
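(A toy example of the gap: for the quadratic loss $L(w) = \lambda w^2 / 2$, GF gives $w(t) = w_0 e^{-\lambda t}$, while GD with step size $\eta$ gives $w_k = w_0 (1 - \eta \lambda)^k$. The two agree when $\eta \lambda \ll 1$, but for $\eta \lambda > 1$ GD oscillates and for $\eta \lambda > 2$ it diverges, neither of which GF can ever do.)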
Second, as the paper points out, the kernel machine learned by GD is a bit strange in that the coefficients $a_i$ weighting the different $K(x, x_i)$ depend on $x$. Thus, the resulting output function is not in the reproducing kernel Hilbert space of the kernel that is purported to describe the NN, so as kernel machines go it’s pretty weird. I expect that a lot of the analysis of the output of the learning process (learning theory, etc.) assumes that the $a_i$ do not depend on the test input $x$.
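For concreteness, here is the shape of the derivation as I understand it (notation mine, not necessarily the paper’s); it shows both where GF enters and why the $a_i$ pick up a dependence on $x$. Under gradient flow $\dot{w} = -\nabla_w \sum_j L(y_j^*, y_j)$, the chain rule gives

$$\frac{d\,y(x)}{dt} = \nabla_w y(x) \cdot \dot{w} = -\sum_i L'(y_i^*, y_i)\, \nabla_w y(x) \cdot \nabla_w y(x_i),$$

where $L'$ is the derivative of the loss with respect to the model’s output on $x_i$. Integrating over training,

$$y(x) = y_0(x) - \sum_i \int_0^T L'(y_i^*, y_i)\, K_t(x, x_i)\, dt, \qquad K_t(x, x') = \nabla_w y(x) \cdot \nabla_w y(x'),$$

with $K_t$ the tangent kernel at training time $t$. To write this as $y_0(x) + \sum_i a_i K^p(x, x_i)$ with the path kernel $K^p(x, x_i) = \int_0^T K_t(x, x_i)\, dt$, each $a_i$ has to be a $K_t(x, x_i)$-weighted average of $-L'$ along the trajectory, and that weighting is exactly where the test point $x$ sneaks in.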
Good point!
Do you know of any work that applies similar methods to study the equivalent kernel machine learned by predictive coding?
I don’t, and my best guess is that nobody has done it :) The paper you linked is extremely recent.
You’d have to start by finding an ODE model of predictive coding, which I suppose is possible by taking the limit of the learning rate to 0.
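My guess, and I haven’t seen this written down anywhere, is that if the inference phase fully converges between weight updates (so $\partial F / \partial x_l = 0$ at every hidden layer, with $F$ the usual sum-of-squared-prediction-errors energy), the small-learning-rate limit of the weight dynamics would just be gradient flow on the energy at that equilibrium, $\dot{W}_l = -\partial F^* / \partial W_l$, and one could then try to rerun the path-kernel argument with $F^*$ in place of the loss. Whether that yields (approximately) the same kernel as the GD/GF one is then the question.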