Consider the following two theories about why deep learning is currently the dominant paradigm of AI research:
Theory #1: Deep learning methods are actually superior to other approaches in fundamental ways. Possibly this is because they are closer to the real structure of human brains. Perhaps it is because they can be used to build complex models, and complex models are necessary to describe a complex world. Deep learning is the result of an exploratory process in AI, which after long deliberation picked machine learning as the right family of methods, and deep learning as the right species within that family.
Theory #2: Deep learning is not superior to other approaches in any fundamental way. Instead, the apparent spectacular success of deep learning comes from the fact that the colloquial version of Moore’s Law (computer speeds double every 18 months) broke down about 10 years ago. Rapidly increasing computational power is now only available if one uses the GPU, and DL neural network algorithms are well-suited to run on the GPU. Therefore, of all the possible approaches to AI, DL is the only approach that can take advantage of increases in computer speeds over the last 10-15 years.
Which theory is more plausible? Different opinions on this question could lead to very different predictions for AI timelines. Theory #1 is “optimistic”, in the sense that it implies the AI field has made substantial progress by finding the specific family of techniques that is going to become really powerful. Theory #2 is “pessimistic”, in the sense that it implies the field has misattributed the apparent success of DL and has therefore been led astray.
I see 2 as a special case of 1. No computational model is inherently superior to another; one is just better suited to the data, computational power, and needs we have at hand.
I am not an ML research scientist, but I have studied and used machine learning, and I find it very interesting.
As far as I understand, deep learning can discover and fit essentially any nonlinear dynamics present in a set of data, provided the model is properly trained, regularized, and cross-validated to prune away overfitting. If we accept the view that reality is just a huge set of nonlinear equations and information, and that NNs/DL can discover these at any level of granularity, then it is a reasonable prediction that they are well positioned to be the best.
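To make that concrete, here is a minimal sketch of the claim (my own toy illustration, not anything from the literature; the target function, architecture, and hyperparameters are arbitrary choices): a small network recovers a nonlinear relationship it was never told about, and a held-out validation set is what tells you whether regularization kept overfitting in check.

```python
# Toy sketch: a small neural network fitting an unknown nonlinear function,
# with a validation split used to check for overfitting. All choices here
# (function, architecture, regularization strength) are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
# Nonlinear "ground truth" the network is never told about, plus noise.
y = np.sin(X[:, 0]) * np.exp(-X[:, 1] ** 2) + 0.1 * rng.normal(size=2000)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# alpha is L2 regularization; early_stopping holds out part of the training
# data and stops before the network starts memorizing noise.
model = MLPRegressor(hidden_layer_sizes=(64, 64), alpha=1e-3,
                     early_stopping=True, max_iter=2000, random_state=0)
model.fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))
print("val   R^2:", model.score(X_val, y_val))  # similar to train => not badly overfit
```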
I’m not confident in this next point and would love some feedback, so read it with skepticism: as I understand it, DL works so well because it combines filtering and tractability within the model structure, and by varying the layering, neurons, and optimization techniques it opens up a greater set of potential models than many other model classes do. If so, the fact that it runs so well on GPUs is not a lucky accident but an intrinsic feature of the mathematical structure of the model. Perhaps that is also why we evolved to use NN-type structures in our brains, due to their tractability and parallel information-processing abilities?
To use an example from my own research: in financial econometrics and asset pricing we often use the Kalman filter to extract unobserved states of the world from systems of stochastic PDEs. Optimizing those models once they have more than roughly ten parameters is a real hassle. It requires a lot of optimization black magic, a quasi-scientific process where, over months, you run different optimization algorithms on the whole model, then on single parameters, then on the whole model again, and if some parameter looks ‘weird’ you manually set it to what you ‘think’ it should be.
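For readers who haven’t met it, here is a minimal linear Kalman filter in NumPy (my own toy sketch; the matrices are placeholders, not any particular asset-pricing model). The recursion itself is simple; the painful part in practice is estimating the parameters hiding inside F, H, Q, and R, which is where the black magic above comes in.

```python
# Minimal linear Kalman filter: predict/update recursion over observations.
# Illustrative only; real asset-pricing applications build F, H, Q, R from a
# model whose parameters must themselves be estimated (the hard part).
import numpy as np

def kalman_filter(zs, F, H, Q, R, x0, P0):
    """Return the filtered state estimates for a sequence of observations zs."""
    x, P = x0, P0
    I = np.eye(len(x0))
    estimates = []
    for z in zs:
        # Predict: propagate the state and its uncertainty forward in time.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update: blend the prediction with the new observation.
        S = H @ P @ H.T + R               # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
        x = x + K @ (z - H @ x)
        P = (I - K @ H) @ P
        estimates.append(x.copy())
    return np.array(estimates)

# Toy usage: a noisy random walk observed directly.
F = H = np.array([[1.0]])
Q, R = np.array([[0.01]]), np.array([[0.25]])
truth = np.cumsum(0.1 * np.random.randn(200))
obs = (truth + 0.5 * np.random.randn(200)).reshape(-1, 1)
filtered = kalman_filter(obs, F, H, Q, R, x0=np.zeros(1), P0=np.eye(1))
```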
Basically, a neural network could learn the dynamics here (without revealing them to us) and provide a potentially comparable forecast (I haven’t tested this). It could probably also do so much faster than our optimization method. The forecast would be less useful to a human analyst, because without the model dynamics made explicit it is much harder to run simulations and study specific parameters. But it is within the realm of reason to predict that a computer, were it self-aware, would be able to understand how the parameters of the NN/DL model itself work.
For this reason I think the natural structure of the model makes #2 somewhat true, but in a way that makes it a special case of #1 (as MrMind pointed out before me).
Again, I’m not an ML research scientist, so if I’ve totally messed something up I’d love to know what and why.
I can’t comment usefully on everything you wrote, so I’ll just say a couple of things.
First, don’t be too credulous: the field of AI has been surrounded and plagued by hype since its inception, and the current era isn’t much different. Researchers have every incentive to encourage the hype.
Second, it’s interesting that you bring up the Kalman filter, because it makes a nice contrast to DNNs. The Kalman filter is actually rather nice aesthetically; it has a pleasing mathematical elegance, and people who use it know more or less the limits of its applicability. When I’m reading DNN papers, I feel like the whole field has given up on the notion of aesthetics and wholeheartedly embraced architecture hacking as a methodology.
Third, I think you’ll find that DNNs are much, much harder to use than you imagine or expect. The problem is that all DNN research relies on architecture hacking: write down a network, train it up, look at the result, then tweak the architecture and repeat. There is very little, embarrassingly little, theory behind it all. The phrase “we have found” is prominent in DNN papers, meaning “we tweaked the network a bunch of times in various ways and found that this trick worked the best.” Furthermore, each cycle of code/test/tweak takes a really long time, since DNN training, almost by definition, is very time-consuming.
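Schematically, the loop looks something like this (a toy caricature of the workflow, not anyone’s actual research code; the data and candidate architectures are made up): try variants, keep whatever validates best, and report it as a finding.

```python
# Caricature of the code/test/tweak loop: sweep architectures, keep the one
# that happens to validate best. Toy data; at research scale each fit is slow.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

candidate_architectures = [(32,), (64,), (64, 64), (128, 64, 32)]
best_score, best_arch = -np.inf, None
for arch in candidate_architectures:
    model = MLPRegressor(hidden_layer_sizes=arch, early_stopping=True,
                         max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_arch = score, arch

print("we have found that", best_arch, "works best:", round(best_score, 3))
```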
To address your third point first: I’m sure you’re right. I have only played around with simple NNs, and I shouldn’t have spoken so freely about how easy it would be to estimate a more complex one when I don’t know much about it.
As a follow-up question to your second point: the Kalman filter is a very aesthetically pleasing model, I agree. Something I wonder about, but have no real idea on, is whether there are mathematical concepts similar to the Kalman filter (in terms of aesthetics and usefulness) that are entirely beyond the understanding of the human brain. So, hypothetically, if we engineered humans with IQ 200+ (or whatever), they would uncover things like the Kalman filter that normal humans couldn’t grasp.
If that’s true, does it stand to reason that we could still use those models via a sufficiently well optimized and well built DNN, even though we would never understand what’s going on inside the network?
I often think of self-driving cars as learning the dynamic interactions of a set of nonlinear equations that are beyond the scope of a human to ever derive.
I realize some of my questions might be too vague or pseudo-philosophical to be answerable.
PS: I did a little internet sleuthing and have read the first ~12 pages of your book so far. It is very interesting and similar to how I think about the world (yours is much better developed). I am also incredibly interested in empirical philosophy of science, and I read, write, and think about it a ton.