1) That the same kind of organization is optimal both for computation implemented in biological cells and for computation implemented in a conventional digital computer
2) That the human brain has actually evolved to employ a close-to-optimal organization of the data
1) seems to me likely to be untrue in literal form, but could possibly be avoided by just building a system that wasn’t necessarily totally digital-optimal. 2) probably depends on the domain—e.g. Körding 2007 mentions that
We have some evidence for situations in which 1.) is true. In vision, for example, V1 learns a decomposition of the image into Gabor filters. Likewise, most hierarchical machine learning vision systems also learn a first stage of Gabor-like filters when fed natural image data.
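For concreteness, here is a minimal numpy sketch of a 2D Gabor kernel, the oriented, localized, band-pass filter shape that V1 simple-cell receptive fields and the first-layer filters of vision models trained on natural images both tend to resemble. The function and parameter names are illustrative, not taken from any particular library.

```python
import numpy as np

def gabor_kernel(size=15, wavelength=6.0, theta=0.0, sigma=3.0, phase=0.0):
    """2D Gabor filter: a sinusoidal grating windowed by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinate frame to the filter's orientation theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + y_t**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + phase)
    return envelope * carrier

# A tiny bank of orientations, standing in for a learned first-layer filter bank.
bank = [gabor_kernel(theta=t) for t in np.linspace(0.0, np.pi, 4, endpoint=False)]
```

Sweeping theta and wavelength gives the kind of oriented filter bank that both V1 and learned first layers end up resembling.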
Regarding 2.), exact optimality matters less than optimality relative to the computational power applied to inference.
But on the other hand, I would expect the brain to use suboptimal representations for evolutionarily recent tasks, such as doing abstract mathematics.
This assumes that task-specific representations are hardwired in by evolution, which is mostly true only for the old brain. The cortex (along with the cerebellum) is essentially the biological equivalent of a large machine learning coprocessor, and at birth it has random connections, much like a modern ML system such as an ANN. It appears that the cortex uses the same general learning algorithms to learn everything from vision to physics. This is the ‘one learning algorithm’ hypothesis, and it has much support at this point. At a high level we know that it should be true—after all, we know that the stronger forms of Bayesian inference can learn anything there is to learn, and the success of modern variants of SGD—which can be seen as a scalable approximation of Bayesian inference—provides further support. (The brain probably uses something even better than modern SGD, and we are getting closer to matching its inference algorithms; many researchers are trying to find the next approximate inference algorithm beyond SGD.)
If we can do that then we can set up the AI to have a speech monologue that is always active, but is only routed to the external output based on an internal binary switch variable the AI controls (similar to humans). And then voilà, we can have the AI’s stream of thoughts logged to a text file.
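As a purely hypothetical sketch of that routing (the class and file name here are made up for illustration): the inner speech stream is generated and logged on every step, and a binary switch controlled by the agent decides whether it is also routed to the external speech output.

```python
from typing import Optional

class MonologueGate:
    """Toy illustration of the gating idea: the full stream of thought is
    always logged; the agent's binary switch decides what is spoken aloud."""

    def __init__(self, log_path: str = "monologue.log"):
        self.log_path = log_path  # hypothetical log file for the monologue

    def step(self, inner_speech: str, speak_externally: bool) -> Optional[str]:
        # Log the monologue unconditionally.
        with open(self.log_path, "a") as f:
            f.write(inner_speech + "\n")
        # Route it to the external speech output only when the switch is on.
        return inner_speech if speak_externally else None
```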
That’s a very interesting idea. One challenge that comes to mind is that since the AI’s internal world-model would be constantly changing, you might need to constantly re-train the language network to understand what the changed concepts correspond to. But since you wouldn’t know for sure what the new concepts corresponded to, you wouldn’t have a fully reliable training set to re-train it with. Still, you might be able to pull it off anyway.
Yes—that is a specific instance of the general training dependency problem. One general solution is to train everything together. In this specific case, I imagine that once the language/speech output module is hooked up, we can then begin training it online with the rest of the system using whatever RL-type criterion we are using to train the other motor output modules. So in essence the AI will learn to improve its language output capability so as to better communicate with humans, and this improvement will naturally co-adapt to internal model changes.
This assumes that task-specific representations are hardwired in by evolution, which is mostly true only for the old brain. The cortex (along with the cerebellum) is essentially the biological equivalent of a large machine learning coprocessor, and at birth it has random connections, much like a modern ML system such as an ANN. It appears that the cortex uses the same general learning algorithms to learn everything from vision to physics. This is the ‘one learning algorithm’ hypothesis, and it has much support at this point.
I agree that there seems to be good evidence for the ‘one learning algorithm’ hypothesis… but there also seems to be reasonable evidence for modules that are specialized for particular tasks that were evolutionarily useful; the most obvious example would be the extent to which we seem to have specialized reasoning capacity for modeling and interacting with other people, capacity which is to varying extent impaired in people on the autistic spectrum.
Even if one does assume that the cortex used the same learning algorithms for literally everything, one would still expect the parameters and properties of those algorithms to be at least partially genetically tuned towards the kinds of learning tasks that were most useful in the EEA (though of course the environment should be expected to carry out further tuning of the said parameters). I don’t think that the brain learning everything using the same algorithms would disprove the notion that there could exist alternative algorithms better optimized for learning e.g. abstract mathematics, and which could also employ a representation that was better optimized for abstract math, at the cost of being worse at more general learning of the type most useful in the EEA.
I agree that there seems to be good evidence for the ‘one learning algorithm’ hypothesis… but there also seems to be reasonable evidence for modules that are specialized for particular tasks that were evolutionarily useful
The paper you linked to is long-winded. I jumped to the section titled “Do Modules Require Their Own Genes?”. I skimmed a bit and concluded that the authors were missing huge tracts of key recent knowledge from computational and developmental neuroscience and machine learning, and as a result they are fumbling in the dark.
the most obvious example would be the extent to which we seem to have specialized reasoning capacity for modeling and interacting with other people, capacity which is to varying extent impaired in people on the autistic spectrum.
Learning will automatically develop any number of specialized capabilities just as a natural organic process of interacting with the environment. Machine learning provides us with concrete specific knowledge of how this process actually works. The simplest explanation for autism inevitably involves disruptions to learning machinery, not disruptions to preconfigured “people interaction modules”.
Again to reiterate—obviously there are preconfigured modules—it is just that they necessarily form a tiny portion of the total circuitry.
Even if one does assume that the cortex used the same learning algorithms for literally everything, one would still expect the parameters and properties of those algorithms to be at least partially genetically tuned towards the kinds of learning tasks that were most useful in the EEA (though of course the environment should be expected to carry out further tuning of the said parameters).
Perhaps, perhaps not. Certainly genetics specifies a prior over model space. You can think of evolution as wanting to specify as much as it can, but with only a tiny amount of code. So it specifies the brain in a sort of ultra-compressed hierarchical fashion. The rough number of main modules, neuron counts per module, and gross module connectivity are roughly pre-specified, and then within each module there are just a few types of macrocircuits, each of which is composed of a few types of repeating microcircuits, and so on.
Using machine learning as an analogy: to solve a specific problem we typically come up with a general architecture that forms a prior over model space which we believe is well adapted to the problem. Then we use a standard optimization engine—like SGD—to handle the inference/learning given that model. The learning algorithms are very general-purpose and cross-domain.
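A rough sketch of that division of labor, under my own toy assumptions rather than any particular framework: the ‘prior over model space’ is just the choice of parametric form, while the optimizer is a single generic routine reused unchanged across problems.

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.1, epochs=200):
    """One general-purpose optimizer, reused unchanged across models and domains."""
    for _ in range(epochs):
        for x, y in data:
            grads = grad_fn(params, x, y)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

# "Prior over model space" no. 1: a linear model for one toy regression task.
def linear_grads(params, x, y):
    w, b = params
    err = (w * x + b) - y
    return [err * x, err]

# "Prior" no. 2: a different architecture (quadratic features) for a different
# task; the optimizer itself is completely unchanged.
def quadratic_grads(params, x, y):
    a, b, c = params
    err = (a * x**2 + b * x + c) - y
    return [err * x**2, err * x, err]

xs = np.linspace(-1.0, 1.0, 20)
print(sgd([0.0, 0.0], linear_grads, [(x, 3.0 * x + 1.0) for x in xs]))         # ≈ [3, 1]
print(sgd([0.0, 0.0, 0.0], quadratic_grads, [(x, 2.0 * x**2 - x) for x in xs]))  # ≈ [2, -1, 0]
```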
I don’t think that the brain learning everything using the same algorithms would disprove the notion that there could exist alternative algorithms better optimized for learning e.g. abstract mathematics, and which could also employ a representation that was better optimized for abstract math, at the cost of being worse at more general learning of the type most useful in the EEA.
The distinction between the ‘model prior’ and the ‘learning algorithm’ is not always so clear-cut, and some interesting successes in the field of metalearning suggest that there indeed exist highly effective specialized learning algorithms for at least some domains.
one would still expect the parameters and properties of those algorithms to be at least partially genetically tuned towards the kinds of learning tasks that were most useful in the EEA
Compare jacob_cannell’s earlier point that
obviously for any set of optimization criteria, constraints (including computational), and dataset there naturally can only ever be a single optimal solution (emphasis added)
Do we know or can we reasonably infer what those optimization criteria were like, so that we can implement them into our AI? If not, how likely and by how much would we expect the optimal solution to change?
At a high level we know that it should be true—after all, we know that the stronger forms of Bayesian inference can learn anything there is to learn, and the success of modern variants of SGD—which can be seen as a scalable approximation of Bayesian inference—provides further support. (The brain probably uses something even better than modern SGD, and we are getting closer to matching its inference algorithms; many researchers are trying to find the next approximate inference algorithm beyond SGD.)
I really agree with your general point, but this isn’t correct. Bayesian inference can only learn something so long as the model specified is correct. I know this is kind of pedantic, but it’s important to keep in mind.
E.g. there are some functions a simple Bayesian network won’t be able to model well without exponentially many parameters and training examples. Because of the no-free-lunch theorem, all models have weaknesses in some cases.
Of course, some people might say the no-free-lunch theorem is useless, since we can assume real-world problems are drawn from some distribution over simple computable models in some Turing-complete language. However, this doesn’t really help us, since we can’t do efficient inference on anything remotely like a Turing-complete language, and so must use much more restricted models.
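To make the first point concrete, parity is the standard example: under a naive Bayes factorization (one of the simplest Bayesian networks), each individual bit is statistically independent of the parity label, so the model learns nothing, whereas representing the function exactly requires a factor over all n bits, i.e. 2^n entries. A quick numpy check of the independence claim:

```python
import itertools
import numpy as np

n = 3
X = np.array(list(itertools.product([0, 1], repeat=n)))
y = X.sum(axis=1) % 2  # parity label

for c in (0, 1):
    # Naive Bayes parameters P(bit_i = 1 | class = c): all exactly 0.5,
    # so the factored model carries zero information about parity.
    probs = X[y == c].mean(axis=0)
    print(f"class {c}: per-bit P(bit = 1 | class) = {probs}")
```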
SGD is not an approximation of Bayesian inference. It has nothing to do with Bayesianism. It’s just a general optimization algorithm which is useful for fitting models to data.
And I doubt the brain uses anything better than SGD. I would be very surprised if it’s even half as efficient as SGD. The reason is that computers are numerically accurate through many layers and many timesteps, while the brain is extremely noisy and can’t do global algorithms like that. Additionally, computers can iterate through a dataset many times and fine-tune every parameter, while brains only get to see things once.
However, that’s fine, since SGD seems to be more than enough for modern NNs. Inventing a 10x more efficient optimization algorithm would just mean you could train the nets slightly faster. But training time isn’t the limiting factor for the most part.
SGD is not an approximation of Bayesian inference. It has nothing to do with Bayesianism. It’s just a general optimization algorithm which is useful for fitting models to data.
How well-read are you in machine learning? 10s of papers? 100s? 1000s? PhD level? This and your other comment about IRL suggest that you have only cursory knowledge of the field. Also ‘Bayesianism’ isn’t a thing, outside of LW.
Bayesian inference is also just an “algorithm which is useful for fitting models to data.”
Inference problems can be turned into optimization problems and vice versa. In particular, the single MLE estimate from a full exhaustive inference over some data set conditioned on some observable is exactly equivalent to a global optimization problem solved with exhaustive search.
Exhaustive methods have exponential-order costs, so the first obvious large improvement is to approximate the full joint distribution by a factored graphical model, such as a factor graph. For real-valued variables, tracking full distributions is still quite expensive, so the next level of approximation/optimization is to use simple analytic distributions such as Gaussians. Another useful approximation then is to use some incremental sampling algorithm.
SGD-type algorithms are equivalent to approximate MLE inference where only the mean of each variable is tracked, update messages are swept through the data in a simple fashion, and the variance is related to the learning rate.
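A minimal worked example of that correspondence, using the simplest model I could pick (estimating the mean of a unit-variance Gaussian, my simplifying assumption): SGD on the negative log-likelihood with a 1/t step size reproduces the running sample mean, i.e. the exact MLE, and the 1/t schedule mirrors the posterior variance sigma^2/t under a flat prior, which is the ‘variance is related to the learning rate’ point.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)

# Closed-form MLE of a unit-variance Gaussian's mean: the sample mean.
mle = data.mean()

# The same estimate via SGD on the negative log-likelihood, tracking only a
# point value for the parameter. A 1/t step size makes the update the exact
# running mean; 1/t is also the posterior variance sigma^2/t under a flat
# prior, which is the learning-rate/variance correspondence mentioned above.
mu = 0.0
for t, x in enumerate(data, start=1):
    grad = mu - x              # d/dmu of 0.5 * (x - mu)^2
    mu -= (1.0 / t) * grad

print(mle, mu)  # identical up to floating-point error
```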
I apologize if my comment came off as rude. I certainly didn’t mean to assert any kind of authority over this. I am just a hobbyist, and some minor points you made bothered me. Mainly the comment about the limiting factor of NNs being the optimization algorithm they use, or that the brain uses something far better. The points about Bayesian inference were just tangential.
I didn’t mean Bayesianism the philosophy, just Bayesian methods.
Bayesian inference is also just an “algorithm which is useful for fitting models to data.”
Yes, but it’s not an optimization algorithm. Optimization algorithms are more general than statistics. You can use an optimization algorithm to find the optimal parameters for an airplane wing or the shortest path between several cities.
Conversely, Bayesian inference doesn’t specify how the parameters should be optimized, just that you should somehow weigh every possibility according to its probability.
I am not saying that they aren’t related at all, just that it’s worth distinguishing them as qualitatively different concepts, where you seem to use them interchangeably.
…some minor points you made bothered me. Mainly the comment about the limiting factor of NNs being the optimization algorithm they use, or that the brain uses something far better.
I didn’t say SGD is the main limiting factor of ANNs, or that the brain uses something far better. I said “the brain probably uses something even better than modern SGD…”
Modern SGD methods, especially with automatic learning-rate tuning and the new normalization schemes (which, btw, relate directly to better variance/uncertainty models in statistical inference methods), are pretty powerful, but they still learn somewhat slowly, requiring numerous passes through the data to reach a good solution.
I don’t have time to dig deep into how the brain may use techniques better than SGD… but as a single simple example of one thing it does better: current SGD ANN training computes the same update steps at the same high cost across the entire network for every training example, even though examples vary vastly in their novelty/difficulty/utility for learning. The brain appears to be much better about managing its limited resources.
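One crude ML-style analogue of that kind of resource management, offered as a hypothetical sketch rather than a claim about the brain or any existing training library: check the per-example loss cheaply and skip the expensive update when the example is already handled well.

```python
import numpy as np

def prioritized_sgd(w, grad_and_loss, data, lr=0.05, skip_below=1e-4, epochs=10):
    """Hypothetical sketch: spend the expensive update only on examples that
    are still informative (high current loss); skip the rest."""
    for _ in range(epochs):
        for x, y in data:
            g, loss = grad_and_loss(w, x, y)
            if loss < skip_below:   # cheap check; this example teaches us ~nothing
                continue
            w = w - lr * g
    return w

# Toy usage: fit y = 3x with a single scalar weight.
def grad_and_loss(w, x, y):
    err = w * x - y
    return err * x, 0.5 * err**2

data = [(x, 3.0 * x) for x in np.linspace(-1, 1, 50)]
print(prioritized_sgd(0.0, grad_and_loss, data))  # ≈ 3.0
```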
I am not saying that [inference and optimization] aren’t related at all, just that it’s worth distinguishing them as qualitatively different concepts, where you seem to use them interchangeably.
They are largely interchangeable in machine learning in the sense that you can use optimization techniques (SGD) or inference techniques (expectation propagation, expectation backpropagation, MCMC, etc) to train a model (such as an ANN).
Much of the ‘wisdom’ or deep insightful knowledge in a particular field consists of learning all the structural relations and symmetries between different algorithms/techniques which enable internal mental compression of all of the raw low level knowledge: learning which techniques are generalizations, specializations, approximations, or restricted transformations of others. In the beginning, everything looks disconnected and compartmentalized, but eventually one sees how everything is connected.
General optimization can be used to implement inference, and vice versa. You can recast optimization as an inference problem: the initial settings/constraints become a prior, the utility/loss function is converted into a probability measure, learning rates relate to variance/precision, etc. See survey papers such as “Representation Learning”, or look into the use of Bayesian methods in machine learning (as replacements for optimization methods) to get some perspective on how they all relate.
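For one concrete instance of ‘the loss function becomes a probability measure and the constraints become a prior’: an L2-penalized squared loss is the negative log of a Gaussian likelihood times a Gaussian prior, so the optimizer’s answer and the Bayesian MAP/posterior-mean answer coincide. A small sketch under those assumed unit-variance conditions:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=1.5, scale=1.0, size=100)
lam = 4.0  # weight of the L2 penalty

# Optimization view: gradient descent on 0.5*sum((theta - y_i)^2) + 0.5*lam*theta^2.
theta = 0.0
for _ in range(2000):
    grad = (len(y) * theta - y.sum()) + lam * theta
    theta -= 1e-3 * grad

# Inference view: the L2 penalty is the negative log of a N(0, 1/lam) prior and the
# squared loss is the negative log of a unit-variance Gaussian likelihood, so the
# minimizer is the MAP estimate, which here equals the posterior mean.
posterior_mean = y.sum() / (len(y) + lam)

print(theta, posterior_mean)  # agree to many decimal places
```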