That’s not actually what I meant when I said we have the math figured out. The math behind general learning is just general Bayesian inference in its various forms. The difficulty is not so much in the math as in scaling it up efficiently.
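To be concrete about what I mean by “general Bayesian inference”: nothing more exotic than posterior updating plus posterior-predictive averaging over hypotheses (the standard textbook form, not tied to any particular algorithm):

```latex
P(H_i \mid D) \;=\; \frac{P(D \mid H_i)\,P(H_i)}{\sum_j P(D \mid H_j)\,P(H_j)},
\qquad
P(x_{t+1} \mid x_{1:t}) \;=\; \sum_i P(x_{t+1} \mid H_i, x_{1:t})\,P(H_i \mid x_{1:t})
```

Everything else, variational methods, MCMC and so on, is about approximating these sums efficiently.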
Ok, but then this is even more vague. At least neural networks are a coherent class of algorithms, with lots of architectural variations and hyperparameters to tune, but still functionally similar. General Bayesian inference, on the other hand, is a broad framework with dozens of types of algorithms for different tasks, based on different assumptions and with different functional structure.
You could just as well say that once we had formulated the theory of universal computation and had the first digital computers up and running, we had all the math figured out and it was just a matter of scaling things up. This was probably the sentiment at the famous Dartmouth conference in 1956, where they predicted that ten smart people brainstorming for two months could make significant advances on multiple fundamental AI problems. I think we know better now.
Regarding unsupervised learning—I wholeheartedly agree. However, one should also keep in mind that UL and SL are just minor variations on the same theme in a Bayesian framework. If you have accurate labeled data, you might as well use it.
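In symbols, it is just a matter of which factors of the joint you choose to model (the standard decomposition, nothing model-specific):

```latex
p(x, y \mid \theta) \;=\; p(x \mid \theta)\, p(y \mid x, \theta)
```

UL targets the joint (or just p(x | θ)); SL targets only the conditional factor p(y | x, θ).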
Supervised learning may be a special case of unsupervised learning, but not the other way round. Currently we can only do supervised learning well, at least when big data is available. There have been attempts to reduce unsupervised learning to supervised learning, which have had some practical success in textual NLP (with neural language models and word vectors) but not in other domains such as vision and speech.
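To make “reduce unsupervised learning to supervised learning” concrete, here is the trick in its simplest form, as a toy sketch of my own rather than any specific published system: unlabeled text becomes (input, target) pairs by predicting the next word, and an ordinary supervised learner is trained on those pairs.

```python
# Toy illustration (mine, not any specific published system) of the
# UL-to-SL reduction used in NLP: raw text is turned into (input, target)
# pairs by predicting the next word, so an ordinary supervised classifier
# ends up learning an unsupervised model of the text.
corpus = "the cat sat on the mat the dog sat on the rug".split()
pairs = [(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)]
print(pairs[:4])
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the')]
# Any supervised learner trained on these pairs is implicitly a model of
# p(next word | previous word), i.e. an unsupervised model of the corpus.
```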
The paper I linked, IMHO, may shed some light on why this happened: one of the most popular evaluation measures and training objectives, the negative log-likelihood (aka empirical cross-entropy), which captures well our intuition of what a good model must do in binary (or low-dimensional) classification tasks, may break down in the high-dimensional regime typical of some unsupervised tasks such as sampling.
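A rough back-of-the-envelope illustration of the scale issue (my own toy numbers, not the paper’s argument): a per-dimension gap in log-likelihood that would be invisible in a binary classification task turns into an astronomical likelihood ratio once it is multiplied by thousands of dimensions.

```python
# Back-of-the-envelope sketch (my own numbers): how a tiny per-dimension
# NLL gap compounds in high dimensions.
import math

def total_nll(per_dim_nll_nats, num_dims):
    """Total NLL of one sample, assuming (for the sake of the sketch)
    independent dimensions with equal per-dimension NLL."""
    return per_dim_nll_nats * num_dims

good_model = 0.693   # ~1 bit per dimension (hypothetical "good" model)
worse_model = 0.703  # only 0.01 nats worse per dimension

for dims in (1, 10, 3072):          # 3072 = a 32x32 RGB image
    gap = total_nll(worse_model, dims) - total_nll(good_model, dims)
    print(f"{dims:5d} dims: NLL gap = {gap:6.2f} nats, "
          f"likelihood ratio ~ {math.exp(gap):.3g}")
```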
It’s not clear how often their particular pathological special case actually shows up in practice.
I’ve never seen a modern generative model generate realistic samples of natural images or speech. Text generation fares somewhat better, but it’s still far from anything able to pass a Turing test. By contrast, discriminative models for classification or regression trained on large supervised datasets can often achieve human-level or even superhuman performance.
In general, a Solomonoff learner will not have that problem.
Well, duh, but a Solomonoff learner is uncomputable. Inside a Solomonoff learner there would be a simulation of every possible human looking at the samples, among an infinite number of other things.
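For reference, this is the standard textbook formulation, which makes the uncomputability explicit: the prior sums over all programs for a universal prefix machine U whose output starts with the observed string,

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
\qquad
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\,x_{t+1})}{M(x_{1:t})}
```

and deciding which programs belong in that sum already runs into the halting problem.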
At least neural networks are a coherent class of algorithms, with lots of architectural variations and hyperparameters to tune, but still functionally similar. General Bayesian inference, on the other hand, is a broad framework with dozens of types of algorithms for different tasks, based on different assumptions and with different functional structure.
I don’t agree with this memetic taxonomy. I consider neural networks to be mostly synonymous with algebraic tensor networks—general computational graphs over tensors. As such, ANN describes a family of modeling languages, equivalent in expressibility to binary circuit models (and thus Turing universal) but considerably more computationally efficient. The tensor algebra abstraction more closely matches physical hardware reality.
So, as a general computing paradigm or circuit model, ANNs can be combined with any approximate inference technique. Backprop on the log-likelihood is just one obvious approximate method.
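A minimal sketch of what I mean by that view, with numpy standing in for a real tensor framework and logistic regression standing in for a deep network: the forward pass is plain tensor algebra, and “backprop on log-likelihood” is just the gradient of the NLL pushed back through the graph.

```python
# Minimal sketch (mine): a one-layer "network" as a tensor computational
# graph, trained by gradient descent on the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))                      # 64 examples, 5 features
y = (X @ np.array([1., -2., 0.5, 0., 3.]) > 0).astype(float)

W = np.zeros(5)                                   # the model's parameter tensor
lr = 0.1
for _ in range(200):
    logits = X @ W                                # forward: tensor contraction
    p = 1.0 / (1.0 + np.exp(-logits))             # sigmoid nonlinearity
    grad = X.T @ (p - y) / len(y)                 # backward: d(NLL)/dW
    W -= lr * grad                                # gradient step on the NLL
nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
print(f"final NLL: {nll:.3f}")
```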
You could just as well say that once we had formulated the theory of universal computation and had the first digital computers up and running, we had all the math figured out
Not quite, because it took longer for the math for inference/learning to be worked out, and even somewhat longer for efficient approximations—and indeed that work is still ongoing.
Regardless, even if all the math had been available in 1956 it wouldn’t have mattered, as they still would have had to wait 60 years or so for efficient implementations (hardware + software).
The paper I linked, IMHO, may shed some light on why this happened: one of the most popular evaluation measures and training objectives, the negative log-likelihood (aka empirical cross-entropy), which captures well our intuition of what a good model must do in binary (or low-dimensional) classification tasks, may break down in the high-dimensional regime typical of some unsupervised tasks such as sampling.
To the extent that this is a problem in practice, it’s a problem with typical sampling, not with the measure itself. As I mentioned earlier, I believe it can be solved by more advanced sampling techniques that respect total KC/Solomonoff probability. Using these hypothetical correct samplers, good models should always produce good samples.
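I can’t exhibit those samplers, but as a much more modest stand-in for “smarter than naive ancestral sampling”, here is a toy sketch (my own, with made-up numbers, not the technique I’m gesturing at) that rejects draws falling outside the model’s typical set, i.e. draws whose negative log-likelihood is far from the model entropy:

```python
# Toy stand-in (mine, made-up numbers) for "smarter sampling": rejection-
# sample until the draw lands in the model's typical set, i.e. its average
# negative log-likelihood is close to the model entropy.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # toy categorical model
entropy = -np.sum(probs * np.log(probs))        # expected NLL per symbol

def typical_sample(n_symbols=100, tol=0.1, max_tries=1000):
    for _ in range(max_tries):
        draw = rng.choice(len(probs), size=n_symbols, p=probs)
        nll = -np.mean(np.log(probs[draw]))     # empirical NLL of the draw
        if abs(nll - entropy) < tol:            # inside the typical set?
            return draw
    raise RuntimeError("no typical sample found")

print(typical_sample()[:10])
```

This obviously isn’t “respecting total KC/Solomonoff probability”; it’s just the nearest computable gesture in that direction.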
That being said, I agree that generative modelling, and realistic sampling in particular, is an area ripe for innovation.
I’ve never seen a modern generative model generate realistic samples of natural images or speech.
You actually probably have seen this, in the form of CG in realistic video games or films. Of course, those models are hand-crafted rather than learned probabilistic generative models. I believe that cross-fertilization of ideas/techniques between graphics and ML will transform both in the near future.
The current image generative models in ML are extremely weak when viewed as procedural graphics engines—for the most part they are just 2D image blenders.