when we can expect to have brain level hardware performance
I don’t think our progress in creating an AGI is constrained by hardware at this point. It’s a software problem and you can’t solve it by building larger and more densely packed supercomputers.
almost all of them have no idea what they are talking about
Yep :-)
I don’t think our progress in creating an AGI is constrained by hardware at this point
That is arguably only now becoming true for the first time, as we approach the end of Moore’s Law and device geometries shrink to synapse-comparable sizes, densities, etc.
Still, current hardware/software is not all that efficient for the computations that intelligence requires, namely enormous amounts of low-precision, noisy, approximate computing.
It’s a software problem and you can’t solve it by building larger and more densely packed supercomputers.
Of course you can; it just wouldn’t be economical. AGI running on a billion-dollar supercomputer is not practical AGI, because AGI is AI that can do everything a human can do but better, which naturally must include cost.
It isn’t a problem of what math to implement—we have that figured out. It’s a question of efficiency.
AGI running on a billion-dollar supercomputer is not practical
Why not? AGI doesn’t involve emulating Fred the janitor; the first AGI is likely to have a specific purpose, and so will likely have huge advantages over meatbags in the particular domain it was made for.
If people were able to build an AGI on a billion-dollar chunk of hardware right now they would certainly do so, if only as a proof of concept. A billion isn’t that much money to a certain class of organizations and people.
It isn’t a problem of what math to implement—we have that figured out.
Oh, really? I’m afraid I find that hard to believe.
AGI running on a billion-dollar supercomputer is not practical
Why not?
Say you have the code/structure for an AGI all figured out, but it runs in real time on a supercomputer costing a billion dollars per year. You now have to wait decades to train/educate it up to adulthood.
Furthermore, the probability that you get the seed code/structure right on the first try is essentially zero. So, rather obviously, even to get AGI in the first place you need enough efficiency to run one AGI mind in real time on something far, far cheaper than a supercomputer.
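To spell out the cost arithmetic implied above (the 20-year figure for raising a system to adult level is my own round number, chosen only to match “decades”), a rough back-of-the-envelope version is:

\[
\$1\ \text{billion/year} \times 20\ \text{years} \;\approx\; \$20\ \text{billion per candidate seed design}, \qquad k\ \text{candidate designs} \;\Rightarrow\; \approx 20k\ \text{billion dollars}.
\]

So even a handful of failed iterations at that scale runs to tens of billions of dollars, which is the force of the efficiency argument.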
It isn’t a problem of what math to implement—we have that figured out.
Oh, really? I’m afraid I find that hard to believe.
Hard to believe only for those outside ML.
I don’t think that even in ML the school of “let’s just make a bigger neural network” is taken seriously.
Neural networks are prone to overfitting. All the modern big neural networks that are fashionable these days require large amounts of training data. Scale up these networks to the size of the human brain, and, even assuming that you have the hardware resources to run them, you will get something that just memorizes the training set and doesn’t perform any useful generalization.
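A minimal sketch of that overfitting claim, assuming nothing beyond generic tooling (NumPy and scikit-learn) and toy sizes of my own choosing: an over-parameterized network fit to a small dataset whose labels are pure noise will typically memorize the training set while generalizing at chance.

```python
# Sketch: an oversized MLP memorizes 50 random labels (train accuracy near 1.0)
# but tests at chance (~0.5), since there is nothing to generalize.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 20))      # 50 examples, 20 features
y_train = rng.integers(0, 2, size=50)    # labels carry no signal at all
X_test = rng.normal(size=(1000, 20))
y_test = rng.integers(0, 2, size=1000)

model = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # close to 1.0: memorization
print("test accuracy:", model.score(X_test, y_test))     # close to 0.5: no generalization
```

Real data is not pure noise, but the same memorization-versus-generalization gap is what more data and regularization are meant to close.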
Humans can learn from comparatively small amounts of data, and in particular from very little and very indirectly supervised data: you don’t have to show a child a thousand apples and press an “apple” button on their head each time for them to learn what an apple looks like.
There is currently a lot of research in ML on how to make use of unsupervised data, which is cheaper and more abundant than supervised data, but this is still very much an open problem, so much so that it isn’t even clear what properties we want to model and how to evaluate these models (e.g. check out this recent paper). Therefore, the math relevant to ML has definitely not all been worked out.
I don’t think that even in ML the school of “let’s just make a bigger neural network” is taken seriously.
That’s not actually what I meant when I said we have the math figured out. The math behind general learning is just general Bayesian inference in its various forms. The difficulty is not so much in the math; it is in scaling up efficiently.
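For reference, the identity being pointed to is just the standard posterior update (textbook material, not anything specific to this thread):

\[
P(h \mid D) \;=\; \frac{P(D \mid h)\,P(h)}{\sum_{h'} P(D \mid h')\,P(h')}
\]

Writing it down is easy; the sum (or integral) over the hypothesis class in the denominator is what makes exact inference intractable for any rich model family, which is the “scaling up efficiently” problem.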
To a first approximation, the recent surge in progress in AI is entirely due to just making bigger neural networks. As numerous DL researchers have admitted, the new wave of DL is basically just techniques from the ’80s scaled up on modern GPUs.
Regarding unsupervised learning, I wholeheartedly agree. However, one should also keep in mind that UL and SL are just minor variations on the same theme in a Bayesian framework. If you have accurate labeled data, you might as well use it.
In order to recognize and verbally name apples, a child must first have years of visual experience. Supervised DL systems trained from scratch need to learn everything from scratch, even the lowest-level features. The objective in these systems is not to maximize learning from small amounts of training data.
In the limited training data domain and more generally for mixed datasets where there is a large amount of unlabeled data, transfer learning and mixed UL/SL can do better.
properties we want to model and how to evaluate these models (e.g. check out this recent paper).
Just discussing that here.
The only really surprising part of that paper is the “good model, poor sampling” section. It’s not clear how often their particular pathological special case actually shows up in practice. In general a Solomonoff learner will not have that problem.
I suspect that a more robust sampling procedure could fix the mismatch. A robust sampler would be one that outputs samples according to their total probability as measured by encoding cost. This corrects the mismatch between the encoder and the sampler. Naively implemented, this makes the sampling far more expensive, perhaps exponentially so, but it nonetheless suggests the problem is not fundamental.
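The proposal above is speculative, but one textbook procedure in this general spirit is sampling-importance-resampling: draw candidates from whatever cheap (and possibly mismatched) sampler is available, then resample them weighted by the ratio of the encoder’s density to the proposal’s, so that the final samples follow the probabilities the encoder actually assigns. The sketch below is only my own toy rendering of that idea; `proposal_sample`, `proposal_log_prob`, and `model_log_prob` are hypothetical stand-ins rather than any particular library’s API, and, as noted above, the number of candidates required can blow up badly in high dimensions.

```python
import math
import random

def resample_toward_encoder(proposal_sample, proposal_log_prob, model_log_prob,
                            n_candidates=10_000, n_out=10):
    """Sampling-importance-resampling: draw candidates from a proposal sampler,
    then resample with weights p_model(x) / q_proposal(x). With enough candidates
    the output distribution approaches the model's (the encoder's) distribution,
    i.e. samples are chosen according to their encoding cost under the model."""
    candidates = [proposal_sample() for _ in range(n_candidates)]
    log_w = [model_log_prob(x) - proposal_log_prob(x) for x in candidates]
    m = max(log_w)                                   # shift for numerical stability
    weights = [math.exp(lw - m) for lw in log_w]
    return random.choices(candidates, weights=weights, k=n_out)
```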
That’s not actually what I meant when I said we have the math figured out. The math behind general learning is just general Bayesian inference in its various forms. The difficulty is not so much in the math; it is in scaling up efficiently.
OK, but this is even more vague, then. At least neural networks are a coherent class of algorithms, with lots of architectural variations and hyperparameters to tune, but still functionally similar. General Bayesian inference, on the other hand, is a broad framework with dozens of types of algorithms for different tasks, based on different assumptions and with different functional structure.
You could just as well say that once we formulated the theory of universal computation and we had the first digital computers up and running, then we had all the math figured out and it was just a matter of scaling things up. This was probably the sentiment at the famous Dartmouth conference in 1956, where they predicted that ten smart people brainstorming for two months could make significant advances on multiple fundamental AI problems. I think that we know better now.
Regarding unsupervised learning, I wholeheartedly agree. However, one should also keep in mind that UL and SL are just minor variations on the same theme in a Bayesian framework. If you have accurate labeled data, you might as well use it.
Supervised learning may be a special case of unsupervised learning, but not the other way round. Currently we can only do supervised learning well, at least when big data is available. There have been attempts to reduce unsupervised learning to supervised learning, which had some practical success in textual NLP (with neural language models and word vectors) but not in other domains such as vision and speech.
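The containment claimed in the first sentence can be written in one line (standard probability, nothing specific to this thread): an unsupervised model of the joint distribution over inputs and labels immediately yields the supervised conditional, whereas a conditional model says nothing about the inputs themselves:

\[
p(y \mid x) \;=\; \frac{p(x, y)}{\sum_{y'} p(x, y')}, \qquad \text{while } p(y \mid x) \text{ alone determines nothing about } p(x).
\]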
The paper I linked, IMHO, may shed some light on why this happened: one of the most popular evaluation measures and training objectives, the negative log-likelihood (aka empirical cross-entropy), which captures well our intuition of what a good model must do in binary (or low-dimensional) classification tasks, may break down in the high-dimensional regime typical of some unsupervised tasks such as sampling.
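For concreteness, the quantity in question is the average negative log-likelihood of held-out data under the model (a standard definition):

\[
\mathrm{NLL}(\theta) \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i),
\]

i.e. the empirical cross-entropy between the data distribution and the model, or equivalently the average code length the model assigns to the data. The quoted paper’s argument is that in high dimensions this number and perceptual sample quality can come apart in both directions.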
It’s not clear how often their particular pathological special case actually shows up in practice.
I’ve never seen a modern generative model generate realistic samples of natural images or speech. Text generation fares somewhat better, but it’s still far from anything able to pass a Turing test. By contrast, discriminative models for classification or regression trained on large supervised data can often achieve human-level or even super-human performances.
In general a Solomonoff learner will not have that problem.
Well, duh, but a Solomonoff learner is uncomputable. Inside a Solomonoff learner there would be a simulation of every possible human looking at the samples, among an infinite number of other things.
At least neural networks are a coherent class of algorithms, with lots of architectural variations and hyperparameters to tune, but still functionally similar. General Bayesian inference, on the other hand, is a broad framework with dozens of types of algorithms for different tasks, based on different assumptions and with different functional structure.
I don’t agree with this memetic taxonomy. I consider neural networks to be mostly synonymous with algebraic tensor networks: general computational graphs over tensors. As such, ANN describes a modeling-language family, equivalent in expressibility to binary circuit models (and thus Turing universal) but considerably more computationally efficient. The tensor-algebra abstraction more closely matches physical hardware reality.
So as a general computing paradigm or circuit model, ANNs can be combined with any approximate inference technique. Backprop on log-likelihood is just one obvious approximate method.
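A minimal sketch of the “computational graph over tensors” framing, using plain NumPy and parameter shapes I made up: the “network” is nothing but a few tensor operations, and fitting it by backprop on a log-likelihood is just one choice of approximate inference (point estimation) over its parameters; MCMC or variational posteriors over the same tensors would be others.

```python
# Toy computational graph over tensors: y = W2 @ relu(W1 @ x).
# The parameters are just tensors; how they are inferred (backprop/MLE,
# variational inference, MCMC, ...) is a separate choice from the graph itself.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 10)) * 0.1
W2 = rng.normal(size=(1, 64)) * 0.1

def forward(x):
    h = np.maximum(W1 @ x, 0.0)    # elementwise ReLU over a tensor
    return W2 @ h                  # another tensor contraction

x = rng.normal(size=10)
print(forward(x))
```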
You could just as well say that once we formulated the theory of universal computation and we had the first digital computers up and running, then we had all the math figured out
Not quite, because it took longer for the math for inference/learning to be worked out, and even somewhat longer for efficient approximations—and indeed that work is still ongoing.
Regardless, even if all the math had been available in 1956, it wouldn’t have mattered, as they still would have had to wait 60 years or so for efficient implementations (hardware + software).
The paper I linked, IMHO, may shed some light on why this happened: one of the most popular evaluation measures and training objectives, the negative log-likelihood (aka empirical cross-entropy), which captures well our intuition of what a good model must do in binary (or low-dimensional) classification tasks, may break down in the high-dimensional regime typical of some unsupervised tasks such as sampling.
To the extent that this is a problem in practice, it’s a problem with typical sampling, not with the measure itself. As I mentioned earlier, I believe it can be solved by more advanced sampling techniques that respect total KC/Solomonoff probability. Using these hypothetical correct samplers, good models should always produce good samples.
That being said, I agree that generative modelling, and realistic sampling in particular, is an area ripe for innovation.
I’ve never seen a modern generative model generate realistic samples of natural images or speech.
You have actually probably seen this, in the form of CG in realistic video games and films. Of course, those models are hand-crafted rather than learned probabilistic generative models. I believe that cross-fertilization of ideas and techniques between graphics and ML will transform both in the near future.
The current image generative models in ML are extremely weak when viewed as procedural graphics engines—for the most part they are just 2D image blenders.
Say you have the code/structure for an AGI all figured out
How would you know that you have it “all figured out”?
[Furthermore], the probability that you get the seed code/structure right on the first try is essentially zero
Err… didn’t you just say that it’s not a software issue and we have already figured out what math to implement? What’s the problem?
Right… build a NN a mile wide and a mile deep and let ’er rip X-/
No, I never said it is not a software issue—because the distinction between software/hardware issues is murky at best, especially in the era of ML where most of the ‘software’ is learned automatically.
You are trolling now—cutting my quotes out of context.