Do you know of any good resources for work exploring optimality and generality proofs for practical unsupervised learners?
Yes. There were a couple of DL theory papers in the last year or so that generated some excitement on r/machinelearning:
“Provable bounds for learning some deep representations.”
“On the computational efficiency of training neural networks.”
I’m not a huge fan of hard theory and optimality proofs, but even I found the first paper in particular somewhat insightful.
Machine learning was somewhat obsessed with formal theory (provable bounds, convex learning, etc.) in the last decade, and I believe it held back progress.
The problem is that to prove any hard results, you typically have to simplify the model down to the point at which it loses all relevance to solving real world problems.
The recent advances in deep learning come in part from the scruffies/experimental researchers saying “screw hard theory” and just forging ahead, and also in part from advances in informal theoretical understanding (what one may call wisdom rather than math). The work of Bengio and Hinton in particular is rich with that kind of wisdom.
In particular, see this comment from Hinton, and a related comment here from Bengio.
The recent advances in deep learning come in part from the scruffies/experimental researchers saying “screw hard theory” and just forging ahead
Yeah I get that vibe. But ultimately I’m looking into this because I want to understand the top-down picture, not because of specific applications (Well, integrating logical uncertainty with induction would be nice, but if that works it’s more of a theoretical milestone than an application). Research on neural networks is actually more specific than what I want to read right now, but it looks like that’s what’s available :P (barring a literature search I should do before reading more than a few NN papers)
One of the issues is that if you try to prove generalization bounds for anything that has a universal representation property (e.g. neural networks), you always get terrible bounds that don't reflect empirical performance on real-world learning tasks.
I think this is because there are probably some learning tasks which are intrinsically intractable (this is certainly the case if one-way functions exist), so any bound that quantifies over all learning tasks will be at least exponential.
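To make the "terrible bounds" point concrete, here is the standard distribution-free VC generalization bound, written out as a sketch (the notation is mine, not taken from either paper above):

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% uniformly over every hypothesis h in a class \mathcal{H} of VC dimension d:
\[
  R(h) \;\le\; \widehat{R}_m(h)
  \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}
\]
% For a class rich enough to be a universal approximator, d scales with the
% number of parameters, so for realistic sample sizes m the square-root term
% exceeds 1 and the bound is vacuous; that is the mismatch with empirical
% performance described above.
```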
But ultimately I’m looking into this because I want to understand the top-down picture, not because of specific applications
For that, I really really recommend "Representation Learning: A Review and New Perspectives" (Bengio, Courville, and Vincent). Quote from the abstract:
“This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.”
It really covers the big-picture view of machine learning as a whole, from multiple viewpoints (statistical, geometric, optimization, etc.).
Well, integrating logical uncertainty with induction would be nice, but if that works it’s more of a theoretical milestone than an application)
That’s done or being done. Look into “probabilistic programming”.
That’s done or being done. Look into “probabilistic programming”.
This doesn’t look like what I want—have you read my take on logical uncertainty? I still stand by the problem statement (though I’m more uncertain now about minimizing Shannon vs. Kolmogorov information), even if I no longer quite approve of the proposed solution.
EDIT: If I were to point to what I no longer approve of, it would be that while in the linked posts I focus on what’s mandatory for limited reasoners, I now think it should be possible to look at what’s actually a good idea for limited reasoners.
Probabilistic programming seems to assign distributions to programs that involve random variables, but logical uncertainty assigns distributions to programs without random variables :P
This doesn’t look like what I want—have you read my take on logical uncertainty?
A little. Yes, approximation is key for practical performance. This is why neural nets and related models work so well: they allow one to efficiently search over model/function/program space for the best approximate models that fit memory/compute constraints.
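As a rough illustration of that "search over function space under a compute budget" framing (a minimal sketch, not tied to any particular paper or library), plain gradient descent on a tiny fixed-size network is already such a search:

```python
# Minimal sketch: gradient descent as local search over a parameterized
# family of functions, with the parameter count acting as the budget.
# Assumptions: numpy only, a toy 1-D regression target.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 1))
y = np.sin(3 * X)                              # function we want to approximate

H = 16                                         # budget: 16 hidden units
W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 1.0 / np.sqrt(H), (H, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    h = np.tanh(X @ W1 + b1)                   # current candidate function
    pred = h @ W2 + b2
    err = pred - y
    loss = float(np.mean(err ** 2))
    g_pred = 2 * err / len(X)                  # gradients of the mean squared error
    g_W2 = h.T @ g_pred;        g_b2 = g_pred.sum(0)
    g_h = g_pred @ W2.T
    g_pre = g_h * (1 - h ** 2)
    g_W1 = X.T @ g_pre;         g_b1 = g_pre.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_b1           # step to a nearby, better function
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"MSE after local search with a 16-unit budget: {loss:.4f}")
```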
Probabilistic programming seems to assign distributions to programs that involve random variables, but logical uncertainty assigns distributions to programs without random variables :P
Code and variables are interchangeable, so of course probabilistic programming can model distributions over programs. For example, I could create a VM or interpreter with a big array of 'variables' that define opcodes. Graphical models and networks also encode programs as variables. There is no hard distinction between data/code/variables.
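A toy sketch of that opcode-as-variable idea (everything below is made up for illustration and is not a real probabilistic programming system): treat each opcode of a tiny VM as a random variable, and a distribution over programs falls out, including a posterior once you condition on the output.

```python
# Toy sketch: opcodes as random variables => a distribution over programs.
# Rejection sampling gives the posterior over programs given an observed output.
import random
from collections import Counter

OPCODES = {
    "INC":    lambda x: x + 1,
    "DOUBLE": lambda x: x * 2,
    "SQUARE": lambda x: x * x,
}

def sample_program(length=3):
    # each slot is a categorical random variable over the opcode table
    return tuple(random.choice(list(OPCODES)) for _ in range(length))

def run(program, x=1):
    for op in program:
        x = OPCODES[op](x)
    return x

random.seed(0)
posterior = Counter()
for _ in range(20000):
    prog = sample_program()
    if run(prog) == 4:                 # condition on the observed output
        posterior[prog] += 1

total = sum(posterior.values())
for prog, count in posterior.most_common():
    print(prog, round(count / total, 3))
```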
Hm, I think we're talking past each other. To give an example of what I mean: if a probabilistic programming language really implemented logical uncertainty, you could write a deterministic program that computes the last digit of the googolth prime number (possible outputs 0 through 9), then call some getDistribution method on this deterministic program, and it would return the distribution {0.25 chance of 1, 0.25 chance of 3, 0.25 chance of 7, 0.25 chance of 9} (unless your computer actually has enough power to calculate the googolth prime).
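A sketch of what such a getDistribution might look like (entirely hypothetical; no existing probabilistic programming library offers this): it takes a deterministic zero-argument program and, when it cannot afford to actually run it, falls back to a cheap prior over the output.

```python
# Hypothetical sketch of a logically-uncertain getDistribution.
# Nothing here is a real PPL API; it just illustrates the intended behavior.

def last_digit_of_googolth_prime():
    # Perfectly deterministic, but far beyond any realistic compute budget:
    # the last digit of the 10**100-th prime number.
    raise NotImplementedError("cannot be computed with available resources")

def get_distribution(program):
    """Return a belief distribution over the program's output.

    If the program can actually be run, all probability mass goes on its
    result. Otherwise, fall back to cheap reasoning: every prime > 5 ends
    in 1, 3, 7, or 9, and those residues are asymptotically equidistributed.
    """
    try:
        return {program(): 1.0}
    except NotImplementedError:
        return {1: 0.25, 3: 0.25, 7: 0.25, 9: 0.25}

print(get_distribution(last_digit_of_googolth_prime))
# {1: 0.25, 3: 0.25, 7: 0.25, 9: 0.25}
```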
I think your example is—at least theoretically—still within the reach of a probabilistic logic program, although the focus is entirely on approximation rather than conditioning (there are no observations). Of course, I doubt any existing inference engine would be able to help much with problems of that type, but the idea is still general enough to cover estimating intractable functions from the logic defining the solution.
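For instance, the kind of answer an approximation-focused engine might settle on can be mimicked with plain Monte Carlo (a toy stand-in, not an actual inference engine): sample random primes of manageable size and tally their last digits.

```python
# Toy stand-in for approximate inference: estimate the last-digit
# distribution of "a large prime" by sampling, since the exact target
# (the googolth prime) is out of reach.
import random
from collections import Counter

def is_prime(n):
    # simple trial division; adequate for the 7-digit numbers sampled below
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

random.seed(0)
counts = Counter()
while sum(counts.values()) < 2000:
    candidate = random.randrange(10**6, 10**7)
    if is_prime(candidate):
        counts[candidate % 10] += 1

total = sum(counts.values())
print({digit: round(c / total, 3) for digit, c in sorted(counts.items())})
# roughly uniform over {1, 3, 7, 9}
```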