I’ve started learning Machine Learning (he!), and upon reading the first chapter of the most famous textbook I was already gasping for air.
For someone like me who grew into probability with Jaynes’ book, seeing in the first chapter that algorithms are trained on the same data multiple times (cross-validation) was… annoying, let’s say (I actually screamed at the book).
Is there a sane textbook on machine learning? I don’t demand one that starts from objective bayesianism, that would be asking too much. But at least something that assumes bayesianism as a foundation? Pretty please?
There are two ways to train algorithms ‘multiple times’ on the same data. The bad one is data duplication; cross-validation is the good one. Data duplication is the sort of thing Jaynes would have been worried about, because it means you’re counting evidence from the same piece of data twice, so your model ends up with illusory precision.
But what does cross-validation do? There’s an issue called “overfitting”: any statistical procedure performed on a training set will fit both the signal and the noise in that set. The signal on a test set will presumably be the same, but the noise will be different, so the model will do worse there. Single validation is when you split your data into two parts, a training set and a test set, so that you can see how well a model trained on the training set does on the test set. When there’s a tunable parameter in the training method, people will sometimes optimize the tunable parameter against data in the test set.*
But to do one split and leave it at that is wasteful. Cross-validation is when you partition the data many times and fit many different models, and can thus talk about how the population of models behaves. In particular, consider the case of ‘leave-one-out’ cross-validation, where in a dataset of n points we train n different models, each time using n−1 datapoints to fit the model parameters and testing on the one datapoint left out. This gives each individual model as much training data as possible while still leaving us a test dataset to determine how resilient to overfitting our model-generation procedure is.
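To make the mechanics concrete, here is a minimal sketch of leave-one-out cross-validation in Python. The model is deliberately trivial (it just predicts the mean of its training points), and the function names are mine, not from any particular library:

```python
# Minimal sketch of leave-one-out cross-validation (pure Python).
# The "model" is deliberately trivial -- it predicts the mean of
# its training data -- so the CV mechanics stay visible.

def fit_mean(points):
    """'Train' the model: the fitted parameter is the sample mean."""
    return sum(points) / len(points)

def loo_cv_error(data):
    """Average squared error over the n leave-one-out splits."""
    n = len(data)
    errors = []
    for i in range(n):
        train = data[:i] + data[i + 1:]   # n-1 points to fit on
        held_out = data[i]                # the 1 point left out
        prediction = fit_mean(train)
        errors.append((held_out - prediction) ** 2)
    return sum(errors) / n

print(loo_cv_error([1.0, 2.0, 3.0, 4.0]))  # average squared LOO error
```

Each of the n models sees as much data as possible, and the n held-out errors together estimate how the model-fitting procedure generalizes.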
* The principled way to do this is to split the data three times, into a training set (which the algorithm always has access to), a validation set (which the algorithm only has access to when setting the tunable parameters), and then a test set (which the algorithm never has access to, but is used to assess how well the model does after the tunable parameter has been optimized).
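A rough sketch of that three-way split; the 60/20/20 proportions are an illustrative assumption, not a standard:

```python
# Hypothetical sketch of the principled three-way split: training,
# validation (for tuning), and test (never touched until the end).
# The 60/20/20 fractions are an arbitrary choice for illustration.
import random

def three_way_split(data, seed=0):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)      # deterministic shuffle
    n = len(data)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = [data[i] for i in idx[:n_train]]
    val = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

The point of the third set is that the test data influences neither the parameters nor the tunable hyperparameters, so the final score is an honest estimate.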
Allow me to quote directly from the book:

The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m.
So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.
Besides, even cross-validation for model-selection is suspicious. Shouldn’t I, ideally, train all models with all the data and form a posterior over the most probable values?
Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This is a very similar approach to what are called ‘hierarchical Bayesian models.’)
Instead of pulling a prior out of thin air for the hyperparameters, this asks the question “which hyperparameters are best for generalizing models to test sets outside the training set?”, which is a different question from “which parameters maximize the likelihood of this data?”
(I should add that some people call it ‘cross-tuning’ to report a model whose hyperparameters have been selected by this sort of process, if there’s no third dataset used for testing that was not used for tuning. Standard practice in ML is to still refer to it as ‘cross-validation.’)
If you do this, how will you get an estimate of how well your model is able to predict outside of the training set?
But once they do have the hyperparameter in place, this is what they do—they fit the model on the full training data, so that they can make the most use of everything.
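The recipe under discussion — score each candidate setting of the tunable parameter θ by cross-validation, pick the best, then refit once with that setting on the full sample — can be sketched as follows. The ‘shrunken mean’ estimator here is a made-up toy, used only to keep the selection loop visible:

```python
# Sketch of: choose theta by k-fold cross-validation, then retrain
# on the FULL sample with that setting. The model is a toy
# "shrunken mean" estimator (theta > 0 shrinks toward zero),
# invented here purely for illustration.

def fit(points, theta):
    return sum(points) / (len(points) + theta)

def cv_error(data, theta, k=4):
    """k-fold cross-validation error for one setting of theta."""
    n = len(data)
    fold = n // k
    total = 0.0
    for j in range(k):
        held_out = data[j * fold:(j + 1) * fold]
        train = data[:j * fold] + data[(j + 1) * fold:]
        m = fit(train, theta)
        total += sum((x - m) ** 2 for x in held_out)
    return total / n

def tune_and_refit(data, candidates):
    best = min(candidates, key=lambda th: cv_error(data, th))
    return best, fit(data, best)   # final fit uses ALL the data

data = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
theta0, model = tune_and_refit(data, [0.0, 0.5, 1.0, 2.0])
```

Note that the data is indeed used twice, but for two different questions: the CV loop asks which θ generalizes, while the final fit asks for the best parameters given that θ.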
^ The above post is an illustration of the danger of LW-style Bayes. Below is a non-crazy discussion (e.g. one where people don’t scream):
http://andrewgelman.com/2013/12/10/cross-validation-bayesian-estimation-tuning-parameters/
Unfortunately the discussion is above my current understanding. But by glancing at the comments I caught this:

remember that many people in machine learning are frequentist (or have not yet learned the Bayesian arts) and don’t really have any other means of tuning hyperparameters so they jump on whatever methods might be available.
Which explains why it’s going to be so difficult for me to learn ML. It’s like I’m forced to learn Aristotelian physics. Aaargh!
The relationship between F and B is not like the relationship between Aristotelian physics and relativity. Not at all.
I’m very tempted to argue that it is!
But what I wanted to convey is that it feels like I’m supposed to learn something which is manifestly inferior, in its logical foundation, than what is already known and available.
And maybe under the constraint of computational cost the finishing point of the Bayesian and the frequentist approach is the same, but where’s the proof? Where’s the place where someone says: “This is Bayesian machine learning, but it’s computationally too costly. So by making such-and-such simplifying assumptions, we end up with frequentist machine learning.”?
Instead, what I read are things like: “In practice, Bayesian optimization has been shown to obtain better results in fewer experiments than grid search and random search” (from here).
I would urge you to follow ChristianKI’s advice, since I suspect you probably know much less than you think you know about either Bayesian or frequentist statistics. Perhaps you could explain in your own words why exactly it is clear that the ML book you are reading is “manifestly inferior” to your preferred approach?
Also consider reading this: A Fervent Defense of Frequentist Statistics.
There is a bit of confusion here. I’m not stating that frequentist machine learning is inferior to Bayesian machine learning. I’m stating that Bayesian probability is superior to frequentist probability.
Why do I say this? Because in all the cases that I know, either a Bayesian model can be reduced to a frequentist one or the Bayesian model gives more accurate predictions.
That said, not even this is a problem. Since I’m learning the subject, I’m not at the stage of saying “this sentence is wrong”. I’m at the stage of “this sentence doesn’t make sense in the context of Bayesianism”. So I’m asking “is there a book that teaches ML from a Bayesian point of view?”.
The answer I’m discovering, appallingly but maybe not so, is no.
As for the fervent defence, under the premises elucidated in the comments, I hold none of the myths, so it doesn’t apply.
I typically see this stated as “there is a Bayesian interpretation for every effective statistical technique.” As pointed out elsewhere, typically people use “frequentist” to mean “non-Bayesian,” which is not particularly effective as a classification.
Did you google Bayesian Machine Learning, or search for it on Amazon? Barber is a well-rated textbook available online for free. (I haven’t read it; Sebastien Bratieres thinks it’s comparable to Murphy, the second most popular ML book, which is Bayesian.) Incidentally, Bishop, the most popular ML book, is also Bayesian. You managed to find the only ML textbook I’ve seen where one of the Amazon reviews positively remarks that the book is not Bayesian!
The more meta point here is to not let a worldview shut you out from potentially useful resources. Yes, Bayesianism is the best philosophy of probability, but that does not mean it is the most effective practice of statistics, and excluding concepts or practices from your knowledge of statistics because of a disagreement on philosophy is parochial and self-limiting.
Reducing a frequentist model to a Bayesian one, though, is not a pointless exercise, since it elucidates the hidden assumptions, and at least you become better aware of its field of applicability.
Only after buying the book I have :/ Bishop, though, seems very interesting. Thanks!
Thankfully, I’m learning ML for my own education, it’s not something I need to practice right now.
You’re welcome! I should point out that the other words I was considering using to describe Bishop are “classic” and “venerable”—it’s not out of date (most actively used ML methods are surprisingly old), but you may want to read it in parallel with Barber. (In general, if you’ve never read textbooks in parallel before, I recommend it as a lesson in textbook design / pedagogy.)
Using Bishop in my class this Fall, very popular for good reason.
I think it’s very useful to be able to listen to someone with domain expertise telling you when you are wrong when you are a beginner.
But then I’m allowed to ask “why?”, and if the answer is “because I say so”, then I feel pretty confident to dismiss the expert.
But that’s not even the stage I’m at. A book is not an interactive medium, so the exchange has gone like this:
book: Cross-validation!
me: “Gaaaak! That sounds totally wrong! Is there anyone who can explain to me either why this is right or, if it’s actually wrong, what the correct approach is?”
I’m still searching for an answer...
Try this paper or page 403 of this textbook.
Also, although in this case there seems to be an available answer, I don’t think it makes sense to always expect that. Sometimes people find a technique that tends to work in practice and then only later come up with a theoretical explanation of why it works. If you happen to live in the period in between...
He! I’ve suddenly remembered that LW was founded exactly because the fields of AI and ML used too much frequentist (il)logic. The Sequences were supposed to restore sanity in the field.
Anyway, the textbook you mentioned seems pretty cool, thank you very much!
I’m no expert at machine learning. However, as far as I remember, the point of doing cross-validation is to find out whether your model is robust. Robustness is not a standard “Bayesian” concept. Maybe you don’t appreciate its value?
I would appreciate it if there were an explanation of why something is done the way it is. Instead it’s all about learning the passwords. Maybe it’s just that the main textbook in the field is pedagogically bad; it wouldn’t be the first time.
Getting deep understanding of a complex field like machine intelligence isn’t easy. You shouldn’t expect it to be easy and something that you can acquire in a few days.
This is probably very arrogant of me to say, but my advice would be: “Listen to the domain expert when he tells you what you should do… and then find a Bayesian and let them explain to you why that works.”
In my defense, this was my personal experience with statistics at school. I was very good at math in general, but statistics somehow didn’t “click”. I always had this feeling as if what was explained was built on some implicit assumptions that no one ever mentioned explicitly, so unlike with the rest of the math, I had no choice here but to memorize that in situation x you should do y, because, uhm, that’s what my teachers told me to do.

More than ten years later, I read LW, and here I am told that yes, the statistics I was taught does have implicit assumptions, and suddenly it all makes sense. And it makes me very angry that no one told me this stuff at school.

I am a “deep learner” (this, not this), and I have a problem learning something when I am told how but can’t find out why. Most people probably don’t have a problem with this: they are told how, and they do, and can be quite successful with it; probably later they also get an idea of why. But I need to understand the stuff from the very beginning, otherwise I can’t do it well. Telling me to trust a domain expert does not help; I may put big confidence in how, but I still don’t know why.
ChristianKI is not telling you to trust a domain expert, but rather to read / listen to the domain expert long enough to understand what they are saying (rather than instantly assuming they are wrong because they say something that seems to conflict with your preconceived notions).
I think if you were to read most machine learning books, you would get quite a lot of “why”. See this manuscript for instance. I don’t really see why you think that Bayesians have a monopoly on being able to explain things.
I think you make a mistake if you put a school teacher who doesn’t understand statistics on a deep level into the same category of academic machine learning experts who don’t happen to be “Bayesians”.
Ok, thank you for your time.
There is the probabilistic programming community which uses clean tools (programming languages) to hand construct models with many unknown parameters. They use approximate bayesian methods for inference, and they are slowly improving the efficiency/scalability of those techniques.
Then there is the neural net & optimization community, which uses general automated models. It is more ‘frequentist’ (or perhaps just ad hoc), but there are also now some bayesian inroads there. That community has the most efficient/scalable learning methods, but it isn’t always clear what tradeoffs they are making.
And even in the ANN world, you sometimes see bayesian statistics brought in to justify regularizers or to derive stuff—such as in variational methods. But then for actual learning they take gradients and use SGD, with the understanding that SGD is somehow approximating the bayesian inference step, or at least doing something close enough.
Eventually it makes sense, I promise. “Bayesianism” in the sense of keeping track of every hypothesis is very computationally expensive—modern algorithms only keep track of a very small number of hypotheses (only those representable by a neural network [or what have you], and even then only those required to do gradient descent). This fact opens you up to the overfitting problem, where the simplest perfect hypothesis in your space actually has very little information about the true external reality. You need some way of throwing away the parts of the signal that your model wasn’t going to figure out anyhow.
For this reason among others, modern machine learning algorithms often have a lot of settings that have to be set by smarter systems (humans), before your algorithm can actually learn a novel domain. These settings reflect how the properties of the domain interact with properties of your algorithm (e.g. how many resources the algorithm has to commit before it can expect to have found something good, or what degree of noise the algorithm has to learn to throw away). These are those “hyperparameter” things. Cross-validation is just an empirical tool that helps humans figure out the right settings. You can probably figure out why it’s expected to work.
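As a toy illustration of the overfitting problem described above, the snippet below fits polynomials of increasing degree to noisy data. Training error necessarily shrinks as the model gets more flexible, because the flexible model can fit the noise; the error on fresh data is what tells you whether that flexibility was warranted. The data-generating function and noise level are arbitrary choices:

```python
# Toy overfitting demo: flexible models fit the noise in their
# training set. The linear "signal" and the 0.3 noise level are
# arbitrary illustrative choices; the RNG is seeded for
# reproducibility.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return 0.5 * x  # the underlying "signal"

x_train = np.linspace(0.0, 1.0, 15)
y_train = true_f(x_train) + rng.normal(0.0, 0.3, size=x_train.shape)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_f(x_test) + rng.normal(0.0, 0.3, size=x_test.shape)

def errors(degree):
    """Train/test mean squared error of a degree-d polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for d in (1, 4, 9):
    tr, te = errors(d)
    print(f"degree {d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The higher-degree fits always look better on the training set; cross-validation is a way of noticing, before deployment, that this apparent improvement is illusory.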
I upvoted because I understand the rationale, I understand the explanation, I just rather wish that a book whose purpose is to teach the subject wouldn’t be so… ad hoc.