Start with the base model, the decision tree. It’s simple and provides representations that may actually be understandable, which is rare in ML, but it has a problem: it sucks. Well, not always, but for many tasks it sucks. Its main limitation is that it can’t efficiently represent linear relations unless the underlying hyperplane is parallel to one of the input feature axes. And most practical tasks involve linear relations plus a bit of non-linearity. Training a decision tree on these tasks tends to yield very large trees that overfit (essentially, you end up storing the training set in the tree, which then acts like a lookup table).
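To make the axis-parallel limitation concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data is made up purely for illustration) of a single tree fit to a plain linear target. The tree can only cut the space with axis-aligned splits, so it ends up with a large number of nodes just to approximate a tilted plane:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(500, 2))
    # A purely linear target (plus a little noise): the underlying hyperplane
    # is not parallel to either feature axis, which is the hard case for a tree.
    y = X[:, 0] + X[:, 1] + rng.normal(scale=0.05, size=500)

    tree = DecisionTreeRegressor().fit(X, y)
    # An unrestricted tree essentially memorizes the sample: expect hundreds
    # of nodes to represent what a linear model captures with two weights.
    print("nodes in the fitted tree:", tree.tree_.node_count)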
Fortunately, it was discovered that if you take a linear combination of the outputs of a sizeable-but-not-exceptionally-large number of appropriately trained decision trees, then you can get good performance on real-world tasks. In fact, it turns out that the coefficients of the linear combination aren’t terribly important: a simple average will do.
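As a sketch of that aggregation step (the function name is mine, not anything standard), the ensemble output is just the unweighted mean of the individual trees’ predictions:

    import numpy as np

    def ensemble_predict(trees, X):
        # trees: any iterable of already-fitted models exposing .predict;
        # no learned combination weights, just a plain average.
        return np.mean([t.predict(X) for t in trees], axis=0)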
So the issue is how to appropriately train these decision trees. You want these trees to be as independent from each other as possible, conditioned on the true relation. This means that ideally you would train each of them on a different training set sampled from the underlying true distribution, that is, you would need enough training data for each tree. But training data is expensive (ok, it used to be expensive in the pre-big-data era) and we want to learn an effective model from as little data as possible. The second requirement is that each decision tree must not overfit. In the tradeoff between overfitting and underfitting, you prefer underfitting the individual models, since the model averaging at the end can take care of it.
Random forests use two tricks to fulfill these requirements:
The first one is bootstrap aggregating, aka “bagging”: instead of gathering from the true distribution m training sets of n examples each, one for each of your m decision trees, you generate m-1 alternate sets by resampling your original training set of n examples with replacement. It turns out that, for reasons not entirely well understood, these m datasets behave in many ways as if they were independently sampled from the true distribution. This is an application of a technique known in statistics as bootstrapping, which has some asymptotic theoretical guarantees under conditions that probably don’t apply here, but nevertheless empirically it often works well.
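A minimal sketch of the resampling step (illustrative names, assuming NumPy): each tree gets its own size-n resample of the single original training set, drawn with replacement.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    # n draws with replacement from the n original rows; a tree would then be
    # trained on X[idx], y[idx] rather than on freshly gathered data.
    idx = rng.integers(0, n, size=n)
    # On average only about 63% of the original rows show up in a given
    # resample, some of them repeated, which is what decorrelates the trees.
    print("distinct original rows used:", len(np.unique(idx)))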
The second one is the random subspace method, which is just a fancy term for throwing features away at random, in a different way for each decision tree. This makes it harder for each decision tree to overfit, since it has a reduced number of degrees of freedom; in particular, it becomes harder to get high training accuracy by relying on some spurious pattern that appears in the training set but is disrupted by throwing some features away. Note that you are not throwing away features from the whole model: it’s only the internal decision trees that each train on limited information, while the model overall still trains, with high probability, on all the information contained in all the features. The individual trees underfit compared to trees trained on all features, but the averaging at the end compensates for this. Yes, there are some tasks where throwing features away is guaranteed to hurt the accuracy of the individual decision trees to the point of making the task impossible, e.g. the parity problem, but for practical tasks, again for reasons not entirely well understood, it works reasonably well.
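Putting the two tricks together, here is a hedged, hand-rolled sketch (assuming NumPy and scikit-learn; every name and the choice of d/3 features per tree are illustrative, not prescribed by anything above). Note that library implementations such as scikit-learn’s RandomForestRegressor typically resample features per split rather than per tree; this version follows the per-tree random subspace description above.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_forest(X, y, n_trees=50, n_features=None, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        n_features = n_features or max(1, d // 3)  # arbitrary illustrative choice
        forest = []
        for _ in range(n_trees):
            rows = rng.integers(0, n, size=n)                     # bagging: resample rows
            cols = rng.choice(d, size=n_features, replace=False)  # random subspace: drop features
            tree = DecisionTreeRegressor().fit(X[np.ix_(rows, cols)], y[rows])
            forest.append((tree, cols))
        return forest

    def predict_forest(forest, X):
        # Each tree only sees the columns it was trained on; the forest's
        # output is the plain average of the individual predictions.
        return np.mean([tree.predict(X[:, cols]) for tree, cols in forest], axis=0)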
With these tricks, random forests manage to be the state-of-the-art technique for a large class of ML tasks: any supervised task (in particular classification) that is difficult enough that simple linear methods won’t suffice, but not so difficult that you need a very big dataset, where neural networks would dominate (or that is that difficult, but the big dataset is not available), and that doesn’t have logical depth greater than three, like the aforementioned parity problem (although it’s not clear how common those are in practice).
A full understanding of why random forests work would require a Bayesian argument with an assumption about the prior on the data distribution (Solomonoff? Levin? something else?). This is not currently known for random forests, and in fact, AFAIK, it has been done only for very simple ML algorithms, using simplifying assumptions on the data distribution such as Gaussianity. If that’s the level of rigor you are looking for, then I’m afraid you are not going to find it in the discussion of any practical ML algorithm, at least so far. If you enjoy the discussion of math/statistics-based methods even when some points are justified only by empirical evidence rather than proof, then you may find the field interesting.