Random forest is a machine learning algorithm designed to produce reliable predictions by combining many decision trees built on random subspaces of the data.

## What is the random forest algorithm?

Used in machine learning, the random forest is a prediction algorithm whose foundations were laid in 1995 by Tin Kam Ho (the random subspace method), then formally proposed by Leo Breiman and Adele Cutler in 2001. As we will see, it combines the notions of random subspaces and bagging.

A random forest is composed of many decision trees, each trained independently on a random subset of the learning data set (the bagging method). Each tree produces its own estimate, and it is the combination of these results that gives the final prediction, with a reduced variance. In short, the idea is to draw on several different opinions about the same problem in order to understand it better.

## How does a regression random forest work?

A random forest can be used for regression. Based on the bagging system described above, a regression random forest schematically consists in averaging the forecasts produced by all the decision trees of the forest.
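As a minimal sketch, with hypothetical numbers rather than a trained forest, averaging the trees' estimates looks like this:

```python
# Hypothetical estimates produced by five decision trees for one input;
# in a real forest each tree would be trained on its own random subset.
tree_predictions = [14.0, 15.0, 14.5, 14.0, 15.0]

# The regression forest's final forecast is the mean of these estimates.
forest_prediction = sum(tree_predictions) / len(tree_predictions)
print(forest_prediction)  # 14.5
```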

## How does a classification random forest work?

In a classification random forest, the final estimate consists of choosing the most frequent response category. Rather than averaging all the results obtained, the forest takes a majority vote: the forecast that comes up most often wins.
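A minimal sketch of this majority vote, with hypothetical class votes from the individual trees:

```python
from collections import Counter

# Hypothetical responses from the trees of the forest for one candidate.
tree_votes = ["bird", "fish", "bird", "bird", "fish", "bird"]

# The classification forest keeps the most frequent response category.
forest_class = Counter(tree_votes).most_common(1)[0][0]
print(forest_class)  # bird
```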

As you will have understood, the main advantage of these trees is their simplicity. It suffices to predefine the discriminating factors, which allow us to build the tree: its nodes are the tests on these factors ("Does the candidate have feathers?", "What color is it?") and its leaves are the final classes ("The candidate is a fish", "The candidate is a bird").

Better still, what if you did not even have to design the tests yourself? There are indeed algorithms that, from a set of initial data (serving as learning examples), can deduce which tests will be relevant for determining the class of your input.

Since the interest of the tree lies in its simplicity, it seems natural to limit the number of nodes: this makes it possible to find out as quickly as possible which class the candidate belongs to. The tree is then built step by step, always seeking to answer the same question: which attribute maximizes the information gain at this step?

## Construction of the tree, introduction of the notion of entropy

To answer this question, we need to introduce the notion of entropy. For the physicists among you, this term will be familiar: entropy is an indicator of disorder. The higher it is, the messier things are. Here, it is measured from how the candidates are spread across the different classes after the test.
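As a sketch, the (Shannon) entropy of a node can be computed from the number of candidates in each class — a 50/50 split is maximally disordered, while a pure node has zero entropy:

```python
import math

def entropy(class_counts):
    """Shannon entropy of a node, from the number of candidates per class."""
    total = sum(class_counts)
    probabilities = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)

print(entropy([5, 5]))        # 1.0 — two classes of equal size: maximal disorder
assert entropy([10, 0]) == 0  # a pure node: no disorder at all
```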

For example, a test that splits a class into two non-final classes of the same size has high entropy, because many tests remain to be conducted. On the contrary, if all the objects end up in the same class, there is no disorder: we have arrived at a leaf, and everything is in its place. The construction is done recursively (this is called "top-down induction"). Initially, the tree is empty; the algorithm then selects the test that generates the minimum entropy and makes it the root.

The algorithm then turns to one of the children of this root and repeats the process to choose which test will be performed at that node, and so on. Finally, if all the remaining examples belong to the same class (and so entropy is zero), or if there is no test left to perform, we have arrived at a leaf and the algorithm finishes: your decision tree is ready, and the art of divination is yours.
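One greedy step — keeping the test whose children have minimum entropy — can be sketched on the hypothetical bird/fish example as follows:

```python
import math

def entropy(counts):
    """Shannon entropy of a node, from its per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def split_entropy(children):
    """Weighted entropy of the child nodes produced by one test."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

# Two hypothetical tests splitting 10 examples (6 birds, 4 fish);
# each child node is described by its (birds, fish) counts.
tests = {
    "Does it have feathers?": [(6, 0), (0, 4)],  # perfect separation
    "Is it blue?": [(3, 2), (3, 2)],             # tells us almost nothing
}

# The greedy step: keep the test that minimizes the remaining disorder.
best_test = min(tests, key=lambda t: split_entropy(tests[t]))
print(best_test)  # Does it have feathers?
```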

## The tree that hides the forest: the random forest algorithm

As you have seen, the tree is built step by step, choosing the best test at each stage. Such an approach is common in computer science and is called a "greedy algorithm". While this method is a satisfactory first naive approach, it can nevertheless be very (very!) far from the optimal result, notably because each test appears only once in the tree.

To overcome this problem, Leo Breiman and Adele Cutler proposed in 2001 the notion of random forests. The idea is to create a large number of decision trees randomly, from different subsets of the initial dataset. Considering different subsets is important: it reduces the risk of error, since the trees will be weakly correlated with one another. It also avoids the problem of overfitting, which occurs when a tree is too closely adapted to the sample it was built on.
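A minimal sketch of how bagging draws those random subsets (sampling with replacement; the dataset here is simply a list of example indices):

```python
import random

random.seed(0)  # for reproducibility of the sketch
dataset = list(range(100))  # indices of the learning examples

def bootstrap_sample(data):
    """Draw a random subset with replacement, as bagging does for each tree."""
    return [random.choice(data) for _ in data]

# Each tree of the forest gets its own subset, which is what keeps
# the trees weakly correlated with one another.
samples = [bootstrap_sample(dataset) for _ in range(3)]

# A bootstrap sample has the size of the original set but repeats some
# examples and omits others (about 63% distinct examples on average).
print(len(samples[0]), len(set(samples[0])))
```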

A single, fully developed tree would consider all the possible tests, with each leaf representing only one candidate. Such a tree is therefore 100% reliable on the sample that allowed its construction (since it takes every possible test into account), but it cannot be generalized to other samples.

The candidate is then tested on each of the trees of the forest (there can be more than a hundred of them). The interest of the forest is to take a majority vote over the results obtained, which reduces the margin of error that a single tree can have. The more trees there are, the more reliable the forest.

## Avoid Overfitting

Another reason why random forest is so effective is that it helps reduce overfitting, and this is also one of the reasons that motivated the use of several trees. Decision trees have an architecture that lets them fit the training data perfectly, which is precisely what is asked of a supervised learning algorithm. Nevertheless, beyond a certain point, the model becomes too influenced by the training data, which can introduce bias.

Using multiple trees provides greater flexibility and helps reduce this problem. It is the same principle as with a crowd of people: the psychology of crowds shows that a group can exhibit an intelligence stronger than that of any individual. This is, for example, why a concert audience tends to sing in tune even when most individual people sing out of tune — each person corrects the errors of their neighbors, just as the trees do. Be careful, though: it is still quite possible to overfit with random forest, more so than with methods like XGBoost.

## Random forest is an algorithm with strong interpretability

One of the great challenges of machine learning in recent years is the explainability (or interpretability) of models. There is a dilemma between the efficiency and the explainability of models: often, the most efficient models, such as deep neural networks, are the least explainable.

Random forest is a more than credible answer to this dilemma. Its performance is quite good, and techniques exist to interpret its results: for example, we can determine which features were decisive in obtaining a prediction. Random forest thus offers better transparency about how the training data is used.
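One such technique is permutation importance: shuffle one feature's values and measure how much the predictions degrade. A minimal sketch, with a toy model standing in for a trained forest (the model and data here are assumptions for illustration, not a real random forest):

```python
import random

random.seed(1)

# Toy stand-in for a trained forest: the prediction depends strongly on
# feature 0 and not at all on feature 1.
def model(x):
    return 3.0 * x[0]

data = [[random.random(), random.random()] for _ in range(200)]
baseline = [model(x) for x in data]

def permutation_importance(feature):
    """Mean prediction change when one feature's column is shuffled."""
    shuffled = [row[:] for row in data]
    column = [row[feature] for row in shuffled]
    random.shuffle(column)
    for row, value in zip(shuffled, column):
        row[feature] = value
    return sum(abs(model(x) - b) for x, b in zip(shuffled, baseline)) / len(data)

# The decisive feature gets a much larger importance score.
print(permutation_importance(0) > permutation_importance(1))  # True
```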

Finally, random forest has the advantage of using all of its initial data more intelligently in order to limit its errors. However, the algorithm can never be 100% reliable (that would require an infinity of trees), and it loses the simplicity of a single decision tree.

## Random Forest vs. Gradient Boosting

The gradient boosting machine (GBM) also combines an ensemble of trees, but unlike a classic random forest it adjusts its learning as training progresses: it uses the gradient of the loss function (gradient descent) to correct the errors of the previous trees and thus optimize the results of the model. Gradient boosting is particularly used in anomaly detection, a case in which the data is often unbalanced.
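A minimal sketch of the boosting idea for squared loss — each new weak learner fits the residual errors of the current model (here the "tree" is reduced to a single-leaf stump, an assumption made purely to keep the sketch short):

```python
targets = [3.0, 5.0, 7.0, 9.0]
prediction = [0.0] * len(targets)
learning_rate = 0.5

for step in range(20):
    # For squared loss, the negative gradient is simply the residual.
    residuals = [t - p for t, p in zip(targets, prediction)]
    # A real GBM fits a decision tree to these residuals; the mean is the
    # simplest such fit (a tree reduced to a single leaf).
    stump = sum(residuals) / len(residuals)
    prediction = [p + learning_rate * stump for p in prediction]

# The model gradually corrects its own errors toward the targets' mean, 6.0.
print(prediction)
```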

## ABOUT LONDON DATA CONSULTING (LDC)

We, at London Data Consulting (LDC), provide all sorts of Data Solutions. This includes Data Science (AI/ML/NLP), Data Engineering, Data Architecture, Data Analysis, CRM & Leads Generation, Business Intelligence and Cloud solutions (AWS/GCP/Azure).