One of the trickiest obstacles in machine learning is overfitting. A machine learning model must be able to generalize: given new observations, it must predict the associated target variable with reasonable accuracy.

Sometimes this is not the case, either because the model is too simple and ignores important factors (underfitting), or because it is too complex and begins to overfit the data. There are several methods to detect and avoid overfitting, including cross-validation and hyperparameter tuning.

## What is Overfitting?

When machine learning algorithms are built, they use a sample of data to train the model. However, when the model trains for too long on the sample data or when the model is too complex, it can start to learn "noise", or irrelevant information, in the dataset. When the model memorizes the noise and fits the training set too tightly, it becomes "overfitted" and is unable to generalize well to new data. If a model cannot generalize well to new data, it will not be able to perform the classification or prediction tasks for which it was intended.

Ideally, a model that makes predictions with a low error on both the training data and new data is said to have a good fit. This situation lies somewhere between underfitting and overfitting. To better understand, we will take the example of a linear regression model with polynomial features.

First, we will generate randomized data from a cosine function, on which we will train a linear regression model.

```
import numpy as np


def true_fun(X):
    return np.cos(1.3 * np.pi * X)
```

$X$ represents the abscissas (x-coordinates) of the points in the dataset and $y$ represents the ordinates (y-coordinates).

```
np.random.seed(0)

n_samples = 20
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
```

If we visualize the data obtained using the matplotlib library, we end up with the following point cloud.

```
import matplotlib.pyplot as plt

plt.plot(X, y, color='red', label="Actual")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.legend(loc="best")
plt.show()
```

We then want to find a mathematical function $f$ that maps inputs to outputs, so that the value $f(x)$ for an observation $x$ is the model's prediction of the response $y$. We train this model on the generated data, varying the degree of the polynomial function $f$ each time and visualizing the results.

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Polynomial degrees to compare (the values discussed below)
degrees = [1, 3, 12]

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    # Expand x into polynomial features, then fit an ordinary linear regression
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([
        ("polynomial_features", polynomial_features),
        ("linear_regression", linear_regression),
    ])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the fitted model on a dense grid to draw its curve
    X_test = np.linspace(0, 1, 100)
    y_poly_pred = pipeline.predict(X_test[:, np.newaxis])
    plt.plot(X_test, y_poly_pred, label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}".format(degrees[i]))
plt.show()
```

**Output:**

It can be seen that a linear function (polynomial with degree 1) is not sufficient to fit the training samples. This is called underfitting.

A polynomial of degree 3 approximates the true function almost perfectly. This is called a good fit: the model neither underfits nor overfits.

However, for higher degrees (here, degree 12), the model overfits the training dataset, i.e. it learns the noise in the training data. In this case the model learns the elements of the dataset by heart and bends to pass through every extreme point. It will not be able to generalize.
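The visual impression above can be backed by numbers. The following is a minimal sketch, assuming the same cosine function and noise level as in the example; the separate sample `X_new` and the names `errors`, `train_mse` and `new_mse` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(X):
    return np.cos(1.3 * np.pi * X)


rng = np.random.RandomState(0)

# 20 noisy training points, as in the example above
X_train = np.sort(rng.rand(20))
y_train = true_fun(X_train) + rng.randn(20) * 0.1

# Fresh points from the same process (assumption: they stand in for "new data")
X_new = np.sort(rng.rand(200))
y_new = true_fun(X_new) + rng.randn(200) * 0.1

errors = {}
for degree in (1, 3, 12):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train[:, np.newaxis], y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train[:, np.newaxis]))
    new_mse = mean_squared_error(y_new, model.predict(X_new[:, np.newaxis]))
    errors[degree] = (train_mse, new_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, new-data MSE = {new_mse:.4f}")
```

An overfitted model typically shows a very small training error next to a much larger error on the new points, while an underfitted one is poor on both.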

## How To Detect Overfitting?

Overfitting is almost impossible to detect before testing the model with the test data. To do this, we can split the dataset into two subsamples:

- A **training set (train set)**, on which the model will learn.
- A **test set**, on which we will evaluate the performance of the model.

Overfitting can be detected by monitoring model performance on training and test data over time. If the model’s performance on the training data continues to improve while that on the test data declines, this indicates overfitting.

For example, suppose the model reaches 99% accuracy on the training dataset but only 50-55% accuracy on the test dataset. The significant difference between these two scores indicates that overfitting has taken place.

Typically, the test set contains about 30% of the dataset and the training set the rest (provided there is enough data). You may also come across a validation set, less used in practice, which makes it possible to compare several models on the same sample.
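As a rough illustration of this check, the sketch below holds out 30% of a synthetic dataset and compares the two accuracies. The dataset, the 1-nearest-neighbour classifier and the variable names are assumptions chosen only because such a model memorizes its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; flip_y=0.2 assigns a random label to about 20% of the samples
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# Keep 30% of the observations aside as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A 1-nearest-neighbour model recalls every training point exactly
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
print(f"gap = {train_acc - test_acc:.2f}")
```

A large gap between the two scores is the warning sign described above.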

Another easy way to detect overfitting is to use cross-validation. There are several cross-validation methods that effectively measure the performance of the model on unseen observations and therefore help judge whether the model is overfitted or not. These methods share the same fundamental principles, but each one is adapted to specific situations. Examples include K-Fold, Shuffle Split and Leave-One-Out.
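The three strategies named above are available in scikit-learn as `KFold`, `ShuffleSplit` and `LeaveOneOut`. Here is a minimal sketch on a synthetic regression problem; the data, the number of splits and the dictionary names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, cross_val_score

X, y = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)

splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "Shuffle Split": ShuffleSplit(n_splits=10, test_size=0.3, random_state=0),
    "Leave-One-Out": LeaveOneOut(),
}

cv_mse = {}
n_rounds = {}
for name, cv in splitters.items():
    # scikit-learn reports negated MSE so that "higher is better"
    scores = cross_val_score(
        LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    n_rounds[name] = len(scores)
    print(f"{name}: {n_rounds[name]} rounds, mean MSE = {cv_mse[name]:.1f}")
```

Each round trains on one part of the data and scores on the part left out, so the averaged error estimates performance on observations the model has not seen.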

## What Are The Possible Reasons For Overfitting?

There are several possible reasons for overfitting, including:

- When the model is too complex relative to the quantity and quality of the training data, it may learn details and noise from the training data instead of generalizing important relationships.
- If the training data is not representative of the target population, the model may learn the peculiarities of the training data instead of generalizing to the target population.
- Models can overfit when they have too many parameters relative to the amount of training data.
- The training dataset may simply be too small to train the model, in which case more data is needed.

## How To Avoid Overfitting?

There are several methods to avoid overfitting in Machine Learning. Here are some examples:

## 1. Feature Engineering

If one has only a limited number of training samples, each with a large number of features, one should select only the most important features, so that the model does not have to learn from so many features and end up overfitting. One can simply test different features, train individual models on them and assess their generalization capabilities, or use one of the widely used feature selection methods.
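One possible way to apply a feature selection method is sketched below with scikit-learn's `SelectKBest`. The synthetic dataset (60 samples, 200 features), the choice of `k=10` and the logistic regression model are assumptions made for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Few samples, many features, only 5 of which carry signal
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=5, n_redundant=0, random_state=0
)

all_features = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Selection sits inside the pipeline so it is refitted on each training fold
selected = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),
    LogisticRegression(max_iter=1000),
)

score_all = cross_val_score(all_features, X, y, cv=5).mean()
score_sel = cross_val_score(selected, X, y, cv=5).mean()
print(f"all 200 features: {score_all:.2f}")
print(f"10 best features: {score_sel:.2f}")
```

Keeping the selection step inside the pipeline matters: choosing features on the full dataset before cross-validating would leak information from the test folds and make the score look better than it is.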

Feature engineering can also help prevent overfitting: for example, one can handle outliers, impute missing values, and normalize the data. In this way, we simplify the data as much as possible, improve the performance of the model, and reduce the risk of overfitting.

## 2. Early Stopping

This method simply consists of stopping training when the performance of the model on the validation set begins to decline.

This technique also makes it possible to detect when the model used is not suitable: if the model begins to overfit while its performance is still too low, we must change the method.
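Some libraries offer early stopping out of the box. The sketch below uses scikit-learn's `GradientBoostingClassifier`, whose `validation_fraction` and `n_iter_no_change` parameters set aside part of the training data and stop adding trees once the validation score stops improving. The dataset and the specific values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

max_rounds = 500  # upper bound on the number of boosting rounds
model = GradientBoostingClassifier(
    n_estimators=max_rounds,
    validation_fraction=0.2,  # 20% of the training data is used for validation
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X_train, y_train)

rounds_used = model.n_estimators_  # number of trees actually fitted
test_acc = model.score(X_test, y_test)
print(f"stopped after {rounds_used} of {max_rounds} rounds")
print(f"test accuracy = {test_acc:.2f}")
```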

## 3. Choice of Hyperparameters

It is important to choose hyperparameters that suit the input data and the learning objectives. Poorly chosen hyperparameters can cause the model to overfit the training data. For example, nonparametric models such as decision trees carry a significant risk of overfitting, but this can be mitigated by limiting the size of the trees through the `max_depth` hyperparameter.
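The effect of `max_depth` can be observed with a short experiment. This is a sketch on synthetic data; the depth limit of 3 is an arbitrary value for illustration, and in practice it would be tuned, for instance by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for max_depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    results[max_depth] = (
        tree.score(X_train, y_train),
        tree.score(X_test, y_test),
        tree.get_depth(),
    )
    print(
        f"max_depth={max_depth}: depth={results[max_depth][2]}, "
        f"train={results[max_depth][0]:.2f}, test={results[max_depth][1]:.2f}"
    )
```

The unrestricted tree fits the training set perfectly, noise included, whereas the shallow tree cannot memorize individual points.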

## 4. Add Training Data

A larger dataset reduces overfitting. If we cannot collect more data, we can artificially increase the size of the dataset (data augmentation). For example, when training a model for an image classification task, one can apply various transformations to the images (flip, rotate, resize, shift).
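A few of these transformations can be sketched with plain NumPy. The helper `augment` and the random arrays standing in for images are hypothetical; a real project would more likely rely on the augmentation utilities of its deep learning framework, and resizing is left out here for brevity.

```python
import numpy as np


def augment(image, rng):
    """Return flipped, rotated and shifted copies of a square 2-D image."""
    shift = rng.integers(1, 4)  # horizontal shift of 1 to 3 pixels
    return [
        np.fliplr(image),               # horizontal flip
        np.flipud(image),               # vertical flip
        np.rot90(image),                # 90-degree rotation
        np.roll(image, shift, axis=1),  # shift; pixels wrap around the edge
    ]


rng = np.random.default_rng(0)
images = rng.random((8, 16, 16))     # 8 fake grayscale images of 16x16 pixels
labels = rng.integers(0, 2, size=8)  # fake binary labels

aug_images = list(images)
aug_labels = list(labels)
for image, label in zip(images, labels):
    variants = augment(image, rng)
    aug_images.extend(variants)
    aug_labels.extend([label] * len(variants))  # the label is unchanged

aug_images = np.stack(aug_images)
aug_labels = np.array(aug_labels)
print(images.shape, "->", aug_images.shape)
```

Only transformations that preserve the label should be used: flipping a cat photo still shows a cat, but flipping a handwritten "6" vertically may not.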

## 5. Ensemble Learning

Ensemble learning is a machine learning concept in which the idea is to train multiple models using the same learning algorithm. The term "ensemble" refers to a combination of individual models that together form a stronger and more powerful model. It can involve hundreds or thousands of learners with a common goal coming together to solve a problem. This method helps avoid overfitting and improves the models' performance and ability to generalize.

Two of the most common ensemble learning methods are Boosting and Bagging. Several machine learning models are based on this type of learning, for example XGBoost, LightGBM and Gradient Boosting.
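To make the comparison concrete, the sketch below scores a single decision tree against a bagged ensemble of trees and a boosted ensemble, using scikit-learn's `BaggingClassifier` and `GradientBoostingClassifier`. The dataset and the ensemble sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: 50 trees, each trained on a bootstrap sample, then majority vote
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
    ),
    # Boosting: shallow trees added one after another to correct earlier errors
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {cv_scores[name]:.2f}")
```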

## 6. Regularization Methods

Regularization methods are techniques that reduce the overall complexity of a machine learning model. One of the most popular examples of these methods is penalized regression (e.g. **L1/Lasso** and **L2/Ridge** regularization).

Penalized regression forces the model to strike a balance between performance and complexity. A penalty term added to the loss function shrinks the weight assigned to each variable; with L1/Lasso, some weights can be driven exactly to zero, which amounts to variable selection built into the model.
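The difference between the two penalties shows up in the fitted coefficients. The following sketch compares scikit-learn's `Lasso` and `Ridge` with an unpenalized `LinearRegression`; the synthetic data and the value `alpha=1.0` are illustrative assumptions, and in practice `alpha` would be tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# 30 features, of which only 5 actually influence the target
X, y = make_regression(
    n_samples=50, n_features=30, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)  # penalties assume comparable scales

ols = LinearRegression().fit(X, y)  # no penalty
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: alpha * sum(|w|)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: alpha * sum(w ** 2)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Lasso sets {n_zero_lasso} of 30 coefficients exactly to zero")
print(f"Ridge sets {n_zero_ridge} of 30 coefficients exactly to zero")
print(f"coefficient norm, OLS:   {np.linalg.norm(ols.coef_):.1f}")
print(f"coefficient norm, Ridge: {np.linalg.norm(ridge.coef_):.1f}")
```

Lasso discards variables outright, while Ridge keeps all of them but with smaller weights.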

## 7. The Dropout

In the case of a neural network, we can apply dropout, a form of regularization, to the layers of the model. It consists of randomly ignoring a subset of network units during training. By using dropout, one can reduce the interdependent learning between network units that may lead to overfitting. However, with dropout, the model needs more iterations to converge.
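The mechanism can be illustrated without a deep learning framework. The function below is a hypothetical NumPy sketch of the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at inference time; real frameworks provide their own dropout layers.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the units (inverted dropout)."""
    if not training or rate == 0.0:
        return activations  # at inference time every unit is kept
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))  # fake activations of a hidden layer

dropped = dropout(hidden, rate=0.5, rng=rng)
inference = dropout(hidden, rate=0.5, rng=rng, training=False)

print("fraction of units dropped:", np.mean(dropped == 0))
print("mean activation in training:", dropped.mean())
print("mean activation at inference:", inference.mean())
```

Because a different random mask is drawn at every training step, no unit can rely on the presence of any particular other unit.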

One can also reduce the complexity of the model by removing layers and reducing its size by decreasing the number of neurons in fully connected layers.
