Gradient Boosting Algorithm: A Complete Guide for Beginners


This article was published as a part of the Data Science Blogathon


Introduction

In this article, I am going to discuss the math intuition behind the Gradient Boosting algorithm, more popularly known as the Gradient Boosting Machine (GBM). It is a boosting method, and I have talked more about boosting in a separate article.

Gradient boosting is a method that stands out for its prediction speed and accuracy, particularly with large and complex datasets. From Kaggle competitions to machine learning solutions for business, this algorithm has produced some of the best results. We already know that errors play a major role in any machine learning algorithm. There are mainly two types of error: bias error and variance error. The gradient boosting algorithm helps us minimize the bias error of the model.

Before getting into the details of this algorithm, we should know a little about the AdaBoost algorithm, which is also a boosting method. AdaBoost starts by assigning equal weights to all the data points and building a decision stump. It then increases the weights of the points that are misclassified and lowers the weights of those that are easy to classify or are correctly classified. A new decision stump is built on these re-weighted data points. The idea behind this is to improve the predictions made by the first stump. I have talked more about AdaBoost in a separate article; reading it first will give you a better understanding.

The main difference between the two algorithms is that Gradient Boosting has a fixed base estimator, i.e., decision trees, whereas in AdaBoost we can change the base estimator according to our needs.

Table of Contents

  1. Gradient Boosting Algorithm
  2. Gradient Boosting Regressor
  3. Example of gradient boosting
  4. Gradient Boosting Classifier
  5. End Notes

What is the Gradient Boosting Algorithm?

The main idea behind this algorithm is to build models sequentially, where each subsequent model tries to reduce the errors of the previous model. But how do we do that? How do we reduce the error? This is done by building a new model on the errors, or residuals, of the previous model.

When the target column is continuous, we use the Gradient Boosting Regressor, whereas for a classification problem we use the Gradient Boosting Classifier. The only difference between the two is the loss function. The objective is to minimize this loss function by adding weak learners using gradient descent. Because the method is built around the loss function, regression problems use losses such as mean squared error (MSE), while classification problems use losses such as the log-likelihood.
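As a quick illustration (a minimal sketch using scikit-learn, not the author's original code), the two variants differ mainly in the estimator class and its loss parameter:

```python
# Minimal sketch with scikit-learn (assumed installed); the loss names below follow
# recent scikit-learn versions and may differ in older releases.
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Regression: minimizes the squared error by default
reg = GradientBoostingRegressor(loss="squared_error", learning_rate=0.1, n_estimators=100)

# Binary classification: minimizes the log-likelihood (log loss) by default
clf = GradientBoostingClassifier(loss="log_loss", learning_rate=0.1, n_estimators=100)

# reg.fit(X_train, y_train); clf.fit(X_train, y_train)  # hypothetical training data
```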

Understanding the Gradient Boosting Algorithm with an Example

Let’s understand the intuition behind Gradient boosting with the help of an example. Here our target column is continuous, hence we will use the Gradient Boosting Regressor.

Below is a sample from a dataset where we have to predict the car price based on various features. The target column is price, and the remaining columns are independent features.

[Image: sample dataset of car features with the target column "price"]

Image Source: Author

Step 1: The first step in gradient boosting is to build a base model to predict the observations in the training dataset. For simplicity, we take the average of the target column and assume that to be the predicted value, as shown below:

[Image: dataset with a new column of base-model predictions, all equal to the average price]

Image Source: Author

Why did I say we take the average of the target column? Well, there is math behind this. Mathematically, the first step can be written as:

$$F_0(x) = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^{n} L(y_i, \gamma)$$

Looking at this may give you a headache, but don’t worry, we will work through what is written here.

Here L is our loss function,

gamma (γ) is our predicted value, and

arg min means we have to find the value of gamma for which the loss function is minimum.

Since the target column is continuous, our loss function will be the squared error (with a factor of ½ so that the derivative comes out cleanly):

$$L = \sum_{i=1}^{n} \frac{1}{2}\,(y_i - \gamma)^2$$

Here y_i is the observed value

and gamma (γ) is the predicted value.

Now we need to find the value of gamma for which this loss function is minimum. We all studied how to find minima and maxima in 12th grade: we differentiate the function and set the derivative equal to 0, right? Yes, we will do the same here.

$$\frac{\partial L}{\partial \gamma} = -\sum_{i=1}^{n}(y_i - \gamma)$$

Let’s see how to do this with the help of our example. Remember that y_i is our observed value and gamma is our predicted value; by plugging the values into the above formula we get:

$$\sum_{i=1}^{n}(y_i - \gamma) = 0 \;\Rightarrow\; \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i = 14500$$

We end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.

Hence for gamma = 14500 the loss function is minimum, so this value becomes the prediction of our base model.
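In code, this first step is just the mean of the target column. Here is a small sketch with hypothetical prices (not the article’s actual data), chosen only so that the average matches the 14500 used above:

```python
import numpy as np

# Hypothetical target column of car prices (not the article's actual data)
y = np.array([12000.0, 14000.0, 15000.0, 17000.0])

# Step 1: the base model F0 predicts a single constant, the value of gamma
# that minimizes sum(1/2 * (y_i - gamma)^2) -- which is simply the mean.
F0 = y.mean()
print(F0)  # 14500.0 for these hypothetical prices
```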

Step 2: The next step is to calculate the pseudo-residuals, which are (observed value − predicted value).

[Image: dataset with a new "Residuals" column, equal to the observed price minus the predicted 14500]

Image Source: Author

Again the question arises: why only observed − predicted? Everything here is mathematically grounded, so let’s see where this formula comes from. This step can be written as:

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \quad \text{for } i = 1, \ldots, n$$

Here F(x_i) is the prediction of the previous model and m is the index of the decision tree (DT) being built.

We are just taking the derivative of the loss function w.r.t. the predicted value, and we have already calculated this derivative:

$$\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} = -\big(y_i - F(x_i)\big)$$

If you look at the formula for the residuals above, the derivative of the loss function is multiplied by a negative sign, so we get:

$$r_{im} = y_i - F_{m-1}(x_i) = \text{observed value} - \text{predicted value}$$

The predicted value here is the prediction made by the previous model. In our example, the prediction made by the previous model (the initial base-model prediction) is 14500, so to calculate the residuals our formula becomes:

$$r_{i1} = y_i - 14500$$

Step 3: In the next step, we will build a model on these pseudo-residuals and make predictions. Why do we do this? Because we want to minimize these residuals, and minimizing the residuals will eventually improve the model’s accuracy and predictive power. So, using the residuals as the target and the original features (cylinder number, cylinder height, and engine location), we will generate new predictions. Note that the predictions in this case will be the error values, not the predicted car prices, since our target column is now the error.

Let’s say hm(x) is our DT made on these residuals.
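A rough sketch of steps 2 and 3, continuing the hypothetical numbers from above (the feature matrix X below is made up purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features and prices (not the article's data):
# columns could be, e.g., number of cylinders, car height, engine location (encoded).
X = np.array([[4, 48.8, 0],
              [6, 50.2, 0],
              [4, 52.0, 1],
              [8, 54.3, 1]])
y = np.array([12000.0, 14000.0, 15000.0, 17000.0])

F0 = y.mean()                        # step 1: base prediction (14500 here)
residuals = y - F0                   # step 2: pseudo-residuals = observed - predicted
h1 = DecisionTreeRegressor(max_depth=2)
h1.fit(X, residuals)                 # step 3: tree h1(x) trained to predict the residuals
```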

Step 4: In this step we find the output value for each leaf of our decision tree. There might be a case where one leaf gets more than one residual, so we need to find the final output for every leaf. To find the output, we can simply take the average of all the residuals in a leaf, regardless of whether the leaf contains one residual or several.

Let’s see why we take the average of all the numbers. Mathematically, this step can be represented as:

$$\gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^{n} L\big(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$$

Here h_m(x_i) is the DT made on the residuals and m is the index of the DT: when m = 1 we are talking about the 1st DT, and when m = M we are talking about the last DT.

The output value for a leaf is the value of gamma that minimizes the loss function. On the left-hand side, gamma is the output value of a particular leaf. On the right-hand side, F_{m-1}(x_i) + γ·h_m(x_i) is similar to step 1, but the difference is that now we take the previous prediction into account, whereas earlier there was no previous prediction.

Let’s understand this even better with the help of an example. Suppose this is our regressor tree:

[Image: an example regression tree built on the residuals, with leaves R1,1, R2,1 and R3,1]

Image Source

We see that the 1st residual goes into R1,1, the 2nd and 3rd residuals go into R2,1, and the 4th residual goes into R3,1.

Let’s calculate the output for the first leaf, that is R1,1:

$$\gamma_{1,1} = \underset{\gamma}{\operatorname{arg\,min}}\; \frac{1}{2}\big(y_1 - (14500 + \gamma)\big)^2$$

Now we need to find the value of gamma for which this function is minimum, so we take the derivative of this equation w.r.t. gamma and set it equal to 0:

$$-\big(y_1 - 14500 - \gamma\big) = 0 \;\Rightarrow\; \gamma = y_1 - 14500 = -2500$$

Hence the leaf R1,1 has an output value of −2500. Now let’s solve for R2,1.

Again we take the derivative and set it to 0 to get the value of gamma for which this function is minimum:

$$\frac{d}{d\gamma}\left[\frac{1}{2}\big(y_2 - (14500 + \gamma)\big)^2 + \frac{1}{2}\big(y_3 - (14500 + \gamma)\big)^2\right] = 0 \;\Rightarrow\; \gamma = \frac{(y_2 - 14500) + (y_3 - 14500)}{2}$$

We end up with the average of the residuals in the leaf R2,1. Hence, whenever a leaf contains more than one residual, we can simply take the average of that leaf and that will be our final output.
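Because the tree is a regression tree trained with squared error, each leaf already stores the mean of the residuals that fall into it. Continuing the earlier sketch (same hypothetical X, residuals and tree h1), the snippet below just makes that grouping explicit:

```python
# Continuing the sketch above: each leaf's output is the mean of the residuals
# that land in it, which is exactly what DecisionTreeRegressor stores as leaf values.
leaf_ids = h1.apply(X)               # leaf index for every training row
for leaf in np.unique(leaf_ids):
    print(leaf, residuals[leaf_ids == leaf].mean())
```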

Now after calculating the output of all the leaves, we get:

[Image: the regression tree with the final output value shown for each leaf]
Image Source: Author

Step 5: This is the last step, where we update the predictions of the previous model. The update can be written as:

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$

where m is the index of the decision tree being added.

Since we have just started building our model, m = 1. The new predictions, which we will also use to build the next DT, are:

$$F_1(x) = F_0(x) + \nu\, h_1(x)$$


Here F_{m-1}(x) is the prediction of the previous model; since m = 1, F_{m-1}(x) = F_0(x), our base model, so the previous prediction is 14500.

nu (ν) is the learning rate, usually chosen between 0 and 1. It reduces the effect each tree has on the final prediction, which improves accuracy in the long run. Let’s take nu = 0.1 in this example.

h_m(x) is the most recent DT made on the residuals.

Let’s calculate the new prediction now:

[Image: dataset with the updated predictions F1(x) = 14500 + 0.1 × (leaf output) for each row]

Image Source: Author

Suppose we want to find the prediction for our first data point, which has a car height of 48.8. This data point will go through the decision tree, and the output it gets will be multiplied by the learning rate and then added to the previous prediction.

Now let’s say m = 2, which means we have built 2 decision trees and want to make new predictions.

This time we will add ν times the predictions of the new DT (built on the updated residuals) to the previous prediction, F1(x). We will iterate through these steps again and again till the loss becomes negligible.
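Putting steps 1–5 together, a bare-bones gradient boosting regressor looks roughly like this. This is a from-scratch sketch under the same assumptions as the snippets above, not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_simple_gbm(X, y, n_trees=2, nu=0.1, max_depth=2):
    """Fit a bare-bones gradient boosting regressor; return (base prediction, trees)."""
    base = y.mean()                         # step 1: F0 = average of the target
    F = np.full(len(y), base)
    trees = []
    for m in range(n_trees):
        residuals = y - F                   # step 2: pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)              # steps 3-4: leaves output the mean residual
        F = F + nu * tree.predict(X)        # step 5: F_m = F_{m-1} + nu * h_m
        trees.append(tree)
    return base, trees

def predict_simple_gbm(X_new, base, trees, nu=0.1):
    """A new point passes through every tree; each output is scaled by nu and summed."""
    pred = np.full(len(X_new), base, dtype=float)
    for tree in trees:
        pred = pred + nu * tree.predict(X_new)
    return pred
```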

I am taking a hypothetical example here just to help you understand how the model predicts for new data:

[Image: the two decision trees used to predict for a new data point]
Image Source: Author

If a new data point, say with height = 1.40, comes in, it will go through all the trees and then give the prediction. Here we have only 2 trees, so the data point will go through these 2 trees and the final output will be F2(x).

What is Gradient Boosting Classifier?

A gradient boosting classifier is used when the target column is binary. All the steps explained for the Gradient Boosting Regressor are used here; the only difference is that we change the loss function. Earlier we used mean squared error when the target column was continuous, but this time we will use the log-likelihood as our loss function.

Let’s see how this loss function works. To read more about log-likelihood, I recommend going through my separate article on it, where I have given all the details you need.

The loss function for the classification problem is given below:

$$L = -\sum_{i=1}^{n} \Big[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \Big]$$

Our first step in the gradient boosting algorithm was to initialize the model with some constant value. There we used the average of the target column, but here we’ll use log(odds) to get that constant value. The question is: why log(odds)?

When we differentiate this loss function, we get a function of log(odds), and then we need to find the value of log(odds) for which the loss function is minimum.

Confused, right? Okay, let’s see how it works:

Let’s first transform this loss function so that it is a function of log(odds); I’ll tell you later why we do this transformation.

$$L = -\sum_{i=1}^{n} \Big[ y_i \log(p_i) - y_i \log(1 - p_i) + \log(1 - p_i) \Big]$$

$$L = -\sum_{i=1}^{n} \left[ y_i \log\!\left(\frac{p_i}{1 - p_i}\right) + \log(1 - p_i) \right]$$

$$L = -\sum_{i=1}^{n} \Big[ y_i \log(\text{odds}) - \log\!\big(1 + e^{\log(\text{odds})}\big) \Big]$$

Now this is our loss function, and we need to minimize it. For this, we take the derivative w.r.t. log(odds) and set it equal to 0:

$$\frac{\partial L}{\partial \log(\text{odds})} = -\sum_{i=1}^{n} y_i + n \cdot \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = 0 \;\Rightarrow\; \log(\text{odds}) = \log\!\left(\frac{\bar{y}}{1 - \bar{y}}\right)$$

Here y_i are the observed values and ȳ is their average, i.e., the proportion of 1s.

You must be wondering why we transformed the loss function into a function of log(odds). Actually, sometimes it is easier to work with the function of log(odds), and sometimes it’s easier to work with the function of the predicted probability “p”.

It is not compulsory to transform the loss function; we did this just to make the calculations easier.

Hence the value of log(odds) that minimizes this loss function will be our first prediction (the base model prediction).
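As a small numeric sketch (hypothetical labels, not from the article), the base prediction of the classifier is just the log(odds) of the positive class:

```python
import numpy as np

y = np.array([1, 1, 0, 1])       # hypothetical binary target (not the article's data)
p = y.mean()                     # overall probability of class 1
F0 = np.log(p / (1 - p))         # log(odds): the constant that minimizes the loss
print(F0)                        # log(0.75 / 0.25) is about 1.0986
```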

Now, in the Gradient Boosting Regressor our next step was to calculate the pseudo-residuals, where we multiplied the derivative of the loss function by −1. We will do the same here, but now the loss function is different and we are dealing with the probability of an outcome.

After finding the residuals, we can build a decision tree with all the independent variables and the “Residuals” as the target variable.

Now that we have our first decision tree, we find the final output of the leaves, because there might be a case where a leaf gets more than one residual, so we need to calculate the final output value. The math behind this step is out of the scope of this article, so I will just mention the direct formula to calculate the output of a leaf:

$$\gamma_j = \frac{\sum_{i \in R_j} r_i}{\sum_{i \in R_j} p_i\,(1 - p_i)}$$

where r_i are the residuals in leaf R_j and p_i are the probabilities predicted by the previous model.

Finally, we are ready to get new predictions by adding the new tree we made on the residuals to our base model.
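Tying the classifier steps together, here is a rough from-scratch sketch in the same spirit as the regression sketch earlier (hypothetical data; the leaf outputs use the residual-sum over p(1 − p) formula mentioned above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

def fit_simple_gbm_classifier(X, y, n_trees=2, nu=0.1, max_depth=2):
    """Fit a bare-bones gradient boosting classifier in log(odds) space."""
    base = np.log(y.mean() / (1 - y.mean()))          # base prediction: log(odds)
    F = np.full(len(y), base)
    trees, leaf_outputs = [], []
    for m in range(n_trees):
        p = sigmoid(F)                                # current predicted probabilities
        residuals = y - p                             # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        leaves = tree.apply(X)
        # leaf output: sum of residuals / sum of p * (1 - p), computed per leaf
        gamma = {leaf: residuals[leaves == leaf].sum()
                       / (p[leaves == leaf] * (1 - p[leaves == leaf])).sum()
                 for leaf in np.unique(leaves)}
        F = F + nu * np.array([gamma[leaf] for leaf in leaves])   # update log(odds)
        trees.append(tree)
        leaf_outputs.append(gamma)
    return base, trees, leaf_outputs
```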

There are a few variations of gradient boosting, and a couple of them are briefly covered in the next article.

End Notes

I hope you got an understanding of how the Gradient Boosting algorithm works under the hood. I have tried to show you the math behind it in the easiest way possible.

In the next article, I will explain Extreme Gradient Boosting (XGBoost), which is again a technique to combine various models and improve the accuracy score. It is an extension of the gradient boosting algorithm.

Let me know if you have any queries in the comments below.

About the Author

I am an undergraduate student currently in my last year majoring in Statistics (Bachelor of Statistics) and have a strong interest in the fields of data science, machine learning, and artificial intelligence. I enjoy diving into data to discover trends and other valuable insights. I am constantly learning and motivated to try new things.

I am open to collaboration and work.

For any doubts and queries, feel free to contact me by email.

Connect with me on LinkedIn and Twitter

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Source: https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/
