Ensemble Method — Boosting

Ali Mahzoon
5 min read · Jul 6, 2021

Boosting (originally called hypothesis boosting) refers to any ensemble method that combines several weak learners into a strong learner, with each learner trying to correct the mistakes of the ones before it.

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular ones are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.

Bagging vs. Boosting

In the picture below, the single model is a decision tree: you train just one model on your data, and that's it.

With Bagging, we train many different trees separately and then aggregate their outputs (a voting or majority-class method).

But in Boosting, as we can see, we run the data through a single model (a decision tree), take the output from that model, and then retrain our model over and over again based on the output of the previous model. This is why boosting is sequential model training.

Single, Bagging, and Boosting

Boosting Models

- Boosting is a statistical framework where the objective is to minimize the loss of the model by adding weak learners using a gradient descent-like procedure.

- This allows different loss functions to be used, expanding the technique beyond binary classification problems to support regression, multi-class classification, and more.

Boosted algorithms exist for trees, but boosting is also a general procedure that other classifiers can take part in. For example, the default base estimator for AdaBoost is a decision tree, but we can change it and plug in another model instead.
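As a hedged illustration (the estimator choices, dataset, and hyperparameters below are my own, not from the article), swapping the base estimator in scikit-learn looks like this; the replacement model must accept sample weights in fit, and the parameter is named estimator in recent releases (base_estimator in older ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Default-style AdaBoost: decision stumps as the weak learners.
ada_tree = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "estimator" in scikit-learn >= 1.2; "base_estimator" in older versions
    n_estimators=100,
    random_state=42,
)
ada_tree.fit(X, y)

# Same boosting procedure, different weak learner (it must accept sample_weight in fit).
ada_logreg = AdaBoostClassifier(
    estimator=LogisticRegression(max_iter=1000),
    n_estimators=100,
    random_state=42,
)
ada_logreg.fit(X, y)
```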

AdaBoost — Intuition

The intuition behind AdaBoost specifically is: you tell me what did not work, and I will focus more on those misclassified points.

As we can see in the following picture (top to bottom, left to right), the default weak learner is a decision tree with a max depth of one, so you only get one split. After the split, there are three misclassified points (two minuses and one plus). For the next model, AdaBoost weighs these points more heavily, so the next classifier focuses on them when it is retrained. Now the previously misclassified points are classified correctly, but three new minuses are misclassified. The algorithm keeps repeating this process, refocusing on whichever points are currently misclassified, and then aggregates the results.

How does AdaBoost work?

AdaBoost is an ensemble method used to solve classification problems.

1- Initialize the weight of each of the observations

2- Fit a base classifier like a decision tree

3- Use that classifier to make predictions on the training set

4- Increase the relative weight of the instances that were misclassified

5- Train another classifier using the updated weight

6- Aggregate all of the classifiers into one, weighting them by their accuracy

Note: the first classifier is trained using a random subset of the overall training set. A misclassified item is assigned a higher weight so that it is more likely to appear in the training subset of the next classifier.
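To make these six steps concrete, here is a minimal from-scratch sketch of binary (discrete) AdaBoost. It assumes labels encoded as -1/+1 and uses decision stumps as the weak learner, and it re-weights the full training set each round rather than resampling a subset, which is the other common implementation choice; none of this code comes from the original article.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_estimators=50):
    """Minimal discrete AdaBoost; labels y must be -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # 1- initialize uniform instance weights
    stumps, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)      # 2- fit a weak base classifier
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)                          # 3- predict on the training set
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)            # classifier weight from its accuracy
        w *= np.exp(-alpha * y * pred)                   # 4- increase weight of misclassified instances
        w /= w.sum()
        stumps.append(stump)                             # 5- the next round trains on the updated weights
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # 6- aggregate all classifiers, each weighted by its accuracy-based alpha
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)
```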

Data Preparation for AdaBoost

This section lists some heuristics for best preparing your data for AdaBoost.

- Quality Data: Because the ensemble method keeps trying to correct misclassifications in the training data, you need to be careful that the training data is of high quality (no incorrect raw data!).

- Outliers: Outliers will force the ensemble down the rabbit hole of trying to correct for unrealistic cases. These could be removed from the training dataset.

- Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these cases from your training dataset.

AdaBoost Summary

- AdaBoost trains classifiers sequentially.

- Each new classifier tries to improve on its predecessor by looking at the misclassified instances.

- The algorithm weights misclassified instances more heavily so they are more likely to be included in the next training subset.

- Each classifier is weighted based on its accuracy, and then all are aggregated to create the final classifier.

- AdaBoost does not perform well on very noisy data or data with outliers.

- AdaBoost can use any type of classification model, not just a decision tree.

Gradient Boosting

Just like AdaBoost, Gradient Boosting trains trees sequentially, gradually building an ensemble of trees that models the underlying behavior of our data.

Unlike AdaBoost, Gradient Boosting looks at the residuals of the previous tree and fits the next tree to predict those residuals.

Gradient Boosting Over Iterations

Because we are regressing on the residuals, the ensemble gradually becomes more granular and more precise at fitting the line to the data. Over time, as more predictors are added, the residuals start to converge: in the plot you can see that they are still fairly spread out after three iterations, but after 18 they are nearly flat.

Previously, with AdaBoost, we were re-weighting the actual data points; here we are regressing on the residuals.
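A minimal sketch of this residual-fitting loop for regression (the synthetic data, tree depth, learning rate, and number of trees are illustrative assumptions, not values from the article):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
n_trees = 100
trees = []

# Start from a constant prediction (the mean), then repeatedly fit a small tree
# to the current residuals and add its scaled prediction to the ensemble.
f0 = y.mean()
pred = np.full_like(y, f0)
for _ in range(n_trees):
    residuals = y - pred                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                    # the next tree predicts the residuals
    pred += learning_rate * tree.predict(X)   # shrink each tree's contribution
    trees.append(tree)

def gb_predict(X_new):
    return f0 + learning_rate * sum(t.predict(X_new) for t in trees)
```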

One hyperparameter you end up tuning with gradient boosting is how many trees to create. A very common graph to plot for any gradient boosting model is the error against the number of estimators.

The more estimators you have, the more the model tends to overfit.
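One way to produce that plot with scikit-learn is staged_predict, which yields the ensemble's predictions after each additional tree; the dataset and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, random_state=42)
gbr.fit(X_train, y_train)

# Validation error after each stage (i.e., after each additional tree).
val_errors = [
    mean_squared_error(y_val, y_pred)
    for y_pred in gbr.staged_predict(X_val)
]

# Training error keeps dropping, but validation error eventually flattens or rises
# as more estimators are added: the overfitting pattern the plot reveals.
best_n = int(np.argmin(val_errors)) + 1
print(f"Lowest validation MSE at {best_n} trees")
```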

Regression vs. Classification in Gradient Boosting

Regression

- Start with a weak tree

- Minimize objective function (cost function) by finding residuals of prior tree

- Create a new tree using gradients from prior tree

Classification

- Start with a weak tree

- Maximize the log-likelihood by calculating the partial derivative of the log probability for all objects in the dataset

- Create a new tree using these gradients and add it to your ensemble using a pre-determined weight, which is the gradient descent learning rate.
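A simplified from-scratch sketch of the binary classification case (assuming 0/1 labels; the gradient of the log-likelihood with respect to the raw log-odds score is y minus the predicted probability, and the usual Newton-style leaf-value refinement is skipped here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gb_classifier_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Simplified binary gradient boosting: each tree fits the gradient of the log-likelihood."""
    # Start from the log-odds of the positive class.
    p0 = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p0 / (1 - p0))
    raw = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        grad = y - sigmoid(raw)                 # partial derivative of the log-likelihood w.r.t. the raw score
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, grad)                       # the new tree fits these gradients
        raw += learning_rate * tree.predict(X)  # added with a pre-determined weight: the learning rate
        trees.append(tree)
    return f0, trees

def gb_classifier_predict_proba(X, f0, trees, learning_rate=0.1):
    raw = f0 + learning_rate * sum(t.predict(X) for t in trees)
    return sigmoid(raw)
```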
