Selecting a machine learning algorithm for a predictive modelling problem involves a lot of brainstorming, followed by evaluating many different models and model configurations.
The super learner is an ensemble machine learning algorithm that combines all of the models and model configurations that you might investigate for a predictive modelling problem and uses them to make a prediction as-good-as or better than any single model that you may have investigated.
The super learner algorithm is an application of stacked generalization, often called stacking or blending, to k-fold cross-validation, where all models use the same k-fold splits of the data and a meta-model is fit on the out-of-fold predictions from each model.
Consider that you have already fit many different algorithms on your dataset, and some algorithms have been evaluated many times with different configurations. You may have many tens or hundreds of different models of your problem. Why not use all those models instead of the best model from the group?
The super learner algorithm involves first pre-defining the k-fold split of your data, then evaluating all different algorithms and algorithm configurations on the same split of the data. All out-of-fold predictions are then kept and used to train an algorithm that learns how to best combine the predictions.
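The procedure described above can be sketched in a few lines. This is a minimal illustration, not the full tutorial code: the two base models, the small synthetic dataset from make_classification, and the logistic regression meta-model are all assumptions chosen to keep the example short. It uses scikit-learn's cross_val_predict to collect the out-of-fold predictions.

```python
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Small synthetic binary dataset (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# 1. Pre-define one k-fold split; the fixed random_state means every
#    base model sees exactly the same folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
base_models = [
    LogisticRegression(solver='liblinear'),
    DecisionTreeClassifier(random_state=1),
]

# 2. Collect out-of-fold class probabilities from each base model.
oof = [
    cross_val_predict(model, X, y, cv=kfold, method='predict_proba')
    for model in base_models
]
meta_X = hstack(oof)  # one block of probability columns per base model

# 3. Train the meta-model on the out-of-fold predictions.
meta_model = LogisticRegression(solver='liblinear').fit(meta_X, y)
print(meta_X.shape)  # (200, 4): 2 models x 2 class probabilities
```

Because cross_val_predict returns predictions in the original row order, the meta-features line up with y directly, and no separate bookkeeping of fold labels is needed in this sketch.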
Why is ensembling so popular?
- Performance: An ensemble can make better predictions and achieve better performance than any single contributing model.
- Robustness: An ensemble reduces the spread or dispersion of the predictions and model performance.
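The robustness point can be illustrated with a small simulation. It is a toy sketch under a strong assumption: ten hypothetical models whose errors are independent with unit variance. Real models trained on the same data usually have correlated errors, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

# 1000 trials, 10 hypothetical models per trial, each with independent noise
preds = true_value + rng.normal(0.0, 1.0, size=(1000, 10))

single_std = preds[:, 0].std()           # spread of one model's predictions
ensemble_std = preds.mean(axis=1).std()  # spread of the averaged prediction

print('single model spread: %.3f' % single_std)
print('ensemble spread:     %.3f' % ensemble_std)
```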
Let's get into the tutorial!
For this tutorial, we will use scikit-learn's make_blobs function, which generates blobs of points with a Gaussian distribution. You can control how many blobs to generate and the number of samples to generate, as well as a host of other properties. This will serve as our dataset.
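As a quick look at what scikit-learn's make_blobs returns, here is a small sketch. The parameter values are arbitrary choices for illustration and are not the ones used later in the tutorial.

```python
from numpy import unique
from sklearn.datasets import make_blobs

# 300 points drawn from 3 Gaussian blobs in 2 dimensions
X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=1.5, random_state=7)

print(X.shape, y.shape)    # (300, 2) (300,)
print(unique(y).tolist())  # [0, 1, 2], one integer label per blob
```

Here centers sets the number of blobs, n_features the dimensionality, and cluster_std how widely each blob is spread; a larger cluster_std makes the classes overlap more and the problem harder.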
We will build a Super Learner ensemble and check its performance against the different models that the ensemble is made out of.
Step 1: Define the Models & Super Learner
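from numpy import hstack, vstack, asarray
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# create the list of base models
def get_models():
    models = list()
    models.append(LogisticRegression(solver='liblinear'))
    models.append(DecisionTreeClassifier())
    models.append(SVC(gamma='scale', probability=True))
    models.append(GaussianNB())
    models.append(KNeighborsClassifier())
    models.append(AdaBoostClassifier())
    models.append(BaggingClassifier(n_estimators=10))
    models.append(RandomForestClassifier(n_estimators=10))
    models.append(ExtraTreesClassifier(n_estimators=10))
    return models

# make predictions with the stacked model
def super_learner_predictions(X, models, meta_model):
    meta_X = list()
    for model in models:
        yhat = model.predict_proba(X)
        meta_X.append(yhat)
    meta_X = hstack(meta_X)
    # predict
    return meta_model.predict(meta_X)

# fit all base models on the full training dataset
def fit_base_models(X, y, models):
    for model in models:
        model.fit(X, y)

# fit the meta-model on the out-of-fold predictions
def fit_meta_model(X, y):
    model = LogisticRegression(solver='liblinear')
    model.fit(X, y)
    return model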
Step 2: Function to make predictions on k-fold splits
def get_out_of_fold_predictions(X, y, models):
    meta_X, meta_y = list(), list()
    kfold = KFold(n_splits=10, shuffle=True)
    for train_ix, test_ix in kfold.split(X):
        fold_yhats = list()
        # get data
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        meta_y.extend(test_y)
        # fit and make predictions with each sub-model
        for model in models:
            model.fit(train_X, train_y)
            yhat = model.predict_proba(test_X)
            fold_yhats.append(yhat)
        meta_X.append(hstack(fold_yhats))
    return vstack(meta_X), asarray(meta_y)
Step 3: Function to evaluate predictions
def evaluate_models(X, y, models):
    for model in models:
        yhat = model.predict(X)
        acc = accuracy_score(y, yhat)
        print('%s: %.3f' % (model.__class__.__name__, acc*100))
Step 4: Putting it all together
# create the dataset and hold out half of it for validation
X, y = make_blobs(n_samples=2000, centers=2, n_features=200, cluster_std=20)
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)

# get the base models and their out-of-fold predictions
models = get_models()
meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
print('Meta ', meta_X.shape, meta_y.shape)

# fit the base models on the full training set, then fit the meta-model
fit_base_models(X, y, models)
meta_model = fit_meta_model(meta_X, meta_y)

# evaluate the base models and the super learner on the holdout set
evaluate_models(X_val, y_val, models)
yhat = super_learner_predictions(X_val, models, meta_model)
print('Super Learner: %.3f' % (accuracy_score(y_val, yhat) * 100))
Find the output below:
We can see that standalone SVC and GaussianNB give similar or even better performance than the Super Learner. While that might be the case with the current randomly generated dataset, the same will not hold for every dataset.
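For comparison, scikit-learn ships a StackingClassifier that follows a closely related recipe: it trains the final estimator on cross-validated predictions of the base estimators and refits the base estimators on the full training data. The sketch below is not the tutorial's code; the three base models and the smaller dataset are assumptions made to keep it fast, and the accuracy it prints depends on the random data.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Smaller than the tutorial dataset so the sketch runs quickly (assumption)
X, y = make_blobs(n_samples=500, centers=2, n_features=20,
                  cluster_std=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=1)

stack = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(solver='liblinear')),
        ('nb', GaussianNB()),
        ('knn', KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(solver='liblinear'),
    cv=KFold(n_splits=10, shuffle=True, random_state=1),
    stack_method='predict_proba',  # feed class probabilities to the meta-model
)
stack.fit(X_train, y_train)

acc = accuracy_score(y_val, stack.predict(X_val))
print('StackingClassifier: %.3f' % (acc * 100))
```

Running both versions with several different random seeds is a simple way to check whether a gap between the ensemble and the best single model is consistent or just noise from one random dataset.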
Find the public notebook with full implementation below:
PS: I know this blog is coming a little late; month of Oct'21 has been (a little extra) crazy. I'll try to write some interesting blogs ahead of time in order to stick to the timelines. Thanks :)