Train your first Super Learner Ensemble Model for classification

Machine-Learning Nov 16, 2021

Selecting a machine learning algorithm for a predictive modelling problem typically involves evaluating many different models and model configurations, often after extensive brainstorming and hyperparameter search.

Solution:

The super learner is an ensemble machine learning algorithm that combines all of the models and model configurations you might investigate for a predictive modelling problem and uses them to make a prediction as good as, or better than, any single model you may have investigated.


The super learner algorithm is an application of stacked generalization, commonly called stacking or blending, to k-fold cross-validation, where all models use the same k-fold splits of the data and a meta-model is fit on the out-of-fold predictions from each model.

Intuition:

Consider that you have already fit many different algorithms on your dataset, and some algorithms have been evaluated many times with different configurations. You may have many tens or hundreds of different models of your problem. Why not use all those models instead of the best model from the group?

The super learner algorithm involves first pre-defining the k-fold split of your data, then evaluating all different algorithms and algorithm configurations on the same split of the data. All out-of-fold predictions are then kept and used to train an algorithm that learns how to best combine the predictions.
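Before building it by hand below, note that the same idea is available off the shelf: scikit-learn's StackingClassifier fits base models with internal cross-validation and trains a final estimator on their out-of-fold predictions. A minimal sketch (the model choices and dataset parameters here are illustrative, not the ones used later in this tutorial):

```python
# Sketch: the super learner idea via scikit-learn's StackingClassifier,
# which builds out-of-fold predictions internally (cv=10) and fits a
# meta-model (final_estimator) on them.
from sklearn.datasets import make_blobs
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# small synthetic dataset for the sketch
X, y = make_blobs(n_samples=500, centers=2, n_features=20, cluster_std=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

stack = StackingClassifier(
	estimators=[('lr', LogisticRegression(solver='liblinear')),
				('dt', DecisionTreeClassifier()),
				('nb', GaussianNB())],
	final_estimator=LogisticRegression(solver='liblinear'),
	cv=10,
	stack_method='predict_proba')  # stack class probabilities, as we do below
stack.fit(X_tr, y_tr)
print('Stacking accuracy: %.3f' % (accuracy_score(y_te, stack.predict(X_te)) * 100))
```

Building it manually, as we do next, makes each step of the procedure explicit.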

Ensemble Learning
  1. Performance: An ensemble can make better predictions and achieve better performance than any single contributing model.
  2. Robustness: An ensemble reduces the spread or dispersion of the predictions and model performance.

Let's get into the tutorial!

For this tutorial, we will use the make_blobs function, which generates blobs of points with a Gaussian distribution. You can control how many blobs to generate and the number of samples, as well as a host of other properties. This will serve as our dataset.
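A quick look at what make_blobs returns (a smaller dataset than the one used later, just to show the shapes):

```python
# make_blobs generates Gaussian clusters; X holds the features, y the cluster labels
from collections import Counter
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=20, cluster_std=2, random_state=1)
print(X.shape)     # (100, 20)
print(Counter(y))  # samples are split evenly between the 2 centers
```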

We will build a Super Learner ensemble and check its performance against the different models that the ensemble is made out of.

Step 1: Define the Models & Super Learner

# imports needed throughout the tutorial
from numpy import hstack, vstack, asarray
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# create a list of base models
def get_models():
	models = list()
	models.append(LogisticRegression(solver='liblinear'))
	models.append(DecisionTreeClassifier())
	models.append(SVC(gamma='scale', probability=True))
	models.append(GaussianNB())
	models.append(KNeighborsClassifier())
	models.append(AdaBoostClassifier())
	models.append(BaggingClassifier(n_estimators=10))
	models.append(RandomForestClassifier(n_estimators=10))
	models.append(ExtraTreesClassifier(n_estimators=10))
	return models
    
# make predictions with the super learner
def super_learner_predictions(X, models, meta_model):
	meta_X = list()
	# collect predicted probabilities from each base model as meta-features
	for model in models:
		yhat = model.predict_proba(X)
		meta_X.append(yhat)
	meta_X = hstack(meta_X)
	# predict with the meta-model
	return meta_model.predict(meta_X)
    
# fit all base models on the full training dataset
def fit_base_models(X, y, models):
	for model in models:
		model.fit(X, y)
 

# fit the meta-model on the out-of-fold predictions
def fit_meta_model(X, y):
	model = LogisticRegression(solver='liblinear')
	model.fit(X, y)
	return model

Step 2: Function to make out-of-fold predictions on k-fold splits

def get_out_of_fold_predictions(X, y, models):
	meta_X, meta_y = list(), list()
	# define the k-fold split
	kfold = KFold(n_splits=10, shuffle=True)
	# enumerate splits
	for train_ix, test_ix in kfold.split(X):
		fold_yhats = list()
		# get data
		train_X, test_X = X[train_ix], X[test_ix]
		train_y, test_y = y[train_ix], y[test_ix]
		meta_y.extend(test_y)
		# fit and make predictions with each sub-model
		for model in models:
			model.fit(train_X, train_y)
			yhat = model.predict_proba(test_X)
			fold_yhats.append(yhat)
		# store each model's fold predictions as columns
		meta_X.append(hstack(fold_yhats))
	return vstack(meta_X), asarray(meta_y)

Step 3: Function to evaluate predictions

# evaluate each base model on the hold-out dataset
def evaluate_models(X, y, models):
	for model in models:
		yhat = model.predict(X)
		acc = accuracy_score(y, yhat)
		print('%s: %.3f' % (model.__class__.__name__, acc*100))

Step 4: Putting it all together

# create the synthetic classification dataset
X, y = make_blobs(n_samples=2000, centers=2, n_features=200, cluster_std=20)

# split the data into train and validation sets
X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)

# get the list of base models
models = get_models()

# build the meta-dataset from out-of-fold predictions
meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
print('Meta ', meta_X.shape, meta_y.shape)

# fit the base models on the full training set
fit_base_models(X, y, models)

# fit the meta-model on the out-of-fold predictions
meta_model = fit_meta_model(meta_X, meta_y)

# evaluate each base model on the validation set
evaluate_models(X_val, y_val, models)

# evaluate the super learner on the validation set
yhat = super_learner_predictions(X_val, models, meta_model)
print('Super Learner: %.3f' % (accuracy_score(y_val, yhat) * 100))

Find the output below:

Super Learner Results

We can see that standalone SVC and GaussianNB give similar or even better performance than the Super Learner. While that can happen on this randomly generated dataset, it does not hold in general: across datasets and runs, the super learner typically performs as well as, or better than, the best single model.

[Optional]

Find the public notebook with full implementation below:

[Bonus]

Ensembling is a very popular go-to solution for a lot of problems on Kaggle. Check out Analytics Vidhya's guide to Ensemble Learning here.

Cheers!

PS: I know this blog is coming a little late; the month of Oct '21 has been (a little extra) crazy. I'll try to write some interesting blogs ahead of time in order to stick to the timelines. Thanks :)
