Train your first KNN Model for Collaborative Filtering
From 2006 to 2009, Netflix ran an open competition for the best collaborative filtering algorithm to predict user ratings for films. Predictions had to be based on previous ratings alone, with no other information about the users or films, which were identified only by numbers assigned for the contest.
Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating is a quadruplet of the form <user, movie, date of grade, grade>. The user and movie fields are integer IDs, while grades are integer star ratings from 1 to 5. The winning team built a model that beat Netflix's own by 10.06%. However, Netflix never put the final prize-winning model into its production system.
Now, why is that so?
Well, because the cost of productionising the complex final model outweighed the benefit of its slightly better accuracy. The thing is, in an industry setting Machine Learning models are judged on more than accuracy: training cost, how much real-world improvement they deliver over the previous model, whether the model is explainable (very hard with deep neural networks), how it responds to data distribution shift, and so on.
So, in today's tutorial, we will train a very tractable movie recommendation model using KNN-based Collaborative Filtering on the MovieLens dataset...
... but there is a catch:
Since the dataset contains many movies that have only a few ratings, we will run our model only on popular movies that have a high number of ratings.

What is Collaborative Filtering?
Collaborative filtering builds a model from a user’s past behaviours (items previously selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in.
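To make the idea concrete, here is a minimal sketch of item-to-item similarity on a made-up 4x4 ratings table. The numbers and the helper name item_cosine_similarity are purely illustrative assumptions, not part of the MovieLens data or of any library.

```python
import numpy as np

# Toy ratings (invented for illustration): rows are users, columns are items,
# and 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_cosine_similarity(r):
    """Cosine similarity between every pair of item columns."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items nobody rated
    normalized = r / norms
    return normalized.T @ normalized

sim = item_cosine_similarity(ratings)

# For item 0, list the other items from most to least similar.
neighbours_of_item0 = [int(j) for j in np.argsort(-sim[0]) if j != 0]
print(neighbours_of_item0)
```

Users who liked item 0 mostly liked item 1 as well, so item 1 comes out as the closest neighbour of item 0. That is the whole intuition behind item-based collaborative filtering.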

Why are we using KNN?
- KNN is a natural go-to model for item-based collaborative filtering and a very good baseline for recommender system development.
- KNN does not make any assumptions about the underlying data distribution; it relies only on the similarity between item feature vectors.
Let's get into the tutorial!
Step 1: Import and Preprocess the Dataset
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

movies_filename = '../input/movielens-20m-dataset/movie.csv'
ratings_filename = '../input/movielens-20m-dataset/rating.csv'
# Load only the columns we need, with compact dtypes to save memory
df_movies = pd.read_csv(
    movies_filename,
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})
df_ratings = pd.read_csv(
    ratings_filename,
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})
# Number of ratings per movie, and the upper quantiles of that distribution
df_movies_cnt = pd.DataFrame(df_ratings.groupby('movieId').size(), columns=['count'])
df_movies_cnt['count'].quantile(np.arange(1, 0.6, -0.05))
# Keep only popular movies (at least popularity_thres ratings)
popularity_thres = 50
popular_movies = list(set(df_movies_cnt.query('count >= @popularity_thres').index))
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(popular_movies)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping unpopular movies: ', df_ratings_drop_movies.shape)
# Count ratings per user on the remaining movies
df_users_cnt = pd.DataFrame(df_ratings_drop_movies.groupby('userId').size(), columns=['count'])
# Keep only active users (at least ratings_thres ratings)
ratings_thres = 50
active_users = list(set(df_users_cnt.query('count >= @ratings_thres').index))
df_ratings_drop_users = df_ratings_drop_movies[df_ratings_drop_movies.userId.isin(active_users)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping both unpopular movies and inactive users: ', df_ratings_drop_users.shape)
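If the two filters feel abstract, the same count-then-keep pattern can be tried on a tiny hand-made table. This is only a sketch: the toy rows, the threshold of 2 and the helper name filter_by_count are assumptions made for illustration.

```python
import pandas as pd

# Invented ratings: movie 30 has one rating, user 3 has one rating.
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.5, 5.0, 4.5, 2.0, 3.0],
})

def filter_by_count(df, column, min_count):
    """Keep rows whose value in `column` appears at least `min_count` times."""
    counts = df.groupby(column).size()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)]

popular = filter_by_count(toy, 'movieId', 2)    # drops movie 30
active = filter_by_count(popular, 'userId', 2)  # drops user 3
print(active)
```

Note that users are counted after the unpopular movies are gone, so a user is judged only on the ratings that will actually end up in the matrix.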
Step 2: Pivot the dataset to pass onto the Model
# Pivot and create movie-user matrix
movie_user_mat = df_ratings_drop_users.pivot(index='movieId', columns='userId', values='rating').fillna(0)
# create mapper from movie title to index
movie_to_idx = {
    movie: i for i, movie in
    enumerate(list(df_movies.set_index('movieId').loc[movie_user_mat.index].title))
}
# transform matrix to scipy sparse matrix
movie_user_mat_sparse = csr_matrix(movie_user_mat.values)
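The dense pivot can get memory-hungry on large data, because every missing rating is stored as an explicit 0 before the conversion. One possible alternative, sketched below on invented toy rows, is to build the sparse matrix directly from category codes. It assumes each (user, movie) pair appears at most once; csr_matrix would sum duplicates, whereas pivot would raise an error.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Invented ratings, one row per (user, movie) pair.
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Categorical codes give each movie/user a dense 0..n-1 position (sorted by ID).
movie_cat = toy['movieId'].astype('category')
user_cat = toy['userId'].astype('category')

sparse_mat = csr_matrix(
    (toy['rating'].to_numpy(),
     (movie_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(movie_cat.cat.categories), len(user_cat.cat.categories)),
)

# The pivot-based route, for comparison on this small example.
dense_mat = toy.pivot(index='movieId', columns='userId', values='rating').fillna(0)
print((sparse_mat.toarray() == dense_mat.to_numpy()).all())
```

With this route, movie_cat.cat.categories plays the role of movie_user_mat.index when you build the title-to-index mapper.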
Step 3: Defining the Model and Training
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(movie_user_mat_sparse)
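To see what the fitted model returns, here is a small sketch on an invented 4-movie matrix (rows are movies, columns are users). With metric='cosine', the distance reported by scikit-learn is 1 minus the cosine similarity, and the nearest neighbour of any row is the row itself.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Invented movie-user matrix: movies 0 and 1 are liked by the same users.
item_user = csr_matrix(np.array([
    [5, 4, 1, 0],
    [4, 5, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float))

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=2)
knn.fit(item_user)

# Query with movie 0: the first hit is movie 0 itself, the second is movie 1.
distances, indices = knn.kneighbors(item_user[0], n_neighbors=2)
print(indices, distances)
```

This is why the recommendation logic asks for one neighbour more than the number of recommendations it wants to show.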
Step 4: Setup Recommendation Logic
def make_recommendation(model_knn, data, mapper, fav_movie, n_recommendations):
    # fit the model on the movie-user matrix
    model_knn.fit(data)
    print('You have input movie:', fav_movie)
    # fuzzy_matching is a helper function that looks up the movie name in the mapper and returns its index
    idx = fuzzy_matching(mapper, fav_movie, verbose=True)
    print('Recommendation system starts to make inference')
    print('......\n')
    # ask for one extra neighbour, because the closest match is the movie itself
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    # sort by distance, drop the movie itself (position 0) and reverse,
    # so the list runs from the farthest neighbour to the nearest one
    raw_recommends = \
        sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper (index -> movie title)
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_movie))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))
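The fuzzy_matching helper is called above but its body is not shown in this post. If you want to run the code end to end, a minimal stand-in could look like the sketch below. It is an assumption, not the original helper: it uses difflib from the standard library, and the min_ratio parameter and the toy_mapper titles are invented for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_matching(mapper, fav_movie, verbose=True, min_ratio=0.6):
    """Return the matrix index of the title closest to fav_movie, or None."""
    query = fav_movie.lower()
    scored = []
    for title, idx in mapper.items():
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common
        ratio = SequenceMatcher(None, title.lower(), query).ratio()
        if ratio >= min_ratio:
            scored.append((title, idx, ratio))
    if not scored:
        if verbose:
            print('Oops! No match is found')
        return None
    # best match first
    scored.sort(key=lambda t: t[2], reverse=True)
    if verbose:
        print('Found possible matches:', [t[0] for t in scored])
    return scored[0][1]

# Hypothetical title-to-index mapper, shaped like movie_to_idx.
toy_mapper = {
    'Shawshank Redemption, The (1994)': 0,
    'Toy Story (1995)': 1,
    'Heat (1995)': 2,
}
match_idx = fuzzy_matching(toy_mapper, 'Shawshank Redemption', verbose=False)
```

Because the stand-in returns None when nothing matches, it is worth checking the result before indexing into the matrix with it.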
Step 5: Evaluation
my_favorite = 'Shawshank Redemption'
make_recommendation(
    model_knn=model_knn,
    data=movie_user_mat_sparse,
    fav_movie=my_favorite,
    mapper=movie_to_idx,
    n_recommendations=10)

[Optional]
Check out the public notebook here:
[Bonus]
Item-based collaborative filtering suffers from popularity bias and the cold-start problem (the inability to recommend new or lesser-known items because they have few interactions). To mitigate this, one can use Alternating Least Squares (ALS) matrix factorisation for collaborative filtering. A very good tutorial for this can be found here.
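For a rough feel of what ALS does, here is a bare-bones sketch in NumPy on an invented 4x4 ratings table. It is the plain explicit-feedback variant with a single regularisation weight, written for illustration only; the function names, the rank k=2 and reg=0.1 are assumptions, and production libraries use more refined versions.

```python
import numpy as np

def regularized_loss(R, mask, U, V, reg):
    """Squared error on observed ratings plus an L2 penalty on both factors."""
    err = (R - U @ V.T)[mask]
    return float((err ** 2).sum() + reg * ((U ** 2).sum() + (V ** 2).sum()))

def als(R, mask, k=2, reg=0.1, n_iters=20, seed=0):
    """Alternate between solving for user factors U and item factors V."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    ridge = reg * np.eye(k)
    for _ in range(n_iters):
        # Fix V: each user's factors are a small ridge regression
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Fix U: same thing for each item
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

# Invented ratings: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
mask = R > 0

U1, V1 = als(R, mask, n_iters=1)
U20, V20 = als(R, mask, n_iters=20)
print(regularized_loss(R, mask, U1, V1, 0.1), regularized_loss(R, mask, U20, V20, 0.1))
```

The product U20 @ V20.T also fills in the unrated cells, which is how a factorisation model can score items that have very few direct interactions.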
Cheers!