Train your first KNN Model for Collaborative Filtering

Recommendation-Systems Dec 20, 2021

From 2006 to 2009, Netflix ran an open competition for the best collaborative filtering algorithm to predict user ratings for films based solely on previous ratings, with no other information about the users or films: both were identified only by numbers assigned for the contest.

Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating is a quadruplet of the form <user, movie, date of grade, grade>. The user and movie fields are integer IDs, while grades are integer star ratings from 1 to 5. The winning team built a model that bested Netflix's own by 10.06%. However, Netflix never used the final prize-winning model in its production system.

Browsing Netflix
Photo by Charles Deluvio / Unsplash

Now, why is that so?

Because the cost of productionising the complex final model outweighed the benefit of its slightly better accuracy. In an industry setting, machine learning models are judged on more than accuracy: training cost, the size of the real-world improvement over the previous model, whether the model is explainable (very hard with deep neural networks), how the model responds to data distribution shift, and so on.

So, in today's tutorial, we will train a very tractable movie recommendation model using KNN-based collaborative filtering on the MovieLens dataset...

... but there is a catch:

Since the dataset contains many movies that have received only a few ratings, we will run our model only on popular movies with a high number of ratings.

Imbalance in Rating from ID 0 to 25000

What is Collaborative Filtering?

Collaborative filtering builds a model from a user's past behaviour (items previously selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.
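
To make the idea concrete, here is a minimal sketch of an item-based prediction on a made-up rating matrix. The data, the function names `cosine_similarity` and `predict_rating`, and the similarity-weighted average are all illustrative assumptions, not part of the tutorial's notebook. Unrated entries are stored as 0, the same convention the tutorial uses later when it fills missing ratings.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are made up for illustration.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, item):
    """Similarity-weighted average of the user's ratings on the other items."""
    rated = [j for j in range(ratings.shape[1])
             if j != item and ratings[user, j] > 0]
    sims = np.array([cosine_similarity(ratings[:, item], ratings[:, j])
                     for j in rated])
    return float(sims @ ratings[user, rated] / sims.sum())

# User 0 has not rated item 2. Item 3 (which user 0 disliked) is rated much
# like item 2 by the other users, so it dominates the prediction.
prediction = predict_rating(ratings, user=0, item=2)
print(round(prediction, 2))  # 2.09
```

The prediction is low because the item most similar to item 2 is one this user rated poorly, which is exactly the "similar decisions made by other users" signal at work.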

Why are we using KNN?

  1. KNN is a natural fit for item-based collaborative filtering and a very good baseline for recommender system development.
  2. KNN makes no assumptions about the underlying data distribution; it relies only on the similarity between item feature vectors.

Let's get into the tutorial!

Step 1: Import and Preprocess the Dataset

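import numpy as np
import pandas as pd

# Paths to the MovieLens 20M CSV files
movies_filename = '../input/movielens-20m-dataset/movie.csv'
ratings_filename = '../input/movielens-20m-dataset/rating.csv'

# Load only the columns we need, with compact dtypes to save memory
df_movies = pd.read_csv(
    movies_filename,
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

# Count ratings per movie and inspect the upper quantiles of that distribution
df_movies_cnt = pd.DataFrame(df_ratings.groupby('movieId').size(), columns=['count'])
df_movies_cnt['count'].quantile(np.arange(1, 0.6, -0.05))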
# Filter Data to take out only Popular Movies
popularity_thres = 50
popular_movies = list(set(df_movies_cnt.query('count >= @popularity_thres').index))
df_ratings_drop_movies = df_ratings[df_ratings.movieId.isin(popular_movies)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping unpopular movies: ', df_ratings_drop_movies.shape)

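# Filter out inactive users
ratings_thres = 50
# Count ratings per user. The source never defines df_users_cnt; counting on the
# already-filtered ratings is an assumption.
df_users_cnt = pd.DataFrame(df_ratings_drop_movies.groupby('userId').size(), columns=['count'])
active_users = list(set(df_users_cnt.query('count >= @ratings_thres').index))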
df_ratings_drop_users = df_ratings_drop_movies[df_ratings_drop_movies.userId.isin(active_users)]
print('shape of original ratings data: ', df_ratings.shape)
print('shape of ratings data after dropping both unpopular movies and inactive users: ', df_ratings_drop_users.shape)
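
The two filters above follow the same pattern: count rows per key, then keep only the keys that meet a threshold. A small sketch of that pattern on made-up data is shown below. The helper name `filter_by_min_count` is hypothetical, and applying the movie filter before the user filter is one reasonable reading of the code above rather than a confirmed detail.

```python
import pandas as pd

def filter_by_min_count(df, key, min_count):
    """Keep only rows whose `key` value appears at least `min_count` times."""
    counts = df.groupby(key).size()
    keep = counts[counts >= min_count].index
    return df[df[key].isin(keep)]

# Made-up ratings: movie 30 has one rating, user 3 has one rating.
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Drop rarely rated movies first, then drop users with too few remaining ratings.
popular = filter_by_min_count(toy, 'movieId', min_count=2)
active = filter_by_min_count(popular, 'userId', min_count=2)
print(active)
```

Because the user counts are taken after the movie filter, a user whose ratings were mostly on dropped movies can fall below the threshold even if they were active in the raw data.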

Step 2: Pivot the dataset to pass onto the Model

# Pivot and create movie-user matrix
movie_user_mat = df_ratings_drop_users.pivot(index='movieId', columns='userId', values='rating').fillna(0)
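# create mapper from movie title to index
# (the source line is truncated after "in"; the enumerate(...) expression is an assumed completion)
movie_to_idx = {
    movie: i for i, movie in
    enumerate(list(df_movies.set_index('movieId').loc[movie_user_mat.index].title))
}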
# transform matrix to scipy sparse matrix
from scipy.sparse import csr_matrix
movie_user_mat_sparse = csr_matrix(movie_user_mat.values)
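
To see what the pivot and the sparse conversion produce, here is a small sketch on made-up ratings. The toy DataFrame and the `density` calculation are illustrative additions, not part of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Rows are movies, columns are users; unrated pairs become 0.
mat = toy.pivot(index='movieId', columns='userId', values='rating').fillna(0)
sparse = csr_matrix(mat.values)

# Only the non-zero ratings are stored explicitly in the sparse matrix.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(mat)
print('stored ratings:', sparse.nnz, 'density:', round(density, 2))
```

On real rating data most movie-user pairs are empty, so the sparse representation is typically far smaller than the dense pivot table.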

Step 3: Defining the Model and Training

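from sklearn.neighbors import NearestNeighbors

# Define the model: brute-force search with cosine distance, using all CPU cores
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# "Training" a KNN model simply stores the item vectors for later neighbour lookups
model_knn.fit(movie_user_mat_sparse)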
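
The following sketch shows the same model on a tiny made-up item-user matrix, so the neighbour lookup can be checked by hand. The matrix values and the variable names `item_user` and `knn` are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are items, columns are users (made-up ratings, 0 = not rated).
item_user = csr_matrix(np.array([
    [5, 4, 1, 0],   # item 0
    [4, 5, 0, 1],   # item 1: rated much like item 0
    [0, 1, 5, 4],   # item 2
    [1, 0, 4, 5],   # item 3: rated much like item 2
], dtype=float))

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=2)
knn.fit(item_user)

# Ask for the 2 nearest neighbours of item 0. The first hit is item 0 itself
# (cosine distance of about 0), the second is its most similar other item.
distances, indices = knn.kneighbors(item_user[0], n_neighbors=2)
print(indices.squeeze().tolist(), distances.squeeze().round(3).tolist())
```

Since the query item is always returned as its own nearest neighbour, a recommender that wants n results needs to request n + 1 neighbours and discard the first.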

Step 4: Setup Recommendation Logic

def make_recommendation(model_knn, data, mapper, fav_movie, n_recommendations):
    print('You have input movie:', fav_movie)
    # fuzzy_matching is a helper function that checks whether the movie name is
    # present in the mapper and returns its index
    idx = fuzzy_matching(mapper, fav_movie, verbose=True)
    print('Recommendation system starting inference')
    distances, indices = model_knn.kneighbors(data[idx], n_neighbors=n_recommendations+1)
    # sort neighbours by distance, then reverse and drop the closest one ([:0:-1]),
    # which is normally the query movie itself; results run from farthest to nearest
    raw_recommends = \
        sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    # get reverse mapper
    reverse_mapper = {v: k for k, v in mapper.items()}
    # print recommendations
    print('Recommendations for {}:'.format(fav_movie))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance of {2}'.format(i+1, reverse_mapper[idx], dist))
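
The `fuzzy_matching` helper is called above but not defined in this post. Below is a hypothetical stand-in built on the standard library's `difflib`, so the recommendation function has something to call. The `min_ratio` threshold, the matching strategy, and the printed messages are all assumptions and may differ from the helper used in the original notebook.

```python
from difflib import SequenceMatcher

def fuzzy_matching(mapper, fav_movie, verbose=True, min_ratio=0.6):
    """Return the index of the title in `mapper` that best matches `fav_movie`.

    Hypothetical stand-in: `min_ratio` and the use of difflib are assumptions,
    not the tutorial's original implementation.
    """
    query = fav_movie.lower()
    matches = []
    for title, idx in mapper.items():
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common
        ratio = SequenceMatcher(None, title.lower(), query).ratio()
        if ratio >= min_ratio:
            matches.append((title, idx, ratio))
    if not matches:
        print('Oops! No match is found')
        return None
    # best match first
    matches.sort(key=lambda m: m[2], reverse=True)
    if verbose:
        print('Found possible matches:', [m[0] for m in matches])
    return matches[0][1]

# Made-up mapper from title to matrix row index
toy_mapper = {
    'Shawshank Redemption, The (1994)': 0,
    'Toy Story (1995)': 1,
    'Godfather, The (1972)': 2,
}
idx = fuzzy_matching(toy_mapper, 'Shawshank Redemption', verbose=False)
print(idx)
```

Note that this sketch returns `None` when nothing matches, so a caller such as `make_recommendation` would need to check for that case before indexing the data.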

Step 5: Evaluation

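my_favorite = 'Shawshank Redemption'

# The source does not show the call; n_recommendations=10 is an assumed value
make_recommendation(
    model_knn=model_knn,
    data=movie_user_mat_sparse,
    mapper=movie_to_idx,
    fav_movie=my_favorite,
    n_recommendations=10)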

Results Obtained


Check out the public notebook here:


Item-based collaborative filtering suffers from popularity bias and the cold-start problem (the inability to recommend new or lesser-known items because they have few interactions). To mitigate this, one can use Alternating Least Squares (ALS) matrix factorisation for collaborative filtering; a very good tutorial on this can be found here.