In this tutorial, we will dive into recommendation systems.
You might not know what recommendation systems are, but you see them everywhere on the internet.
Every time you shop on Amazon and see related products…
Or when Netflix recommends you something interesting to watch…
The purpose of a recommendation system is to predict a rating that a user will give to an item that they have not yet rated.
This rating is produced by analyzing either item characteristics or other user/item ratings (or both) to provide personalized recommendations to users.
There are two main approaches to recommendation systems:
- Content Filtering. Recommendations depend on item characteristics.
- Collaborative Filtering. Recommendations depend on user-item ratings.
In this tutorial we will work with the MovieLens dataset. This dataset contains user-generated movie ratings from the website MovieLens (https://movielens.org/).
It contains multiple files, but the ones we will use in this tutorial will be movies.dat and ratings.dat.
First we will download the dataset:
wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
unzip ml-1m.zip
cd ml-1m/
Content Filtering
Here are the first rows of the movies.dat file. The file follows the format:
movie_id::movie_title::movie genre(s)
head movies.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
Genres are separated by a pipe (|).
Now we load the movies file:
import pandas as pd
import numpy as np
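# '::' is a multi-character separator, so pandas needs the Python parsing engine;
# latin-1 is assumed here because some titles contain accented characters that are not valid UTF-8
movies_df = pd.read_table('movies.dat', header=None, sep='::', names=['movie_id', 'movie_title', 'movie_genre'], engine='python', encoding='latin-1')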
movies_df.head()
Out[]:
movie_id | movie_title | movie_genre | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children’s|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children’s|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
In order to be able to work with the movie_genre column, we need to transform it into what are called “dummy variables”.
This is a way to convert a categorical variable (e.g. Animation, Comedy, Romance…) into multiple columns (one column named Action, one named Comedy, etc.).
For each movie, a dummy column has a value of 1 if the movie belongs to that genre and 0 otherwise.
# we convert the movie genres to a set of dummy variables
movies_df = pd.concat([movies_df, movies_df.movie_genre.str.get_dummies(sep='|')], axis=1)
movies_df.head()
Out[]:
movie_id | movie_title | movie_genre | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | ... | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... |
So, for example, the movie with an id of 1, Toy Story, belongs to the genres Animation, Children’s and Comedy, and thus the columns Animation, Children’s and Comedy have a value of 1.
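To see the dummy-variable conversion in isolation, here is a minimal sketch on a made-up three-row frame (the titles and the names toy_df and genre_dummies are illustrative, not part of the MovieLens data):

```python
import pandas as pd

# Toy data in the same "genre|genre" format as movies.dat (made-up rows).
toy_df = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B', 'Movie C'],
    'movie_genre': ['Action|Thriller', 'Comedy', 'Action|Comedy'],
})

# One 0/1 column per distinct genre; the columns come back in alphabetical order.
genre_dummies = toy_df['movie_genre'].str.get_dummies(sep='|')
toy_with_dummies = pd.concat([toy_df, genre_dummies], axis=1)
print(toy_with_dummies)
```

Each row gets a 1 in the columns for its own genres and a 0 everywhere else, which is the representation the scoring code relies on.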
movie_categories = movies_df.columns[3:]
movies_df.loc[0]
Out[]:
Content filtering is a simple way to build a recommendation system. Here, items (in this example movies) are mapped to a set of features (genres).
To recommend an item to a user, that user first has to provide their preferences regarding those features.
So in this example, the user has to tell the system how much they like each movie genre.
Right now we have all the movies mapped into genres. We just need to create a user and map that user into those genres.
Let’s create a user with a strong preference for action, adventure, fantasy and sci-fi movies.
from collections import OrderedDict
# start with an empty mapping and fill in one preference (from 1 to 5) per genre
user_preferences = OrderedDict()
user_preferences['Action'] = 5
user_preferences['Adventure'] = 5
user_preferences['Animation'] = 1
user_preferences["Children's"] = 1
user_preferences["Comedy"] = 3
user_preferences['Crime'] = 2
user_preferences['Documentary'] = 1
user_preferences['Drama'] = 1
user_preferences['Fantasy'] = 5
user_preferences['Film-Noir'] = 1
user_preferences['Horror'] = 2
user_preferences['Musical'] = 1
user_preferences['Mystery'] = 3
user_preferences['Romance'] = 1
user_preferences['Sci-Fi'] = 5
user_preferences['War'] = 3
user_preferences['Thriller'] = 2
user_preferences['Western'] = 1
Once we have users with their movie genre preferences and the movies mapped into genres, to compute the score of a movie for a specific user, we just need to calculate the dot product of that movie genre vector with that user preferences vector.
# in production you would use np.dot instead of writing your own dot product function.
def dot_product(vector_1, vector_2):
    return sum(i * j for i, j in zip(vector_1, vector_2))

def get_movie_score(movie_features, user_preferences):
    return dot_product(movie_features, user_preferences)
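Because a dot product pairs values purely by position, the preference values have to be in the same order as the genre columns. The sketch below illustrates this with made-up genres and a hypothetical helper, score_by_name, that looks each preference up by genre name before calling np.dot:

```python
import numpy as np
import pandas as pd

# Hypothetical genre columns, in the alphabetical order get_dummies would produce.
genres = ['Action', 'Comedy', 'Thriller', 'War']

# Preferences entered in a different order than the columns (War before Thriller).
prefs = {'Action': 5, 'Comedy': 3, 'War': 3, 'Thriller': 2}

# A made-up movie tagged Action|Thriller.
movie_features = pd.Series([1, 0, 1, 0], index=genres)

def score_by_name(features, preferences):
    """Dot product after ordering the preferences like the feature index."""
    weights = np.array([preferences[g] for g in features.index])
    return int(np.dot(features.to_numpy(), weights))

aligned_score = score_by_name(movie_features, prefs)
positional_score = int(np.dot(movie_features.to_numpy(), list(prefs.values())))

print(aligned_score)     # 7: Action (5) + Thriller (2)
print(positional_score)  # 8: the Thriller column is paired with War's weight
```

The two results differ only because of ordering, so looking values up by name is the safer habit whenever the preferences come from a dict.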
Let’s compute the score of the movie ‘Toy Story’ (a children’s animation movie) for the sample user.
toy_story_features = movies_df.loc[0][movie_categories]
toy_story_features
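# look the preferences up by genre name so they line up with the genre columns
toy_story_user_predicted_score = dot_product(toy_story_features, [user_preferences[genre] for genre in movie_categories])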
toy_story_user_predicted_score
Out[]:
5
So for this user, Toy Story has a score of 5. That does not mean much by itself, but it lets us compare how good a recommendation Toy Story is relative to other movies.
Let’s calculate the score for Die Hard (a thrilling action movie):
movies_df[movies_df.movie_title.str.contains('Die Hard')]
movie_id | movie_title | movie_genre | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | ... | |
---|---|---|---|---|---|---|---|---|---|---|---|
163 | 165 | Die Hard: With a Vengeance (1995) | Action|Thriller | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
1023 | 1036 | Die Hard (1988) | Action|Thriller | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
1349 | 1370 | Die Hard 2 (1990) | Action|Thriller | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
die_hard_id = 1036
die_hard_features = movies_df[movies_df.movie_id==die_hard_id][movie_categories]
die_hard_features.T
Out[]:
1023 | |
---|---|
Action | 1 |
Adventure | 0 |
Animation | 0 |
Children’s | 0 |
Comedy | 0 |
Crime | 0 |
Documentary | 0 |
Drama | 0 |
Fantasy | 0 |
Film-Noir | 0 |
Horror | 0 |
Musical | 0 |
Mystery | 0 |
Romance | 0 |
Sci-Fi | 0 |
Thriller | 1 |
War | 0 |
Western | 0 |
Note: 1023 is the dataframe row index for Die Hard, not the movie id in the MovieLens dataset.
# look each preference up by genre name so it lines up with the genre columns
# (user_preferences was filled with War before Thriller, unlike the columns)
die_hard_user_predicted_score = dot_product(die_hard_features.values[0], [user_preferences[genre] for genre in movie_categories])
die_hard_user_predicted_score
Out[]:
7
So we see that Die Hard gets a score of 7 (5 for Action plus 2 for Thriller) vs. 5 for Toy Story, so Die Hard would be recommended before Toy Story. That makes sense, given that this user’s preferences are skewed towards action-packed movies.
Once we know how to calculate the score for one movie, providing movie recommendations for the user is as easy as calculating the score for all the movies and returning those with the highest scores.
def get_movie_recommendations(user_preferences, n_recommendations):
    # order the preferences like the genre columns, since the dot product pairs values by position
    preference_vector = [user_preferences[genre] for genre in movie_categories]
    # we add a column to the movies_df dataset with the calculated score for each movie for the given user
    movies_df['score'] = movies_df[movie_categories].apply(get_movie_score,
                                                           args=(preference_vector,), axis=1)
    return movies_df.sort_values(by=['score'], ascending=False)['movie_title'][:n_recommendations]
get_movie_recommendations(user_preferences, 10)
Out[]:
So the system recommends heavy action and sci-fi movies. Neat!
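Scoring every movie one row at a time works, but the same result can be obtained in a single matrix-vector product. The following is a sketch on a made-up catalogue (toy_movies and toy_prefs are illustrative names), assuming the preferences are held in a pandas Series so that DataFrame.dot can match them to the genre columns by label:

```python
import pandas as pd

# Made-up catalogue: one row per movie, one 0/1 column per genre.
toy_movies = pd.DataFrame(
    {'Action': [1, 0, 1], 'Comedy': [0, 1, 1], 'Sci-Fi': [1, 0, 0]},
    index=['Movie A', 'Movie B', 'Movie C'],
)

# Preferences as a Series; the label order does not need to match the columns.
toy_prefs = pd.Series({'Sci-Fi': 5, 'Action': 5, 'Comedy': 3})

scores = toy_movies.dot(toy_prefs)   # one score per movie
top_2 = scores.nlargest(2)           # highest-scoring movies first
print(top_2)
```

Here Movie A scores 10 (Action + Sci-Fi) and Movie C scores 8 (Action + Comedy), so those two come back as the top recommendations.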
Content Filtering makes recommending to a new user very easy: users just have to express their preferences once. However, Content Filtering has some drawbacks:
- Each item needs to be mapped into the feature space. That means that any time a new item gets added, someone has to manually categorize that item.
- Recommendations are limited in scope. Items can only be described by the predefined features, so the system cannot capture characteristics outside them.
So content filtering may be too simple an option nowadays, which leads us to…
Collaborative Filtering
Collaborative filtering is another way of predicting user-item scores. This time though, we will use the existing user-item scores to predict the missing ones.
The assumption is that users get value from recommendations based on other users with similar tastes.
For this example we will use the ratings.dat file. This file follows the format:
user_id::movie_id::rating::timestamp
head ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
The MovieLens dataset provides us with a file that includes over 1 million movie ratings.
# '::' is a multi-character separator, so pandas needs the Python parsing engine
ratings_df = pd.read_table('ratings.dat', header=None, sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')
# we don't care about the time the rating was given
del ratings_df['timestamp']
# add movie_title next to movie_id for legibility
ratings_df = pd.merge(ratings_df, movies_df, on='movie_id')[['user_id', 'movie_title', 'movie_id','rating']]
ratings_df.head()
Out[]:
user_id | movie_title | movie_id | rating | |
---|---|---|---|---|
0 | 1 | One Flew Over the Cuckoo’s Nest (1975) | 1193 | 5 |
1 | 2 | One Flew Over the Cuckoo’s Nest (1975) | 1193 | 5 |
2 | 12 | One Flew Over the Cuckoo’s Nest (1975) | 1193 | 4 |
3 | 15 | One Flew Over the Cuckoo’s Nest (1975) | 1193 | 4 |
4 | 17 | One Flew Over the Cuckoo’s Nest (1975) | 1193 | 5 |
The dataset is a matrix of users and movie ratings, so we convert the ratings_df to a matrix with a user per row and a movie per column.
ratings_mtx_df = ratings_df.pivot_table(values='rating', index='user_id', columns='movie_title')
ratings_mtx_df.fillna(0, inplace=True)
movie_index = ratings_mtx_df.columns
ratings_mtx_df.head()
Out[]:
movie_title | $1,000,000 Duck (1971) | 'Night Mother (1986) | 'Til There Was You (1997) | ... |
---|---|---|---|---|
user_id | ||||
1 | 0 | 0 | 0 | ... |
2 | 0 | 0 | 0 | ... |
3 | 0 | 5 | 0 | ... |
4 | 0 | 0 | 1 | ... |
5 | 0 | 0 | 0 | ... |
We have a matrix of 6040 users and 3706 movies.
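The reshaping from a long list of ratings into a user-by-movie matrix is easier to follow on a small case. A minimal sketch with made-up ratings (toy_ratings and toy_matrix are illustrative names):

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    'user_id':     [1, 1, 2, 3, 3],
    'movie_title': ['Movie A', 'Movie B', 'Movie A', 'Movie B', 'Movie C'],
    'rating':      [5, 3, 4, 2, 5],
})

# Wide format: one row per user, one column per movie; unrated pairs become 0.
toy_matrix = toy_ratings.pivot_table(
    values='rating', index='user_id', columns='movie_title'
).fillna(0)

print(toy_matrix)
print(toy_matrix.shape)   # (number of users, number of movies)
```

Note that filling the gaps with 0 treats "not rated" the same as a rating of 0, which is a simplification that keeps the correlation step straightforward.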
To compute similarities between movies, one way is to find the correlation between movies and then use that correlation to find similar movies to those the users have liked.
An easy way of doing this in Python is to use the numpy.corrcoef function, which calculates the Pearson product-moment correlation coefficient (PMCC) between each item pair.
The PMCC has a value between -1 and 1 that measures the correlation (positive or negative) between two variables.
A correlation matrix is a matrix of m x m shape, where element Mij represents the correlation between item i and item j.
corr_matrix = np.corrcoef(ratings_mtx_df.T)
corr_matrix.shape
Out[]:
(3706, 3706)
Note: We use the transposed ratings matrix because np.corrcoef treats each row as a variable, so transposing makes it return the correlation between movies. If we used the ratings matrix without transposing it, np.corrcoef would return the correlation between users.
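The effect of transposing can be checked on a tiny made-up ratings array (4 users by 3 movies), where the shape of the result shows which axis is being correlated:

```python
import numpy as np

# Made-up ratings: 4 users (rows) x 3 movies (columns).
toy = np.array([
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
], dtype=float)

movie_corr = np.corrcoef(toy.T)   # rows of toy.T are movies -> 3 x 3
user_corr = np.corrcoef(toy)      # rows of toy are users    -> 4 x 4

print(movie_corr.shape, user_corr.shape)
print(np.round(movie_corr, 2))
```

In this toy data the first two movies are rated similarly (correlation 0.8), while the first and third are rated in exactly opposite ways (correlation -1).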
Now, if we want to find movies similar to a specific movie, it’s just a matter of returning those movies that have a high correlation coefficient with that one.
favoured_movie_title = 'Toy Story (1995)'
favoured_movie_index = list(movie_index).index(favoured_movie_title)
P = corr_matrix[favoured_movie_index]
# only return those movies with a high correlation with Toy Story
list(movie_index[(P>0.4) & (P<1.0)])
Out[]:
['Aladdin (1992)',
"Bug's Life, A (1998)",
'Groundhog Day (1993)',
'Lion King, The (1994)',
'Toy Story 2 (1999)']
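The 0.4 cut-off is a fixed threshold, so the number of results varies from movie to movie. An alternative, sketched here with a hypothetical helper top_k_similar and a made-up correlation matrix, is to return a fixed number of the most correlated titles:

```python
import numpy as np

def top_k_similar(corr, titles, title, k=2):
    """Return the k titles most correlated with `title`, excluding itself."""
    idx = titles.index(title)
    row = corr[idx].copy()
    row[idx] = -np.inf                  # never return the movie itself
    best = np.argsort(row)[::-1][:k]    # indices sorted by descending correlation
    return [titles[i] for i in best]

# Made-up 4 x 4 correlation matrix and titles.
toy_titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
toy_corr = np.array([
    [1.0, 0.6, 0.1, 0.4],
    [0.6, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])

print(top_k_similar(toy_corr, toy_titles, 'Movie A'))   # ['Movie B', 'Movie D']
```

Which of the two is preferable depends on the use case: a threshold guarantees a minimum similarity, while top-k guarantees a fixed list length.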
Now to provide recommendations to a user, we take the list of movies that user has rated. Then we sum the correlations of those movies with all the other ones and return a list of those movies sorted by their total correlation with the user.
def get_movie_similarity(movie_title):
    '''Returns correlation vector for a movie'''
    movie_idx = list(movie_index).index(movie_title)
    return corr_matrix[movie_idx]

def get_movie_recommendations(user_movies):
    '''given a set of movies, it returns all the movies sorted by their correlation with the user'''
    movie_similarities = np.zeros(corr_matrix.shape[0])
    for movie_title in user_movies:
        movie_similarities = movie_similarities + get_movie_similarity(movie_title)
    similarities_df = pd.DataFrame({
        'movie_title': movie_index,
        'sum_similarity': movie_similarities
    })
    similarities_df = similarities_df[~(similarities_df.movie_title.isin(user_movies))]
    similarities_df = similarities_df.sort_values(by=['sum_similarity'], ascending=False)
    return similarities_df
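The summing step can be traced by hand on a small case. The sketch below uses a made-up 4 x 4 correlation matrix and a user who has rated two of the four movies (all names here are illustrative):

```python
import numpy as np
import pandas as pd

toy_titles = pd.Index(['Movie A', 'Movie B', 'Movie C', 'Movie D'])
toy_corr = np.array([
    [1.0, 0.6, 0.1, 0.4],
    [0.6, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
])

seen = ['Movie A', 'Movie B']               # movies the user has rated
seen_idx = toy_titles.get_indexer(seen)     # positions of those movies
summed = toy_corr[seen_idx].sum(axis=0)     # add up their correlation rows

ranking = (
    pd.Series(summed, index=toy_titles)
    .drop(seen)                             # do not recommend movies already rated
    .sort_values(ascending=False)
)
print(ranking)
```

Movie D ends up first (0.4 + 0.3 = 0.7) and Movie C second (0.1 + 0.2 = 0.3). Note that this approach counts every rated movie equally, regardless of the rating the user gave it.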
For example, let’s select a user with a preference for kids’ movies and some action movies.
sample_user = 21
ratings_df[ratings_df.user_id==sample_user].sort_values(by=['rating'], ascending=False)
Out[]:
user_id | movie_title | movie_id | rating | |
---|---|---|---|---|
583304 | 21 | Titan A.E. (2000) | 3745 | 5 |
707307 | 21 | Princess Mononoke, The (Mononoke Hime) (1997) | 3000 | 5 |
70742 | 21 | Star Wars: Episode VI - Return of the Jedi (1983) | 1210 | 5 |
239644 | 21 | South Park: Bigger, Longer and Uncut (1999) | 2700 | 5 |
487530 | 21 | Mad Max Beyond Thunderdome (1985) | 3704 | 4 |
707652 | 21 | Little Nemo: Adventures in Slumberland (1992) | 2800 | 4 |
708015 | 21 | Stop! Or My Mom Will Shoot (1992) | 3268 | 3 |
706889 | 21 | Brady Bunch Movie, The (1995) | 585 | 3 |
623947 | 21 | Iron Giant, The (1999) | 2761 | 3 |
619784 | 21 | Wild Wild West (1999) | 2701 | 3 |
4211 | 21 | Bug's Life, A (1998) | 2355 | 3 |
368056 | 21 | Akira (1988) | 1274 | 3 |
226126 | 21 | Who Framed Roger Rabbit? (1988) | 2987 | 3 |
41633 | 21 | Toy Story (1995) | 1 | 3 |
34978 | 21 | Aladdin (1992) | 588 | 3 |
33432 | 21 | Antz (1998) | 2294 | 3 |
18917 | 21 | Bambi (1942) | 2018 | 1 |
612215 | 21 | Devil's Advocate, The (1997) | 1645 | 1 |
617656 | 21 | Prince of Egypt, The (1998) | 2394 | 1 |
440983 | 21 | Pinocchio (1940) | 596 | 1 |
707674 | 21 | Messenger: The Story of Joan of Arc, The (1999) | 3053 | 1 |
708194 | 21 | House Party 2 (1991) | 3774 | 1 |
Now we provide movie recommendations to the sample user by using their list of rated movies as an input.
sample_user_movies = ratings_df[ratings_df.user_id==sample_user].movie_title.tolist()
recommendations = get_movie_recommendations(sample_user_movies)
# We get the top 20 recommended movies
recommendations.movie_title.head(20)
Out[]:
1939    Lion King, The (1994)
324     Beauty and the Beast (1991)
1948    Little Mermaid, The (1989)
3055    Snow White and the Seven Dwarfs (1937)
647     Charlotte's Web (1973)
679     Cinderella (1950)
1002    Dumbo (1941)
301     Batman (1989)
3250    Sword in the Stone, The (1963)
303     Batman Returns (1992)
2252    Mulan (1998)
2924    Secret of NIMH, The (1982)
2808    Robin Hood (1973)
3026    Sleeping Beauty (1959)
1781    Jungle Book, The (1967)
260     Back to the Future Part III (1990)
259     Back to the Future Part II (1989)
2558    Peter Pan (1953)
2347    NeverEnding Story, The (1984)
97      Alice in Wonderland (1951)
Name: movie_title, dtype: object
So we see that the system recommends mostly kids’ movies and some action movies. Neat!
Collaborative filtering is a widely used approach to recommendation systems nowadays. It is capable of recommending new items without having to manually define them. Also, it is able to find recommendations based on hidden features that an expert wouldn’t be able to find (for example, combinations of genres or actors).
However, it has one major drawback. Collaborative filtering cannot recommend items to a new user until they have reviewed some items. This problem is called the Cold Start Issue.
One way recommender systems overcome this issue is by using a hybrid of Content and Collaborative Filtering. That is, using collaborative filtering as well as content filtering when necessary.
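As a rough illustration of the hybrid idea, the sketch below falls back to content scores for users with too few ratings and blends both scores otherwise. The function hybrid_scores, the min_ratings cut-off and the 50/50 weighting are all assumptions made for this example, not a standard recipe:

```python
import pandas as pd

def hybrid_scores(content_scores, collab_scores, n_rated, min_ratings=5):
    """Hypothetical blend: content-only for new users, a 50/50 mix otherwise."""
    if n_rated < min_ratings:
        return content_scores               # cold start: no usable rating history
    return 0.5 * content_scores + 0.5 * collab_scores

# Made-up scores, assumed to be already scaled to the same 0-1 range.
content = pd.Series({'Movie A': 0.9, 'Movie B': 0.2, 'Movie C': 0.5})
collab = pd.Series({'Movie A': 0.1, 'Movie B': 0.9, 'Movie C': 0.5})

new_user_pick = hybrid_scores(content, collab, n_rated=0).idxmax()
active_user_pick = hybrid_scores(content, collab, n_rated=20).idxmax()

print(new_user_pick)      # Movie A: content scores only
print(active_user_pick)   # Movie B: blended scores
```

In practice the two score types would need to be normalized to a comparable scale before blending, and the weighting would be tuned rather than fixed.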
Further reading
Here are a few interesting readings on Recommendation systems.