Movie Recommendation System

Published in Web Mining [IS688, Spring 2021]

Apr 28, 2021

To give some insight into how recommendation systems are designed and built from a coding perspective, I will demonstrate how a simple recommendation system works in Python. I will then show a second, slightly more complex recommender that provides better recommendations.

As you might have noticed while shopping online, we are often shown recommended products. If you buy a bed, the store will recommend that you buy a mattress along with it. Often, if you are shopping for a home appliance, pre-built packages with similar features are recommended. Such systems analyze data from historical purchases to identify things that are frequently bought together.

As the name indicates, a recommendation engine is a system that suggests items to a user, based on:

· his/her specific behavior

· similar behavior by other users

· an algorithm which predicts the user’s most likely action or behavior

Whether we like it or not, recommendation systems have become a part of life. Be it Netflix, where we get a list of recommended movies, or Instacart, which recommends grocery products, recommender systems exist everywhere on the Internet.

How do Recommender systems work?

Let’s try to figure out how these systems work. How are they designed? What goes on behind the scenes? At a very basic level, these systems rely on machine learning algorithms, which can be classified into two categories: content-based filtering and collaborative filtering.

Content-based filtering focuses on item similarity: it recommends items whose attributes are similar to those of items the user has liked, acted on, or given feedback on (surveys or ratings).

Collaborative filtering operates differently: it uses similarities between users and items simultaneously to provide recommendations. Essentially, the underlying models recommend an item to a user based on the interests of other, similar users.
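To make the idea of recommending from a similar user concrete, here is a minimal, hypothetical sketch of user-based collaborative filtering. The ratings, user names and helper functions are invented for illustration and are not part of the dataset or the code used in this article. It finds the user whose ratings are most similar and suggests the movies that user rated and the target user has not.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "alice": {"Heat": 5, "Alien": 4, "Up": 1},
    "bob": {"Heat": 5, "Alien": 5, "Blade Runner": 4},
    "carol": {"Up": 5, "Coco": 4, "Heat": 1},
}

def user_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def recommend_for(user):
    """Suggest movies rated by the most similar user but not yet by `user`."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: user_similarity(ratings[user], ratings[u]))
    return sorted(m for m in ratings[nearest] if m not in ratings[user])

print(recommend_for("alice"))
```

Real collaborative filtering systems usually combine many neighbors and weight their ratings. This sketch uses a single nearest neighbor only to keep the idea visible.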

A popular use of recommendation engines is on e-commerce platforms. Have you ever purchased an item from an online store and had the system suggest additional items you might also be interested in buying? If so, then you’ve encountered a purchase recommendation engine.

Recommendation Engine In Python: Data

A recommendation engine is only as efficient and intelligent as the data that is fed into it. In this article, I will build a movie recommendation system, starting with a chart of top movies in the style of the IMDB Top 250, using collected movie metadata. The following are the steps involved:

Decide on the metric or score to rate movies on.
Calculate the score for every movie.
Sort the movies based on the score and output the top results.

The dataset contains metadata for around 45k movies listed in MovieLens that were released on or before July 2017. The dataset has the following attributes: cast, crew, plot, budget, revenue, posters, release date, languages, production companies, countries, TMDB vote counts, and vote averages. I will be focusing on these attributes to train my machine learning model for content and collaborative filtering. The dataset contains the following files and can be accessed at: https://www.kaggle.com/rounakbanik/the-movies-dataset/data

movies_metadata.csv: Information on approx. 45k movies, including posters, backdrops, budget, genre, revenue, release dates, languages, production countries and companies.
keywords.csv: The movie plot keywords for our MovieLens movies.
credits.csv: The cast and crew for all the movies.
links.csv: The TMDB and IMDB IDs of all the movies present in the Full MovieLens dataset.
links_small.csv: The TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
ratings_small.csv: A subset of around 100,000 ratings from 700 users for 9,000 movies.

Loading Data

# Import Pandas
import pandas as pd

# Load the movies metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

# Print the first three rows of the dataset
metadata.head(3)

One of the most basic approaches is to rank movies by their average rating. But using ‘Rating’ as the only measure has a few drawbacks:

It doesn’t consider how popular the movie is. As such, a movie with a 9 rating from 10 voters will be considered ‘better’ than a movie with an 8.8 rating from 10,000 voters.
It also favors movies with a small number of voters and skewed and/or extremely high ratings.

Considering the above, we should use a weighted rating that takes into account both the average rating and the number of votes the movie has accumulated. This ensures that a movie with a 9 rating from 100k votes gets a higher score than a movie with the same rating from only a few hundred votes.

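The weighted rating (WR) used here, which matches the calculation in the code below, is:

WR = (v / (v + m)) × R + (m / (v + m)) × C

where:

‘v’ is the number of votes for the movie: vote_count

‘m’ is the minimum number of votes required to be listed: chosen here as the 90th percentile

‘R’ is the average rating of the movie: vote_average

‘C’ is the mean vote across the whole dataset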
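As a rough numeric check of how these four quantities interact, the sketch below applies the same expression as this article's weighted_rating function to the two hypothetical movies from the drawbacks above. The values m = 160 and C = 5.62 are the ones reported further down in this article. The standalone function signature is an assumption made for illustration, and a movie with only 10 votes would in practice be filtered out before scoring.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 160   # minimum votes required (90th percentile reported in this article)
C = 5.62  # mean vote across all movies (value reported in this article)

# Hypothetical movies from the example: 9.0 from 10 voters vs 8.8 from 10,000 voters
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=10_000, R=8.8, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))
```

With few votes the score is pulled strongly toward the global mean C, while a heavily voted movie keeps a score close to its own average, so the 8.8-rated movie ends up ranked higher.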

Building a simple Recommender

Step 1: Calculate the value of ‘C’, which is the mean rating across all movies

# Calculate the mean of the vote_average column
C = metadata['vote_average'].mean()
print(C)

From this output, we can see that the average movie rating in this dataset is around 5.62 on a scale of 10.

Step 2: Calculate ‘m’, the minimum number of votes required, taken as the 90th percentile of the vote counts.

# Calculate the minimum number of votes required to be in the chart
m = metadata['vote_count'].quantile(0.90)
print(m)

Step 3: Keeping only the movies which have at least 160 votes

# Filter the qualified movies into a new DataFrame
# 'm' is 160, as calculated above
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

Step 4: Calculating the weighted average

# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# Create 'score' and calculate its value with weighted_rating()
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
# Sort movies based on the score calculated above
q_movies = q_movies.sort_values('score', ascending=False)
# Print the top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

The above chart has a number of movies which are also present in the IMDB Top 250 chart. We all know how amazing the movies “The Shawshank Redemption” and “The Godfather” are. “Dilwale Dulhania Le Jayenge” (DDLJ) is an all-time favorite Indian movie.

Content-Based Recommender — based on Credits, Genres and Keywords

The quality of our recommender can be greatly increased by using better metadata and capturing finer details. In this part of the article, we will build a recommender system based on the top 3 actors, the director, the genres and the movie plot keywords.

Step 1: Since the keywords, cast, and crew data are not available in our current dataset, I am loading this data and merging it into my main DataFrame, ‘metadata’.

# Load keywords and credits data
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into the main metadata DataFrame
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

Step 2: From the new features, cast, crew, and keywords, I am extracting the three most important actors, the director and the keywords associated with that movie.

# Parse string features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

Step 3: Get the director’s name from the crew feature. If the director is not listed, return NaN

# Import NumPy
import numpy as np

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

Step 4: Writing a function which returns the top 3 elements, or the entire list if it has fewer than 3. Here the list refers to the cast, keywords, and genres.
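The get_list helper applied in the code for this step is not shown in this article. A minimal sketch of what it might look like is given here. It assumes each parsed feature is a list of dictionaries with a 'name' key (the same shape that get_director reads), and the sample data is hypothetical.

```python
def get_list(x):
    """Return up to the first three 'name' values from a list of dicts."""
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # Slicing keeps the first three entries, or the whole list if it is shorter
        return names[:3]
    # Missing or malformed data: fall back to an empty list
    return []

# Hypothetical parsed 'cast' entry with four members
sample_cast = [{'name': 'A'}, {'name': 'B'}, {'name': 'C'}, {'name': 'D'}]
print(get_list(sample_cast))
print(get_list(float('nan')))
```

Returning an empty list for malformed rows keeps the later cleaning and joining steps from failing on missing values.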

# Define new director, cast, genres and keywords features
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

# Print the new features of the first 3 films
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Step 5: Preprocessing the data by removing spaces and converting everything to lowercase. The spaces between words are removed so that the vectorizer doesn’t treat the “James” in “James Maddisson” and the “James” in “James Vardy” as the same token. After this processing step, these names are represented as “jamesmaddisson” and “jamesvardy”, which are distinct to our vectorizer.

# Function to convert strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

# Apply clean_data function to our features
features = ['cast', 'keywords', 'director', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

Step 6: Final preprocessing step: combining the features to feed to the vectorizer (namely actors, director and keywords) into a single ‘soup’ string
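The create_soup function used in this step is not shown in this article. One plausible sketch is a function that joins the cleaned features into a single space-separated string. Including genres in the soup is an assumption based on the feature list cleaned in Step 5, and the sample row is hypothetical.

```python
def create_soup(x):
    """Join keywords, cast, director and genres into one space-separated string."""
    return (' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' '
            + x['director'] + ' ' + ' '.join(x['genres']))

# Hypothetical cleaned row, shaped like the output of clean_data
row = {
    'keywords': ['dccomics', 'vigilante'],
    'cast': ['christianbale', 'michaelcaine'],
    'director': 'christophernolan',
    'genres': ['action', 'crime'],
}
print(create_soup(row))
```

Because the function only indexes by column name, it works the same way on a plain dictionary and on a DataFrame row passed in by apply with axis=1.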

# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)
metadata[['soup']].head(2)

Step 7: We will use CountVectorizer() rather than a TF-IDF vectorizer because we don’t want to down-weight actors or directors just because they have acted in or directed relatively more movies.

# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])
count_matrix.shape

The vocabulary built from the metadata soup contains 73,881 distinct terms.

Step 8: Using cosine similarity to measure how similar the count vectors of two movies are

# Compute the cosine similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

# Reset index of main DataFrame and construct reverse mapping
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])
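To make the similarity score more concrete, here is a small hand-rolled sketch of cosine similarity on three hypothetical count vectors. The vocabulary and vectors are invented for illustration. The article itself uses scikit-learn's cosine_similarity, which computes the same quantity for every pair of rows at once.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over the vocabulary
# ['action', 'christophernolan', 'comedy', 'crime']
movie_a = [1, 1, 0, 1]
movie_b = [1, 1, 0, 0]
movie_c = [0, 0, 1, 0]

print(cosine(movie_a, movie_b))  # shares two terms with movie_a
print(cosine(movie_a, movie_c))  # shares no terms with movie_a
```

Movies that share more terms in their soup get a score closer to 1, and movies with no terms in common get a score of 0.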

Step 9: Getting recommendations
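The calls below rely on a get_recommendations helper that is not shown in this article. A minimal sketch of what such a helper might look like follows. It assumes the `indices` Series and `metadata` DataFrame built in Step 8. The toy `metadata`, `indices` and `cosine_sim2` values and the `top_n` parameter are hypothetical stand-ins so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the article's `metadata`, `indices` and `cosine_sim2`
metadata = pd.DataFrame({'title': ['A', 'B', 'C', 'D']})
indices = pd.Series(metadata.index, index=metadata['title'])
cosine_sim2 = np.array([
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def get_recommendations(title, cosine_sim, top_n=10):
    # Row position of the movie that matches the title
    idx = indices[title]
    # Pair every movie position with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar movies first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the movie itself, then keep the top_n most similar
    sim_scores = [s for s in sim_scores if s[0] != idx][:top_n]
    movie_indices = [i[0] for i in sim_scores]
    return metadata['title'].iloc[movie_indices]

print(get_recommendations('A', cosine_sim2).tolist())
```

This sketch assumes titles are unique. If the same title appears more than once, `indices[title]` returns several positions and the lookup would need extra handling.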

get_recommendations('The Dark Knight Rises', cosine_sim2)

Movies similar to “The Dark Knight Rises”

get_recommendations('The Godfather', cosine_sim2)

Conclusion

Great! We observe that the recommender system we just built captures more information thanks to the additional metadata and has provided us with better recommendations.

Obviously, there are many other ways of experimenting with this system to improve the recommendations.

We can introduce a popularity filter
Use information such as ‘Other crew members’
Use Box Office collections as a measure of how popular or successful a movie was at the time of its release
Overall collections to date, and many more.

References:

The Movies Dataset (www.kaggle.com). Metadata on over 45,000 movies. 26 million ratings from over 270,000 users. https://www.kaggle.com/rounakbanik/the-movies-dataset/data

Cosine similarity - Wikipedia (en.wikipedia.org)

Recommender system - Wikipedia. https://en.wikipedia.org/wiki/Recommender_system