Netflix is pretty good at determining what types of movies I want to watch, but it is not always good at recommending movies that are actually good. In this analysis, I build a predictive model for determining whether a movie is good, based on features of that movie.
The Data:
Data were collected for 501 movies from IMDB. 250 movies were taken from the IMDB list of 250 top-rated movies, and the remaining 251 were taken from the IMDB bottom 100 (the 100 lowest-rated movies) and from a Wikipedia list of box office flops. These additional 251 movies served as a reference point against which I could compare the most highly rated movies.
Using IMDBpie, data on the IMDB rating, genres, runtime, MPAA rating, and year were collected for each movie. Many movies were classified under more than one genre, so an additional variable was created for the number of genres that a movie belonged to. The most common number of genres was 3.
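As a rough illustration of the genre-count variable, here is a minimal sketch. The titles and genre lists are invented placeholders, not rows from the actual dataset, and the variable names are hypothetical.

```python
from statistics import mode

# Hypothetical movies and genre lists, for illustration only.
genres_by_movie = {
    "Movie A": ["Crime", "Drama", "Thriller"],
    "Movie B": ["Comedy"],
    "Movie C": ["Action", "Adventure", "Sci-Fi"],
    "Movie D": ["Drama", "Romance"],
}

# Number of genres each movie belongs to.
n_genres = {title: len(genres) for title, genres in genres_by_movie.items()}

# Most common genre count across the movies.
most_common_count = mode(n_genres.values())
print(n_genres)
print("most common number of genres:", most_common_count)
```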
Although the distribution of movie ratings was skewed heavily to the left, the full range of ratings was present in this dataset without any significant gaps between the lowest-rated and highest-rated movies. Movies with ratings higher than 8 were classified as “good” for the purpose of this model.
A wide range of release years was included in this dataset. Release years ranged from 1921 to 2016, with a median of 1998.
The most common MPAA rating in this dataset was R, with 160 movies. Some movies had archaic MPAA ratings, or foreign ratings. As these only accounted for a small portion of the data, non-standard ratings were ignored.
Of the 19 genres in the dataset, Drama was the most common, with 248 movies. The least common was Music, with 9 movies.
Modeling:
A random forest model was used to predict movie quality. The model used 26 variables to predict quality: year, runtime and number of genres, as well as variables for MPAA rating and dummy variables for each of the 19 genres used to classify these movies.
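To make the feature layout concrete, here is a small sketch of how such a feature matrix and label could be assembled with pandas. The four rows are made up, the column names are hypothetical, and encoding MPAA rating as dummy columns is an assumption, since the post does not say how that variable was encoded.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the real dataset.
movies = pd.DataFrame({
    "rating": [9.2, 8.0, 2.1, 8.5],
    "year": [1972, 1999, 2008, 1994],
    "runtime": [175, 120, 90, 142],
    "mpaa": ["R", "PG-13", "PG", "R"],
    "genres": [["Crime", "Drama"], ["Comedy"], ["Comedy", "Horror"], ["Drama"]],
})

# Count of genres per movie.
movies["n_genres"] = movies["genres"].apply(len)

# One 0/1 dummy column per genre.
genre_dummies = movies["genres"].str.join("|").str.get_dummies(sep="|")

# Assumption: MPAA rating is one-hot encoded as well.
mpaa_dummies = pd.get_dummies(movies["mpaa"], prefix="mpaa")

X = pd.concat(
    [movies[["year", "runtime", "n_genres"]], mpaa_dummies, genre_dummies],
    axis=1,
)

# "Good" means a rating strictly higher than 8, so a rating of exactly 8.0 is not good.
y = (movies["rating"] > 8).astype(int)
print(X.columns.tolist())
print(y.tolist())
```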
A grid search was used to determine the best hyperparameters for this model. Iterating over a range of max_features, max_depth, and n_estimators, the following model was returned:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.25, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
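A grid search of this kind can be sketched with scikit-learn's GridSearchCV. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification as a stand-in for the 26 movie features), and the grid values and the 3-fold cross-validation are guesses, since the ranges actually searched are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the movie data: 26 features, two roughly balanced classes.
X, y = make_classification(
    n_samples=300, n_features=26, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Assumed grid values; the original search ranges are unknown.
param_grid = {
    "max_features": [0.25, 0.5],
    "max_depth": [None, 5],
    "n_estimators": [25, 50],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default.
best_model = search.best_estimator_
print(search.best_params_)
print("test accuracy:", best_model.score(X_test, y_test))
```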
[Figure: a sample tree from the forest (one of 100)]
When applied to the test data, this model had a score of 0.73, against a baseline of 0.5 for this dataset. The model was slightly more precise when predicting that a movie was not good than when predicting that it was good (precision = 0.76 and 0.70, respectively).
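Per-class precision of this kind can be computed with scikit-learn's precision_score by switching pos_label. The labels below are invented to show the mechanics and do not reproduce the 0.76 and 0.70 reported above.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: 1 = good, 0 = not good.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# Precision for "good": of the movies predicted good, how many really were.
precision_good = precision_score(y_true, y_pred, pos_label=1)

# Precision for "not good": of the movies predicted not good, how many really were.
precision_not_good = precision_score(y_true, y_pred, pos_label=0)

print(accuracy, precision_good, precision_not_good)
```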
The most important features in this model were runtime, year, and whether the movie was a drama. The remaining features had much less impact on the model.
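Feature rankings like this typically come from the fitted forest's feature_importances_ attribute. The sketch below uses synthetic data in which only the first column drives the label, so that column is expected to rank first; the feature names are borrowed from this post purely as labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["runtime", "year", "is_drama", "n_genres"]  # labels only

# Synthetic features; the label depends only on the first column.
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor continuous or high-cardinality features, so rankings of this kind are best read as a rough guide.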
Future Directions:
Future analyses should consider plot elements, as well as cast and directors. Because there are so many directors and actors, a larger dataset would be needed to detect the effects of these factors.
In addition to its use in recommending movies to viewers, this model could also be used to inform movie production.
For the purpose of movie production, a regression model may also be useful. While it is important to be able to predict whether a movie will have a rating of 8 or higher, being able to predict the actual rating would also be useful. For example, it may be worth producing a movie with a predicted rating of 7 if increasing the predicted rating to 8 would require increasing the budget by an unacceptable amount.
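As a rough sketch of that regression variant, a RandomForestRegressor can be fit to the numeric rating instead of the good/not-good label. The data here is synthetic and the relationship between features and rating is invented, so the resulting score says nothing about how well real ratings could be predicted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and an invented rating that depends on the first feature.
X = rng.uniform(size=(400, 3))
rating = 1 + 8 * X[:, 0] + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, rating, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Predicted ratings are continuous, so a 7 can be told apart from an 8.
predicted = reg.predict(X_test)
r2 = reg.score(X_test, y_test)  # coefficient of determination on held-out data
print(predicted[:5])
print("R^2:", r2)
```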












