Adventures in Data

Sunday, November 6, 2016

Predicting Movie Quality

Netflix is pretty good at determining what types of movies I want to watch, but it is not always good at recommending movies that are actually good. In this analysis, I build a predictive model that classifies movies as good or not good, based on features of those movies.

The Data:

Data was collected for 501 movies from IMDB. Of these, 250 were taken from the 250 top-rated movies on IMDB, and the remaining 251 were taken from the IMDB bottom 100 (the 100 lowest-rated movies) and from a Wikipedia list of box office flops. These additional 251 movies served as a reference point against which I could compare the most highly rated movies.

Using IMDBpie, data on the IMDB rating, genres, runtime, MPAA rating, and release year were collected for each movie. Many movies were classified under more than one genre, so an additional variable was created for the number of genres that a movie belonged to. The modal number of genres was 3.

Although the distribution of movie ratings was skewed heavily to the left, the full range of ratings was present in this dataset without any significant gaps between the lowest-rated and highest-rated movies. Movies with ratings higher than 8 were classified as “good” for the purpose of this model.

A wide range of release years was included in this dataset. Release year ranged from 1921 to 2016, with a median year of 1998.

The most common MPAA rating in this dataset was R, with 160 movies. Some movies had archaic MPAA ratings, or foreign ratings. As these only accounted for a small portion of the data, non-standard ratings were ignored.

Of the 19 genres in the dataset, Drama was the most common, with 248 movies. The least common was Music, with 9 movies.

Modeling:

A random forest model was used to predict movie quality. The model used 26 variables to predict quality: year, runtime and number of genres, as well as variables for MPAA rating and dummy variables for each of the 19 genres used to classify these movies.
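As a rough illustration of how such a feature table could be assembled, here is a minimal sketch in pandas. The three movies and all column names are hypothetical stand-ins, not the actual scraped data, and the MPAA rating variables are left out for brevity.

```python
import pandas as pd

# Hypothetical records standing in for the scraped IMDB data.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [1994, 2003, 1972],
    "runtime": [142, 95, 175],
    "rating": [9.3, 3.1, 9.2],
    "genres": [["Crime", "Drama"], ["Comedy"], ["Crime", "Drama", "Thriller"]],
})

# Number of genres each movie belongs to.
movies["n_genres"] = movies["genres"].apply(len)

# One 0/1 dummy column per genre (columns come back in alphabetical order).
genre_dummies = movies["genres"].str.join("|").str.get_dummies(sep="|")

# Target: a rating above 8 counts as "good".
movies["good"] = (movies["rating"] > 8).astype(int)

# Numeric features plus genre dummies form the model input.
features = pd.concat([movies[["year", "runtime", "n_genres"]], genre_dummies], axis=1)
```

With the real data, the same steps would yield one dummy column for each of the 19 genres.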

A grid search was used to determine the best hyperparameters for this model. Iterating over a range of max_features, max_depth, and n_estimators, the following model was returned:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
         max_depth=None, max_features=0.25, max_leaf_nodes=None,
         min_samples_leaf=1, min_samples_split=2,
         min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
         oob_score=False, random_state=None, verbose=0,
         warm_start=False)
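A grid search of this kind might look roughly like the sketch below. The synthetic data, the grid values, and the 3-fold cross-validation are all assumptions made for illustration; they are not the settings used for the model above. The imports assume a scikit-learn version that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 200 rows, 6 features, binary target.
rng = np.random.RandomState(0)
X = rng.rand(200, 6)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical grid over the three hyperparameters named in the text.
param_grid = {
    "max_features": [0.25, 0.5],
    "max_depth": [None, 5],
    "n_estimators": [10, 50],
}

# Fit one forest per combination and keep the best by cross-validated accuracy.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)  # mean accuracy on held-out data
```

Printing search.best_estimator_ produces a summary of the chosen hyperparameters similar to the one shown above.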

Here is a sample tree (one of 100):

When applied to the test data, this model had a score of 0.73 (base probability was 0.5 for this dataset). The model was slightly better at predicting whether a movie would be not good than it was at predicting whether a movie would be good (precision = 0.76 and 0.70, respectively).
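For readers unfamiliar with per-class precision, the sketch below shows how it is computed: for each predicted label, the share of those predictions that were correct. The labels are made up for illustration and are not the model's actual predictions.

```python
def precision_by_class(y_true, y_pred):
    """Return {label: precision}, where precision is the number of correct
    predictions of a label divided by all predictions of that label."""
    precisions = {}
    for label in sorted(set(y_pred)):
        # True labels of every item the model assigned to this class.
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        correct = sum(1 for t in predicted if t == label)
        precisions[label] = float(correct) / len(predicted)
    return precisions

# Hypothetical labels: 1 = good, 0 = not good.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

result = precision_by_class(y_true, y_pred)
```

In this toy case, 3 of the 4 "good" predictions and 5 of the 6 "not good" predictions are correct, so precision is higher for the "not good" class, the same direction as the result reported above.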

The most important features in this model were movie length, year, and whether the movie was a drama. The remaining features had much less impact on the model.

Future Directions:

Future analyses should consider plot elements, as well as cast and directors. Because there are so many directors and actors, a larger dataset would be needed to detect the effects of these factors.

In addition to its use in recommending movies to viewers, this model could also be used to inform movie production.

For the purpose of movie production, a regression model may also be useful. While it is important to be able to predict whether a movie will have a rating of 8 or higher, being able to predict the actual rating would also be useful. For example, it may be worth producing a movie with a predicted rating of 7 if increasing the predicted rating to 8 would require increasing the budget by an unacceptable amount.

Monday, October 10, 2016

Top Hits of 2000

The Billboard Hot 100 is a chart used in the music industry to track the most popular songs. Each week, the list of 100 top singles is compiled based on records purchased and number of plays on the radio. Having a song on this list is considered to be very prestigious for artists and record labels. This prestige increases the higher the song's position on the list and the longer the song stays on the list in subsequent weeks.

The data used here is all of the songs on the Billboard Hot 100 in the year 2000. I was chiefly concerned with four things:

1) The songs that reached #1.

2) The number of weeks that songs stayed on the Hot 100.

3) The average position on the billboard that each song held.

4) Any factors that were associated with a song’s popularity, as measured by items (2) and (3).

All analyses were performed in Python 2.7. I used the NumPy, pandas, SciPy, and Matplotlib libraries. Advanced visuals were created using Tableau.

First impressions:

In 2000, there were 317 songs by 228 artists on the Billboard Hot 100. Jay-Z had five songs on the Billboard, the most of any artist that year. There were two songs on the Hot 100 named “Where I wanna be”; they were unrelated, and by two different artists (Donell Jones and Shade Sheist). There were 137 rock songs, 74 country songs, 58 rap songs, 23 R&B songs, 9 pop songs, 9 Latin songs, 4 electronica songs, and 1 each for gospel, jazz and reggae.

For the remaining analyses, I made use of the melt and pivot_table functions in pandas. The data was originally presented with one row per song, and, in addition to basic information about the song (genre, length, artist, etc.), one column for each of the possible weeks a song could be on the billboard (weeks 1-76). The melt function was used to combine the 76 weekly columns into one more manageable column.

# Collapse the weekly rank columns into a single 'Billboard Week' / 'Rank' pair per row.
billboardMelt = pd.melt(billboard,
    id_vars=['artist.inverted', 'track', 'time',
             'genre', 'week1'],
    value_vars=['x1st.week', 'x2nd.week', 'x3rd.week', 'x4th.week',
                'x5th.week', 'x6th.week', 'x7th.week', 'x8th.week',
                'x9th.week', 'x10th.week', 'x11th.week', 'x12th.week',
                'x13th.week', 'x14th.week', 'x15th.week', 'x16th.week',
                'x17th.week', 'x18th.week', 'x19th.week', 'x20th.week',
                'x21st.week', 'x22nd.week', 'x23rd.week', 'x24th.week',
                'x25th.week', 'x26th.week', 'x27th.week', 'x28th.week',
                'x29th.week', 'x30th.week', 'x31st.week', 'x32nd.week',
                'x33rd.week', 'x34th.week', 'x35th.week', 'x36th.week',
                'x37th.week', 'x38th.week', 'x39th.week', 'x40th.week',
                'x41st.week', 'x42nd.week', 'x43rd.week', 'x44th.week',
                'x45th.week', 'x46th.week', 'x47th.week', 'x48th.week',
                'x49th.week', 'x50th.week', 'x51st.week', 'x52nd.week',
                'x53rd.week', 'x54th.week', 'x55th.week', 'x56th.week',
                'x57th.week', 'x58th.week', 'x59th.week', 'x60th.week',
                'x61st.week', 'x62nd.week', 'x63rd.week', 'x64th.week',
                'x65th.week'],
    var_name='Billboard Week', value_name='Rank')

From here, the missing values were dropped using the dropna function. The pivot_table function was then used to summarize all of the information for each song. E.g.:

number1pivot = pd.pivot_table(number1,
    index=['artist.inverted', 'track'],
    values=['Rank'],
    aggfunc=len)
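To make the melt, dropna, and pivot_table sequence concrete, here is a small self-contained sketch on a toy table. The two songs and the column names are hypothetical; only the sequence of pandas calls mirrors the steps described above.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per song, one column per chart week.
wide = pd.DataFrame({
    "artist": ["Artist A", "Artist B"],
    "track": ["Song 1", "Song 2"],
    "week1": [10, 50],
    "week2": [5, 60],
    "week3": [1, np.nan],  # Song 2 left the chart after two weeks
})

# Wide -> long: one row per (song, week), then drop weeks with no rank.
tidy = pd.melt(wide, id_vars=["artist", "track"],
               value_vars=["week1", "week2", "week3"],
               var_name="Billboard Week", value_name="Rank")
tidy = tidy.dropna(subset=["Rank"])

# Summaries per song: weeks on the chart and average position.
weeks_on_chart = pd.pivot_table(tidy, index=["artist", "track"],
                                values="Rank", aggfunc="count")
avg_position = pd.pivot_table(tidy, index=["artist", "track"],
                              values="Rank", aggfunc="mean")
```

Dropping the missing ranks before pivoting matters for the weeks count: after dropna, each remaining row is a week the song was actually on the chart.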

1) During the year 2000, 17 songs held the #1 position on the Hot 100.

The songs that held that position for the longest were:

Destiny’s Child “Independent Women Part 1” — #1 for 11 weeks

Santana “Maria, Maria” — #1 for 10 weeks

Savage Garden “I Knew I Loved You” — #1 for 4 weeks

2) The average number of weeks spent on the Hot 100 was 16.7. The songs that spent the longest on the chart were:

Creed “Higher” at 57 weeks

Lonestar “Amazed” at 55 weeks

Faith Hill “Breathe” at 53 weeks

3 Doors Down “Kryptonite” at 53 weeks

Creed “With Arms Wide Open” at 47 weeks

3) The songs with the highest average position were:

Santana “Maria, Maria,” which averaged #10.5 across 26 weeks

Madonna “Music,” which averaged #13.5 across 24 weeks

N’Sync “Bye Bye Bye,” which averaged #14.3 across 21 weeks

Destiny’s Child “Independent Women Part 1,” which averaged #14.8 across 28 weeks

4) The dataset was very clean—only one column required cleaning. Song length was presented as a string in a human-readable format (M:SS), which needed to be converted into a machine-readable format (int). I wrote a simple function to fix this:

def timeSplit(string):
    # Convert an 'M:SS' string into a total number of seconds (int).
    temp = string.split(":")
    minutes = int(temp[0])
    seconds = int(temp[1])
    return minutes*60 + seconds

Unsurprisingly, my two aggregated measures of song popularity (average position on the chart and number of weeks spent on the chart) correlated significantly with one another (r = -0.77). (The correlation is negative because a better position on the chart is a lower number.)

People’s attention spans being what they are, I suspected that song length would have a negative effect on popularity. This turned out not to be true. Song length predicted neither average position on the chart, nor length of time on the chart (r = 0.026 and r = -0.024, respectively). There may still be an association between song length and whether or not a song made it on the chart at all, but that data is not in front of me right now.

Week 1 rank predicted length of time spent on the chart (r² = 0.048), average position on the chart (r² = 0.27), and the highest position that the song would reach (r² = 0.26).
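For a single predictor, r² from a least-squares line equals the square of Pearson's r, so (assuming these values come from simple linear fits) r² = 0.27 corresponds to a correlation of roughly 0.52 in absolute value. The sketch below checks that identity with NumPy on made-up numbers; the arrays are hypothetical and are not taken from the Billboard data.

```python
import numpy as np

# Hypothetical values: rank in the first week and average rank over the run.
week1_rank = np.array([80, 60, 95, 40, 70, 85], dtype=float)
avg_position = np.array([55, 30, 90, 20, 60, 50], dtype=float)

# Pearson correlation coefficient.
r = np.corrcoef(week1_rank, avg_position)[0, 1]

# Least-squares line and the share of variance it explains.
slope, intercept = np.polyfit(week1_rank, avg_position, 1)
predicted = slope * week1_rank + intercept
ss_res = np.sum((avg_position - predicted) ** 2)
ss_tot = np.sum((avg_position - avg_position.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

Here r_squared and r ** 2 agree up to floating-point error.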

Genre is an interesting nut to crack here. Different genres are not represented equally on the chart, but it is unclear why. Here we see that, although there are far more rock songs than songs from other genres, songs from different genres seem to stay on the chart for approximately the same amount of time. However, the low sample size for many genres makes it difficult to draw conclusions.

The size of the bubble indicates the average number of weeks that songs were on the chart, and the color indicates the number of songs within the genres.

The fact that rock music accounts for 43% of the songs on the chart in 2000 could be interpreted to mean that rock music is more popular than the other genres, or that rock musicians are better at creating hits than musicians in other genres. It could also mean, however, that more rock songs are produced, and therefore more good songs are produced by virtue of volume.

With the proper data, it would be possible to determine the chance of any given song getting on the Billboard Hot 100, and whether this chance differs between genres. The following outline is the method I would use to solve this problem:
1. Find the number of singles recorded in a given 10-year span, and their genres.
2. Find the number of tracks that were on the Billboard Hot 100 in the same 10-year span.
3. Calculate the probability of any given song being on the Billboard (number of songs on the Billboard divided by the total number of songs recorded).
4. Calculate the expected rate of each major genre being on the Billboard (the percent of the total share of songs that belong to each genre).
5. Calculate the actual rate of each genre being represented on the Billboard (the number of songs within a genre on the Billboard, divided by the total number of songs recorded in that genre).
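The arithmetic in the outline above can be sketched as follows. All of the counts are invented for illustration, and the relative_rate column (each genre's rate divided by the overall rate) is my own addition as one possible way to compare genres.

```python
import pandas as pd

# Hypothetical counts of songs recorded and songs that charted, per genre.
counts = pd.DataFrame({
    "genre": ["Rock", "Country", "Rap"],
    "recorded": [5000, 3000, 2000],
    "charted": [140, 70, 60],
}).set_index("genre")

# Step 3: probability that any given song charts.
overall_rate = counts["charted"].sum() / float(counts["recorded"].sum())

# Step 4: each genre's share of all recorded songs.
expected_share = counts["recorded"] / float(counts["recorded"].sum())

# Step 5: probability that a song charts, within each genre.
actual_rate = counts["charted"] / counts["recorded"].astype(float)

# Values above 1 mean the genre charts more often than songs overall.
counts["relative_rate"] = actual_rate / overall_rate
```

In this made-up example, rock has the most charting songs but rap has the highest charting rate, which is exactly the distinction between volume and hit rate raised above.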

Sunday, October 2, 2016

Musings on SAT Scores

The Scholastic Aptitude Test (SAT) is a test taken by many high school students in the USA for the purpose of college admissions. The SAT is one of two tests used for this purpose. The other is the ACT (not an acronym). Traditionally, the SAT was divided into two sections (verbal and math), which were scored separately. Possible scores range from 200 up to 800 for each section. In a given year, scores are comparable to one another, but test scores across years may not be, as they may be scored and calibrated differently. In 2005, a writing section was added, but this discussion will only pertain to the math and verbal portions. Both the SAT and ACT are generally accepted by colleges, and states vary widely in the proportion of students that take one test versus the other. The east and west coast states tend to favor the SAT, while the midwest states favor the ACT. Scores on the SAT correlate highly with IQ (and are therefore very relevant to college admission), but also depend strongly on education and English literacy (Deary, 2001).

Here is a heat map showing the preference for the SAT across the lower 48 states:

Heat map showing the percent of students taking the SAT across the lower 48 states. Image prepared using Tableau.

Across the 50 states and Washington, DC, the mean rate at which students take the SAT is 37% (median is 33%). This distribution is strongly bimodal, with one mode at 50-60% (including Washington, Oregon and Florida) and the other mode at 4-10% (including South Dakota, Mississippi and Iowa).

Unsurprisingly, math and verbal SAT scores correlate strongly and positively (r = 0.90, p < 0.001). Ohio is a notable outlier with average verbal scores (534), but the lowest math score in the data set (439). Across states, the mean math score is 532 (median is 525), and the mean verbal score is 533 (median is 527). However, the distribution of verbal scores is also quite strongly bimodal, with one mode at approximately 500, and the other mode at approximately 560.

Both math and verbal SAT scores correlate strongly and negatively with the percent of high school students taking the test (r = -0.77 and r = -0.89, respectively; p < 0.001). This relationship is particularly clear when comparing the SAT rate to the math scores. The data points are clustered into two groups (high rate with low scores, and low rate with high scores):

Scatterplot showing average state SAT math scores, and the percent of students taking the SAT. Image prepared using Matplotlib in Python 2.7.

A plausible explanation for this pattern is that it is common in many east coast high schools for students to be encouraged to take the SAT even if they do not intend to apply to college. Thus, students take the SAT at a rate of 82% in Connecticut, and the rest of the New England states are not far behind. Conversely, it may be only the best students in ACT-dominant states who take the SAT. This is particularly plausible, as some scholars believe the SAT to be a more rigorous test (e.g. Kanazawa, 2008), and students wishing to get into the most prestigious schools may choose to take the SAT to better showcase their abilities.

References:

Deary, I.J. 2001. Intelligence: A Very Short Introduction. Oxford University Press, New York.

Kanazawa, S. 2008. IQ and the health of states. Biodemography and Social Biology, 54(2): 200-213.