Adventures in Data: Top Hits of 2000

The Billboard Hot 100 is a chart used in the music industry to track the most popular songs. Each week, the list of 100 top singles is compiled based on records purchased and number of plays on the radio. Having a song on this list is considered very prestigious for artists and record labels. The prestige grows the higher a song climbs on the list and the longer it stays there.

The data used here is all of the songs on the Billboard Hot 100 in the year 2000. I was chiefly concerned with four things:

1) The songs that reached #1.

2) The number of weeks that songs stayed on the Hot 100.

3) The average position on the billboard that each song held.

4) Any factors that were associated with a song’s popularity, as measured by items (2) and (3).

All analyses were performed in Python 2.7, using the NumPy, pandas, SciPy, and Matplotlib libraries. Advanced visuals were created using Tableau.

First impressions:

In 2000, there were 317 songs by 228 artists on the Billboard Hot 100. Jay-Z had five songs on the chart, the most of any artist that year. There were two songs on the Hot 100 named “Where I Wanna Be”; they were unrelated, and by two different artists (Donell Jones and Shade Sheist). There were 137 rock songs, 74 country songs, 58 rap songs, 23 R&B songs, 9 pop songs, 9 Latin songs, 4 electronica songs, and 1 each for gospel, jazz, and reggae.
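Counts like these can be read straight off the one-row-per-song table. The following is a minimal sketch rather than the code behind the numbers above: it assumes the column names `artist.inverted`, `track`, and `genre` that appear later in this post, and the four rows are made-up placeholders, not the real chart.

```python
import pandas as pd

# Toy stand-in for the Billboard table (one row per song).
# These rows are illustrative only and do not come from the real data.
billboard = pd.DataFrame({
    "artist.inverted": ["Artist A", "Artist A", "Artist B", "Artist C"],
    "track": ["Song 1", "Song 2", "Song 3", "Song 3"],
    "genre": ["Rock", "Rock", "Country", "Rap"],
})

n_songs = len(billboard)                             # one row per song
n_artists = billboard["artist.inverted"].nunique()   # distinct artists
songs_per_artist = billboard["artist.inverted"].value_counts()
songs_per_genre = billboard["genre"].value_counts()

# Track titles that are shared by more than one song
title_counts = billboard["track"].value_counts()
shared_titles = title_counts[title_counts > 1].index.tolist()

print(n_songs, n_artists)
print(songs_per_artist.idxmax())   # artist with the most songs
print(shared_titles)
```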

For the remaining analyses, I made use of the melt and pivot_table functions in pandas. The data was originally presented with one row per song. In addition to basic information about the song (genre, length, artist, etc.), it had one column for each of the possible weeks a song could be on the chart (weeks 1-76). The melt function was used to combine the weekly columns into a single, more manageable column. (The call below passes only the columns for weeks 1 to 65.)

billboardMelt = pd.melt(
    billboard,
    id_vars=['artist.inverted', 'track', 'time', 'genre', 'week1'],
    value_vars=[
        'x1st.week', 'x2nd.week', 'x3rd.week', 'x4th.week', 'x5th.week',
        'x6th.week', 'x7th.week', 'x8th.week', 'x9th.week', 'x10th.week',
        'x11th.week', 'x12th.week', 'x13th.week', 'x14th.week', 'x15th.week',
        'x16th.week', 'x17th.week', 'x18th.week', 'x19th.week', 'x20th.week',
        'x21st.week', 'x22nd.week', 'x23rd.week', 'x24th.week', 'x25th.week',
        'x26th.week', 'x27th.week', 'x28th.week', 'x29th.week', 'x30th.week',
        'x31st.week', 'x32nd.week', 'x33rd.week', 'x34th.week', 'x35th.week',
        'x36th.week', 'x37th.week', 'x38th.week', 'x39th.week', 'x40th.week',
        'x41st.week', 'x42nd.week', 'x43rd.week', 'x44th.week', 'x45th.week',
        'x46th.week', 'x47th.week', 'x48th.week', 'x49th.week', 'x50th.week',
        'x51st.week', 'x52nd.week', 'x53rd.week', 'x54th.week', 'x55th.week',
        'x56th.week', 'x57th.week', 'x58th.week', 'x59th.week', 'x60th.week',
        'x61st.week', 'x62nd.week', 'x63rd.week', 'x64th.week', 'x65th.week',
    ],
    var_name='Billboard Week',
    value_name='Rank')

From here, the missing values were dropped using the dropna function, and pivot_table was then used to summarize the information for each song. For example:

number1pivot = pd.pivot_table(
    number1,
    index=['artist.inverted', 'track'],
    values=['Rank'],
    aggfunc=len)
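To make the melt, dropna, and pivot_table steps concrete, here is a small end-to-end sketch on a toy table. It is an illustration under assumptions, not the exact pipeline used for this post: the two songs and their ranks are invented, only three weekly columns are included, and the names `wide`, `long_form`, `weeks_on_chart`, `mean_rank`, and `weeks_at_no1` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per song, one column per chart week.
# NaN means the song was not on the chart that week. Values are made up.
wide = pd.DataFrame({
    "artist.inverted": ["Artist A", "Artist B"],
    "track": ["Song 1", "Song 2"],
    "x1st.week": [10, 50],
    "x2nd.week": [5, 60],
    "x3rd.week": [1, np.nan],
})

# Melt the weekly columns into one (week, rank) pair per row.
week_cols = [c for c in wide.columns if c.endswith(".week")]
long_form = pd.melt(wide,
                    id_vars=["artist.inverted", "track"],
                    value_vars=week_cols,
                    var_name="Billboard Week",
                    value_name="Rank")

# Drop the weeks in which a song was not on the chart.
long_form = long_form.dropna(subset=["Rank"])

# Summarize per song: number of weeks on the chart and average rank.
keys = ["artist.inverted", "track"]
weeks_on_chart = pd.pivot_table(long_form, index=keys, values="Rank", aggfunc="count")
mean_rank = pd.pivot_table(long_form, index=keys, values="Rank", aggfunc="mean")

# Weeks at #1: filter to rank 1 first, then count rows per song.
number1 = long_form[long_form["Rank"] == 1]
weeks_at_no1 = pd.pivot_table(number1, index=keys, values="Rank", aggfunc="count")

print(weeks_on_chart)
print(mean_rank)
print(weeks_at_no1)
```

Filtering before pivoting keeps the counting step identical for every question; only the subset of rows changes.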

1) During the year 2000, 17 songs held the #1 position on the Hot 100.

The songs that held that position for the longest were:

Destiny’s Child “Independent Women Part 1” — #1 for 11 weeks

Santana “Maria, Maria” — #1 for 10 weeks

Savage Garden “I Knew I Loved You” — #1 for 4 weeks

2) The average number of weeks spent on the Hot 100 was 16.7. The songs that spent the longest on the chart were:

Creed “Higher” at 57 weeks

Lonestar “Amazed” at 55 weeks

Faith Hill “Breathe” at 53 weeks

3 Doors Down “Kryptonite” at 53 weeks

Creed “With Arms Wide Open” at 47 weeks

3) The songs with the highest average position were:

Santana “Maria, Maria,” which averaged #10.5 across 26 weeks

Madonna “Music,” which averaged #13.5 across 24 weeks

N’Sync “Bye Bye Bye,” which averaged #14.3 across 21 weeks

Destiny’s Child “Independent Women Part 1,” which averaged #14.8 across 28 weeks

4) The dataset was very clean: only one column required cleaning. Song length was presented as a string in a human-readable format (M:SS), which needed to be converted into a machine-readable one (an integer number of seconds). I wrote a simple function to fix this:

def timeSplit(string):
    # Convert an "M:SS" string into a total number of seconds.
    temp = string.split(":")
    minutes = int(temp[0])
    seconds = int(temp[1])
    return minutes*60 + seconds
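As a usage sketch, the function can be applied to the whole column with pandas' `apply`. The function is repeated here so the block runs on its own; the `time` column name comes from the melt call earlier in the post, while the two song lengths and the `seconds` column name are made up for illustration.

```python
import pandas as pd

def timeSplit(string):
    # Convert an "M:SS" string into a total number of seconds.
    temp = string.split(":")
    minutes = int(temp[0])
    seconds = int(temp[1])
    return minutes * 60 + seconds

# Made-up song lengths in the M:SS format described above.
songs = pd.DataFrame({
    "track": ["Song 1", "Song 2"],
    "time": ["3:38", "4:07"],
})

# Apply the conversion to every row and store the result as integers.
songs["seconds"] = songs["time"].apply(timeSplit)
print(songs["seconds"].tolist())  # [218, 247]
```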

Unsurprisingly, my two aggregated measures of song popularity (average position on the chart and number of weeks spent on the chart) correlated significantly with one another (r = -0.77). (The correlation is negative because a better position on the chart is a lower number.)
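A correlation like this one can be computed with `scipy.stats.pearsonr`, which returns both the coefficient and a p-value. This is a hedged sketch: the post does not show which function produced r = -0.77, and the five data points and the column names `weeks_on_chart` and `mean_rank` are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Made-up per-song summary: songs with a better (lower) average rank
# tend to stay on the chart longer.
summary = pd.DataFrame({
    "weeks_on_chart": [5, 10, 20, 30, 40],
    "mean_rank": [90.0, 70.0, 50.0, 30.0, 20.0],
})

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(summary["weeks_on_chart"], summary["mean_rank"])

# r is negative here because a lower rank number means a better position.
print(r < 0)
```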

People’s attention spans being what they are, I suspected that song length would have a negative effect on popularity. This turned out not to be true. Song length predicted neither average position on the chart, nor length of time on the chart (r = 0.026 and r = -0.024, respectively). There may still be an association between song length and whether or not a song made it on the chart at all, but that data is not in front of me right now.

Week 1 rank predicted length of time spent on the chart only weakly (r² = 0.048). It was a better predictor of average position on the chart (r² = 0.27) and of the highest position that the song would reach (r² = 0.26).
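For a straight-line fit with one predictor, r² is the share of variance in the outcome that the predictor accounts for, so r² = 0.048 corresponds to roughly 5% of the variance in weeks on the chart. The sketch below shows two equivalent ways to get r² with NumPy. The data points and the names `week1_rank` and `peak_rank` are assumptions made for the example, not values from the dataset.

```python
import numpy as np

# Made-up data: rank in the first week and best rank ever reached.
week1_rank = np.array([95, 80, 60, 40, 20], dtype=float)
peak_rank = np.array([70, 65, 30, 25, 5], dtype=float)

# Route 1: square the Pearson correlation coefficient.
r = np.corrcoef(week1_rank, peak_rank)[0, 1]
r_squared = r ** 2

# Route 2: fit a line and compare residual variance to total variance.
slope, intercept = np.polyfit(week1_rank, peak_rank, 1)
predicted = slope * week1_rank + intercept
ss_res = np.sum((peak_rank - predicted) ** 2)          # unexplained variation
ss_tot = np.sum((peak_rank - peak_rank.mean()) ** 2)   # total variation
r_squared_fit = 1 - ss_res / ss_tot

# Both routes agree for simple linear regression with an intercept.
print(np.isclose(r_squared, r_squared_fit))
```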

Genre is an interesting nut to crack here. Different genres are not represented equally on the chart, but it is unclear why. Here we see that, although there are far more rock songs than songs of any other genre, songs from different genres seem to stay on the chart for approximately the same amount of time. However, the small sample size for many genres makes it difficult to draw firm conclusions.

The size of the bubble indicates the average number of weeks that songs were on the chart, and the color indicates the number of songs within the genres.

The fact that rock music accounts for 43% of the songs on the chart in 2000 could be interpreted to mean that rock music is more popular than the other genres, or that rock musicians are better at creating hits than musicians in other genres. It could also mean, however, that more rock songs are produced, and therefore more good songs are produced by virtue of volume.

With the proper data, it would be possible to estimate the chance that any given song reaches the Billboard Hot 100, and whether that chance differs between genres. This is the method I would use to solve this problem:
1. Find the number of singles recorded in a given 10-year span, and their genres.
2. Find the number of tracks that were on the Billboard Hot 100 in the same 10-year span.
3. Calculate the probability of any given song being on the chart (the number of songs on the chart divided by the number of songs recorded).
4. Calculate the expected rate of each major genre being on the chart (the share of all recorded songs that belong to each genre).
5. Calculate the actual rate of each genre being represented on the chart (the number of songs within a genre on the chart, divided by the total number of songs recorded in that genre).
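
The steps above can be sketched in a few lines of Python. Every number below is an invented placeholder, since the counts of recorded singles are exactly the data this post does not have. The variable names (`recorded`, `charted`, `expected_share`, `actual_rate`, `relative_rate`) are hypothetical, and the final ratio is one possible way to compare genres, not part of the outline itself.

```python
# Steps 1 and 2: counts per genre over the same 10-year span.
# All counts are made-up placeholders, not real data.
recorded = {"Rock": 5000, "Country": 2000, "Rap": 3000}   # singles recorded
charted = {"Rock": 140, "Country": 70, "Rap": 60}         # singles that charted

total_recorded = sum(recorded.values())
total_charted = sum(charted.values())

# Step 3: probability that any given song reaches the chart.
overall_rate = float(total_charted) / total_recorded

# Step 4: share of all recorded songs that belong to each genre.
expected_share = {g: float(n) / total_recorded for g, n in recorded.items()}

# Step 5: charted songs in a genre divided by recorded songs in that genre.
actual_rate = {g: float(charted[g]) / recorded[g] for g in recorded}

# One possible comparison: a value above 1 means the genre charts more often
# than the average song does; below 1 means less often.
relative_rate = {g: actual_rate[g] / overall_rate for g in recorded}

for genre in sorted(recorded):
    print(genre, round(actual_rate[genre], 4), round(relative_rate[genre], 2))
```

With real counts, a genre whose relative rate is well above 1 would point toward genuinely higher hit rates rather than sheer volume of releases.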


Monday, October 10, 2016
