The Billboard Hot 100 is a chart used in the music industry to track
the most popular songs. Each week, the list of the top 100 singles is
compiled based on record sales and radio airplay. Having a song on
this list is considered very prestigious for artists and record
labels. The prestige increases the higher a song climbs on the list
and the more weeks it stays there.
The data used here is all of the songs on the Billboard Hot 100 in
the year 2000. I was chiefly concerned with four things:
1) The songs that reached #1.
2) The number of weeks that songs stayed on the Hot 100.
3) The average position on the billboard that each song held.
4) Any factors that were associated with a song’s popularity, as
measured by items (2) and (3).
All analyses were performed in Python 2.7. I used the NumPy, pandas,
SciPy, and Matplotlib libraries. Advanced visuals were created using Tableau.
First impressions:
In 2000, there were 317 songs by 228 artists on the Billboard Hot 100.
Jay-Z had five songs on the chart, the most of any artist that year.
There were two songs on the Hot 100 named “Where I Wanna Be”; they
were unrelated, and by two different artists (Donell Jones and Shade
Sheist). There were 137 rock songs, 74 country songs, 58 rap songs,
23 R&B songs, 9 pop songs, 9 Latin songs, 4 electronica songs, and 1
each for gospel, jazz, and reggae.
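
As a quick sanity check, the genre tallies can be put into a pandas Series to confirm that they add up to 317 and to see each genre's share of the chart. This is only a sketch built from the numbers quoted in this post, not from the raw file. On the raw data, a call such as `billboard['genre'].value_counts()` would presumably give the same tallies, assuming the DataFrame and column carry those names.

```python
import pandas as pd

# Genre tallies as quoted in the text (not recomputed from the raw file).
genre_counts = pd.Series({
    'rock': 137, 'country': 74, 'rap': 58, 'R&B': 23, 'pop': 9,
    'latin': 9, 'electronica': 4, 'gospel': 1, 'jazz': 1, 'reggae': 1,
})

# Total number of songs across all genres.
total_songs = genre_counts.sum()

# Share of the chart held by each genre, as a percentage.
genre_share = (100.0 * genre_counts / total_songs).round(1)

print(total_songs)
print(genre_share)
```

The total matches the 317 songs reported above, and rock's share comes out at roughly 43%.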
For the remaining analyses, I made use of the melt and pivot_table
functions in pandas. The data was originally presented with one row
per song, and, in addition to basic information about the song
(genre, length, artist, etc.), one column for each of the possible
weeks a song could be on the billboard (weeks 1-76). The melt
function was used to combine the weekly columns into a single, more
manageable column.

    import pandas as pd

    # Reshape from wide (one column per chart week) to long (one row
    # per song per week). The first 65 weekly columns are melted.
    billboardMelt = pd.melt(
        billboard,
        id_vars=['artist.inverted', 'track', 'time', 'genre', 'week1'],
        value_vars=[
            'x1st.week', 'x2nd.week', 'x3rd.week', 'x4th.week', 'x5th.week',
            'x6th.week', 'x7th.week', 'x8th.week', 'x9th.week', 'x10th.week',
            'x11th.week', 'x12th.week', 'x13th.week', 'x14th.week', 'x15th.week',
            'x16th.week', 'x17th.week', 'x18th.week', 'x19th.week', 'x20th.week',
            'x21st.week', 'x22nd.week', 'x23rd.week', 'x24th.week', 'x25th.week',
            'x26th.week', 'x27th.week', 'x28th.week', 'x29th.week', 'x30th.week',
            'x31st.week', 'x32nd.week', 'x33rd.week', 'x34th.week', 'x35th.week',
            'x36th.week', 'x37th.week', 'x38th.week', 'x39th.week', 'x40th.week',
            'x41st.week', 'x42nd.week', 'x43rd.week', 'x44th.week', 'x45th.week',
            'x46th.week', 'x47th.week', 'x48th.week', 'x49th.week', 'x50th.week',
            'x51st.week', 'x52nd.week', 'x53rd.week', 'x54th.week', 'x55th.week',
            'x56th.week', 'x57th.week', 'x58th.week', 'x59th.week', 'x60th.week',
            'x61st.week', 'x62nd.week', 'x63rd.week', 'x64th.week', 'x65th.week',
        ],
        var_name='Billboard Week', value_name='Rank')

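
Typing out 65 column names by hand is error-prone, so here is a minimal sketch of how the same list could be generated instead. The `ordinal` helper and the `week_columns` name are my own inventions for illustration, and the toy DataFrame stands in for the real data; only the column naming pattern is taken from the melt call.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '%d%s' % (n, suffix)

# Column names following the pattern 'x1st.week' ... 'x65th.week'.
week_columns = ['x%s.week' % ordinal(n) for n in range(1, 66)]

# Toy wide-format data (made up) with just the first two weekly columns.
toy = pd.DataFrame({
    'track': ['Song A', 'Song B'],
    'x1st.week': [10, 50],
    'x2nd.week': [5, 45],
})

# Melt the two weekly columns into one 'Rank' column.
toy_melt = pd.melt(toy,
                   id_vars=['track'],
                   value_vars=week_columns[:2],
                   var_name='Billboard Week',
                   value_name='Rank')
print(toy_melt)
```

The toy result has one row per song per week, which is the same long shape that the full melt call produces.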
From here, the missing values were dropped using the dropna function.
The pivot_table function was then used to summarize the information
for each song. For example:
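
    # number1 is assumed to hold only the rows where Rank == 1, so
    # len counts each song's number of weeks at #1.
    number1pivot = pd.pivot_table(number1,
                                  index=['artist.inverted', 'track'],
                                  values=['Rank'],
                                  aggfunc=len)
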
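
To make the drop-then-pivot step concrete, here is a small self-contained sketch on made-up long-format data. It assumes that `number1` is the subset of rows where the rank equals 1 (the post does not show how it was built), and it uses a groupby as one possible way to get weeks on the chart and average position per song.

```python
import pandas as pd

# Made-up long-format data in the shape produced by melt.
long_form = pd.DataFrame({
    'artist.inverted': ['Artist A'] * 3 + ['Artist B'] * 3,
    'track': ['Song A'] * 3 + ['Song B'] * 3,
    'Billboard Week': ['x1st.week', 'x2nd.week', 'x3rd.week'] * 2,
    'Rank': [2, 1, 1, 40, 35, None],  # None: song had left the chart
})

# Drop the weeks in which a song was not on the chart.
long_form = long_form.dropna(subset=['Rank'])

# Assumption: number1 keeps only the weeks spent at the #1 position.
number1 = long_form[long_form['Rank'] == 1]

# Count the weeks at #1 for each song.
number1pivot = pd.pivot_table(number1,
                              index=['artist.inverted', 'track'],
                              values=['Rank'],
                              aggfunc=len)

# Weeks on the chart and average position for every song.
summary = (long_form
           .groupby(['artist.inverted', 'track'])['Rank']
           .agg(['count', 'mean']))

print(number1pivot)
print(summary)
```

In this toy data only "Song A" reaches #1, so it is the only row in `number1pivot`, while `summary` has a row for both songs.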
1) During the year 2000, 17 songs held the #1 position on the Hot 100.
The songs that held that position for the longest were:
Destiny’s Child “Independent Women Part 1”, #1 for 11 weeks
Santana “Maria, Maria”, #1 for 10 weeks
Savage Garden “I Knew I Loved You”, #1 for 4 weeks
2) The average number of weeks spent on the Hot 100 was 16.7. The songs
that spent the longest on the chart were:
Creed “Higher” at 57 weeks
Lonestar “Amazed” at 55 weeks
Faith Hill “Breathe” at 53 weeks
3 Doors Down “Kryptonite” at 53 weeks
Creed “With Arms Wide Open” at 47 weeks
3) The songs with the highest average position were:
Santana “Maria, Maria,” which averaged #10.5 across 26 weeks
Madonna “Music,” which averaged #13.5 across 24 weeks
N’Sync “Bye Bye Bye,” which averaged #14.3 across 21 weeks
Destiny’s Child “Independent Women Part 1,” which averaged #14.8 across 28 weeks
4) The dataset was very clean; only one column required cleaning. Song
length was presented as a string in a human-readable format (M:SS),
which needed to be converted into a machine-readable format (an
integer number of seconds). I wrote a simple function to fix this:

    def timeSplit(string):
        # Convert an "M:SS" string into a total number of seconds.
        temp = string.split(":")
        minutes = int(temp[0])
        seconds = int(temp[1])
        return minutes*60 + seconds

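
For illustration, this is roughly how such a function could be applied to a whole column. The sample times are made up and the `seconds` column name is my own; `time` is the column name used in the melt call, and I am assuming it holds the M:SS strings.

```python
import pandas as pd

def timeSplit(string):
    # Convert an "M:SS" string into a total number of seconds.
    temp = string.split(":")
    minutes = int(temp[0])
    seconds = int(temp[1])
    return minutes * 60 + seconds

# Made-up song lengths in the M:SS format described above.
toy = pd.DataFrame({'time': ['3:38', '4:07', '10:00']})

# Apply the conversion to every row; 'seconds' is a hypothetical column name.
toy['seconds'] = toy['time'].apply(timeSplit)
print(toy)
```

Note that the function assumes every value has exactly one colon and no missing entries; a messier column would need extra handling.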
Unsurprisingly,
my two aggregated measures of song popularity (average position on
the chart and number of weeks spent on the chart) correlated
significantly with one another (r = -0.77). (The correlation is
negative because a better position on the chart is a lower number.)
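
A correlation like this can be computed directly from a per-song summary table. The sketch below uses invented numbers and hypothetical column names (`weeks_on_chart`, `avg_rank`), purely to show why the sign comes out negative. It uses the pandas `corr` method, which computes Pearson's r by default; `scipy.stats.pearsonr` returns the same coefficient along with a p-value.

```python
import pandas as pd

# Invented per-song summaries, for illustration only.
songs = pd.DataFrame({
    'weeks_on_chart': [5, 10, 20, 30, 45],
    'avg_rank': [85.0, 70.0, 40.0, 25.0, 15.0],
})

# Pearson correlation between the two popularity measures.
r = songs['weeks_on_chart'].corr(songs['avg_rank'])
print(r)
```

Songs that stay longer have numerically smaller (better) average ranks in this toy table, so r is negative.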
People’s
attention spans being what they are, I suspected that song length
would have a negative effect on popularity. This turned out not to be
true. Song length predicted neither average position on the chart,
nor length of time on the chart (r = 0.026 and r = -0.024,
respectively). There may still be an association between song length
and whether or not a song made it on the chart at all, but that data
is not in front of me right now.
Week 1 rank predicted length of time spent on the chart (r² = 0.048),
average position on the chart (r² = 0.27), and the highest position
that the song would reach (r² = 0.26).
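
For readers who want to see where an r² value comes from, here is a minimal sketch of a one-predictor linear fit using NumPy. The data points and variable names are invented, and this is not necessarily the exact procedure behind the figures above; it only shows the standard calculation of r² as one minus the ratio of residual to total variation.

```python
import numpy as np

# Invented values: rank in week 1 and average rank over the whole run.
week1_rank = np.array([95.0, 80.0, 60.0, 40.0, 20.0, 10.0])
avg_rank = np.array([70.0, 75.0, 50.0, 30.0, 25.0, 8.0])

# Fit a straight line: avg_rank ~ slope * week1_rank + intercept.
slope, intercept = np.polyfit(week1_rank, avg_rank, 1)
predicted = slope * week1_rank + intercept

# r squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((avg_rank - predicted) ** 2)
ss_tot = np.sum((avg_rank - avg_rank.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

With a single predictor and an intercept, this value equals the square of the Pearson correlation between the two variables.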
Genre is an interesting nut to crack here. Different genres are not
represented equally on the chart, but it is unclear why. Here we see
that, although there are far more rock songs than songs of any other
genre, songs from different genres seem to stay on the chart for
approximately the same amount of time. However, the small sample
sizes for many genres make it difficult to draw firm conclusions.
[Figure: bubble chart of genres.] The size of each bubble indicates the average number of weeks that songs in the genre were on the chart, and the color indicates the number of songs within the genre.
The fact that rock music accounts for 43% of the songs on the chart in
2000 could be interpreted to mean that rock music is more popular than
the other genres, or that rock musicians are better at creating hits
than musicians in other genres. It could also mean, however, that more
rock songs are produced, and therefore more good songs are produced
by virtue of volume.
With the proper data, it would be possible to determine the chance
that any given song reaches the Billboard Hot 100, and whether this
chance differs between genres. The following outline is the method I
would use to solve this problem:
1. Find the number of singles recorded in a given 10-year span, and their genre.
2. Find the number of tracks that were on the Billboard Hot 100 in the same 10-year span.
3. Calculate the probability of any given song being on the billboard (the number of songs on the billboard divided by the total number of songs recorded).
4. Calculate the expected rate of each major genre being on the billboard (the share of all recorded songs that belong to each genre).
5. Calculate the actual rate of each genre being represented on the billboard (the number of songs within a genre on the billboard, divided by the total number of songs recorded in that genre).
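
The steps above can be sketched in a few lines of Python. Every count below is a made-up placeholder, since the real totals of recorded singles are exactly the data that is missing; the variable names are also my own. The point is only to show the arithmetic of steps 3 to 5.

```python
# Hypothetical number of singles recorded per genre (placeholders, NOT real data).
recorded = {'rock': 5000, 'country': 2000, 'rap': 1500}
# Hypothetical number of those singles that charted (placeholders, NOT real data).
charted = {'rock': 400, 'country': 200, 'rap': 250}

total_recorded = float(sum(recorded.values()))
total_charted = float(sum(charted.values()))

# Step 3: overall probability that any given song reaches the chart.
p_chart = total_charted / total_recorded

rates = {}
for genre in recorded:
    # Step 4: the genre's share of all recorded songs.
    expected_share = recorded[genre] / total_recorded
    # Step 5: the fraction of the genre's songs that actually charted.
    actual_rate = charted[genre] / float(recorded[genre])
    # A ratio above 1 means the genre charts more often than the overall rate.
    rates[genre] = (expected_share, actual_rate, actual_rate / p_chart)

for genre in sorted(rates):
    print(genre, rates[genre])
```

With these placeholder numbers, rock has the most charted songs but the lowest chance per song, which is the kind of pattern the volume explanation would predict.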







