According to the Nielsen Company, over 93% of Americans listen to music each week, and listening time is estimated at around 25 hours per week on average.

The vast majority of that music is popular “Top 40” music, and most of that music has lyrics. Given this popularity and prevalence, it is surprising to find that there is no organized corpus of song lyrics available for analysis, consigning most studies of lyrics to small samples of 100 songs or so.

This project uses a large, searchable database of American popular song lyrics: specifically, the lyrics of the Billboard Top 100 songs of the year for each year from 1960 to the present, i.e., the lyrics to 5600 songs, almost 300 hours’ worth of music. Compared with the samples of 100 songs or so used in most studies, this would represent a lyrical corpus larger than any other by an order of magnitude. A stripped-down version of the database would be “Ngrammed” in line with Google’s Ngram Viewer for books (https://books.google.com/ngrams).
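As a rough illustration of what "Ngramming" a lyrics corpus could involve, the sketch below computes the relative frequency of a phrase per year, in the spirit of the Ngram Viewer. It is not the project's implementation: the function names (ngrams, yearly_frequency), the input layout (a dictionary mapping each year to a list of tokenized songs), and the toy data are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(songs_by_year, phrase):
    """Relative frequency of `phrase` among same-length n-grams, per year.

    songs_by_year is a hypothetical layout: {year: [list of tokens per song]}.
    """
    n = len(phrase.split())
    result = {}
    for year, songs in songs_by_year.items():
        counts = Counter()
        for tokens in songs:
            counts.update(ngrams(tokens, n))
        total = sum(counts.values())
        # Guard against years with no n-grams of this length.
        result[year] = counts[phrase] / total if total else 0.0
    return result


# Toy data, invented for illustration only (not taken from the database).
toy_corpus = {
    1960: [["my", "love", "is", "true"]],
    1961: [["hello", "there"]],
}
freq = yearly_frequency(toy_corpus, "my love")
```

Dividing by the total number of n-grams in each year keeps years with longer or wordier songs from dominating the plot; whether the project normalizes this way is an assumption.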

The full database would be available to our faculty and students for academic study and analysis. Topics that have been investigated previously using small sets of lyrics (sentiment analysis, use of narcissistic language, and contextual use of the word “love”, to name a few) can be revisited, and new questions of interest to particular projects can be asked. (At least one psychology faculty member has already approached me about the project.) Such analysis can be readily performed using inexpensive specialty software (e.g., LIWC) or open-source software already in use in our CS department (e.g., Weka).

Project Details

The song lyrics were parsed using the APIs available at meaningcloud.com. Each song was parsed for words, word counts, lemmas, lemma counts, and parts of speech. The results were stored in a MySQL database. Search and data extraction are handled by PHP scripts, and the data graphing is done with basic implementations of the D3 JavaScript library.
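To make the per-song word counts concrete, here is a minimal sketch of the counting step using only the Python standard library. It is an assumption-laden stand-in, not the project's pipeline: the actual parsing was done through the MeaningCloud APIs, and lemmas and parts of speech are not reproduced here. The names tokenize and word_counts, the tokenization rule, and the sample line are invented for the example.

```python
import re
from collections import Counter


def tokenize(lyrics):
    """Lowercase the text and extract word tokens, keeping contractions whole.

    This simple rule (letters with optional internal apostrophes) is an
    assumption; a real parser would handle digits, hyphens, and more.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", lyrics.lower())


def word_counts(lyrics):
    """Count how often each word token occurs in one song's lyrics."""
    return Counter(tokenize(lyrics))


# Invented sample line, not taken from any song in the database.
sample = "The rain keeps falling and the rain won't stop"
counts = word_counts(sample)
```

A table of such counts, keyed by song and word, is the kind of data one might store in a relational database for later searching; the actual MySQL schema is not described in the source, so none is assumed here.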

Various examples are available from the navigational links, including some API calls that return JSON arrays of word and lemma counts, as well as song titles.