Creative Commons License

Blues Analysis Project


Analysis of our Blues Collection

Our Methods

To mark up the important data, format, and syntax of each song lyrics file in our gigantic collection, we used oXygen's "Find and Replace in Files" tool to write regular expressions (regex) that captured the formatting patterns recurring throughout the blues lyrics files, then wrapped the corresponding XML tags around our capturing groups.

We created XML tags to mark up the following information from the files:

  • The "metadata" section, which contained the song title, performing artist, recording date/year, and album
  • The content within the "metadata" section, including all those listed above
  • The entirety of the song's lyrics, as well as each line within the lyrics
  • The "Notes" section that the original website sometimes included beneath the song lyrics, which contained neat facts about the song, explanations of references or allegories in the lyrics, or fun facts about the artist who sang the song
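
The capture-and-wrap idea can be sketched with Python's re module instead of oXygen. The raw layout shown here ("Field: value" header lines, a blank line, then the lyrics), the tag names, and the helper tag_song are all assumptions made for illustration; they are not the project's actual patterns or schema.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical raw layout: "Field: value" header lines, a blank line, then lyrics.
RAW = """Title: Example Blues
Artist: Example Artist
Year: 1951

Woke up this morning
Rain & thunder at my door
"""

def tag_song(raw: str) -> str:
    """Wrap header fields and lyric lines in (hypothetical) XML tags."""
    header, _, body = raw.strip().partition("\n\n")
    # Escape &, <, > first so the output stays well-formed XML.
    header, body = escape(header), escape(body)
    # Capture "Field: value" and reuse both groups in the replacement.
    header = re.sub(
        r"(?m)^(\w+):[ \t]*(.+)$",
        lambda m: f"<{m.group(1).lower()}>{m.group(2)}</{m.group(1).lower()}>",
        header,
    )
    # Wrap every non-empty lyric line in <line> tags.
    body = re.sub(r"(?m)^(.+)$", r"<line>\1</line>", body)
    return (
        f"<song>\n<metadata>\n{header}\n</metadata>\n"
        f"<lyrics>\n{body}\n</lyrics>\n</song>"
    )

print(tag_song(RAW))
```

Escaping before tagging matters: if the ampersand in the second lyric line were left raw, the result would not parse as XML.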

XQuery

Once we had all relevant data and formatting tagged in our collection, we used the eXist software to write and run XQueries that pulled the information we were after from our XML tags en masse using XPath functions. Once the queries worked, we used XQuery FLWOR expressions to generate SVG graphics, collections of raw text files for Python, and TSV (Tab-Separated Values) files from the queried data. From these, we created visualizations of the research results from our blues collection! (Here's the GitHub view of our XQuery Files)
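
For readers who do not know XQuery, the shape of a FLWOR expression (for, let, where, order by, return) can be mirrored in Python. This is only an analogue of the logic, not the project's queries: the element names, the sample songs, and the function songs_to_tsv are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a collection of tagged song files.
SONGS = [
    "<song><metadata><title>Song B</title><artist>Artist One</artist>"
    "<year>1962</year></metadata></song>",
    "<song><metadata><title>Song A</title><artist>Artist Two</artist>"
    "<year>1948</year></metadata></song>",
    "<song><metadata><title>Song C</title><artist>Artist One</artist>"
    "</metadata></song>",
]

def songs_to_tsv(xml_docs):
    """Return TSV text with one row per song that has a year, sorted by year."""
    rows = []
    for doc in xml_docs:                             # for $song in ...
        meta = ET.fromstring(doc).find("metadata")   # let $meta := $song/metadata
        year = meta.findtext("year")
        if year is None:                             # where exists($meta/year)
            continue
        rows.append((int(year), meta.findtext("title"), meta.findtext("artist")))
    rows.sort()                                      # order by $year
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["year", "title", "artist"])     # return ...
    writer.writerows(rows)
    return out.getvalue()

print(songs_to_tsv(SONGS))
```

The song without a year is filtered out, which is the role the "where" clause plays in a FLWOR expression.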

Networking Analysis Graphics

Network of Interactions between Songwriters and Performing Artists

This network, created in Cytoscape, uses all of the "artist" and "songwriter" tags in our collection of 1,088 song XML files to visualize the shared interactions between performing artists (big blue circles/rectangles) and songwriters (large and small pink squares) as they are connected through songs. Each connecting line represents one or more songs; the direction of the line shows who wrote the song (dotted end) and who performed it (arrow end).

The larger pink squares represent songwriters who connect two or more performing artists together, meaning that those people wrote songs which were performed by two or more performing artists in our collection.

The red lines indicate songs that were originally written by a performing artist and were covered by one or more other performing artists.

The green squares represent songs that are either traditional in origin, meaning the lyrics were written a long time ago and have been passed down through many generations to the performing artists, or unknown in origin, meaning the original writer of the song could not be confirmed by our source.
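
The categories described above can be derived from simple (songwriter, performer) pairs before anything is drawn. The sketch below is an assumption about how such an edge list might be prepared, not the project's actual pipeline; the sample names and the function build_network are hypothetical.

```python
from collections import defaultdict

# Hypothetical (songwriter, performer) pairs, one per song.
SONGS = [
    ("Writer A", "Performer X"),
    ("Writer A", "Performer Y"),
    ("Writer B", "Performer X"),
    ("Performer X", "Performer Y"),   # Y covers a song that X wrote
    ("Traditional", "Performer Y"),
]

def build_network(songs):
    """Return weighted directed edges, 'hub' songwriters, and cover edges."""
    performers = {performer for _, performer in songs}
    weights = defaultdict(int)        # (writer, performer) -> number of songs
    performed_by = defaultdict(set)   # writer -> distinct performers
    for writer, performer in songs:
        weights[(writer, performer)] += 1
        performed_by[writer].add(performer)
    # Songwriters linked to two or more performers (the larger pink squares).
    hubs = {w for w, ps in performed_by.items() if len(ps) >= 2}
    # Edges whose writer is also a performer, sung by someone else (the red lines).
    covers = {(w, p) for (w, p) in weights if w in performers and w != p}
    return dict(weights), hubs, covers

edges, hubs, covers = build_network(SONGS)
for (writer, performer), count in sorted(edges.items()):
    print(f"{writer}\t{performer}\t{count}")
```

The printed rows are tab-separated source, target, and weight columns, a layout that network tools such as Cytoscape can import as an edge table.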

This network shows the massive number of blues songs that have been passed down and shared among blues artists, even within our limited collection.

IMPORTANT NOTE: This network does not accurately represent ALL of the interactions between blues artists and songwriters throughout the whole genre. Its dataset is limited to our source files, which are far from fully representative or complete data on the blues. This is why B.B. King's set of songwriter connections is so much bigger than the others': it's not that B.B. King ACTUALLY has more songs connecting him to more songwriters, but that our collection simply has a disproportionate number of his songs compared to the other artists in our collection. In the future we hope to add more song files from our artists, and from other blues artists we have no songs for, to create a more representative network for the blues as a whole!

Python and SpaCy

To compare aspects of the blues lyrics themselves within our collection, we used Python and spaCy's Natural Language Processor (NLP) to auto-magically analyze any collection of lyrics we give it, find parts of speech, entities, and the like within each line, and produce beautiful bar graphs using pyGal. (Take a look at our Python code on GitHub)

For our purposes, we used the NLP to create visual graphics of the most common nouns, verbs, and geopolitical entities within songs which were written and performed in each decade. In doing so, we created a marvelous visual of what blues music was singing about, which topics or hardships were common reasons to have the (emotional) blues at the time, and, in the case of the geopolitical entities graphs, where these artists were experiencing and musically sharing the blues from.
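
The counting step behind such graphs can be sketched as follows. This is a minimal illustration rather than the project's code: the function names are hypothetical, and the model name "en_core_web_sm" is an assumption about which spaCy model might be used. The counting function only relies on the attributes spaCy's parsed documents expose (pos_, lemma_, ents, label_, text).

```python
from collections import Counter

def top_terms(docs, n=10):
    """Count noun lemmas, verb lemmas, and GPE entities across parsed docs.

    Each doc must be iterable over tokens that have `pos_` and `lemma_`
    attributes, and must expose `ents` whose items have `label_` and
    `text`, as spaCy's Doc objects do.
    """
    nouns, verbs, places = Counter(), Counter(), Counter()
    for doc in docs:
        for token in doc:
            if token.pos_ == "NOUN":
                nouns[token.lemma_.lower()] += 1
            elif token.pos_ == "VERB":
                verbs[token.lemma_.lower()] += 1
        for ent in doc.ents:
            if ent.label_ == "GPE":   # spaCy's label for countries, cities, states
                places[ent.text] += 1
    return nouns.most_common(n), verbs.most_common(n), places.most_common(n)

def analyze_lyrics(texts, model="en_core_web_sm"):
    """Parse raw lyric strings with spaCy, then count terms.

    Assumes spaCy and the named model are installed; the model choice is
    an assumption, not necessarily the one the project used.
    """
    import spacy                      # imported lazily so top_terms works without it
    nlp = spacy.load(model)
    return top_terms(nlp.pipe(texts), n=10)
```

Counting lemmas rather than surface forms groups "trains" with "train", which keeps one subject from being split across several bars.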

Our spaCy Graphs

Geopolitical Entities in Entire Collection
Names Mentioned in Entire Collection

These graphs, created using Python's spaCy NLP and the pyGal charting library, show the most common geopolitical entities and the most common names of people mentioned across all of the song lyrics in our collection.

While our collection is indeed limited, it's interesting to contemplate why places such as Chicago, Texas, New Orleans, and Tennessee are mentioned most often in our blues songs. It could be that many of the songs were written and/or performed in these places, or that they were referenced so many times because of hardships endured there. It could also be argued that these places are mentioned most often because they are sung about in a positive light: maybe these are places that blues artists and songwriters wished to escape to, instead of from.

The graph representing the most common names of people mentioned within our collection's lyrics is harder to decipher, as there are a number of possible reasons why these names in particular were mentioned so many times. The fact that half of these names are presumably female may reflect that many of these blues songs were expressing the woes of heartbreak, and it's possible that these female names are used as a general reference to a significant other. It's also fascinating that Uncle Sam is on this list, which may reflect how many of our collection's songs were written to express the woes and neglect that America's government brought upon many of our collection's artists and songwriters at the time.

NOTE: Again, this graph is not comprehensive. Some work was done to fix errors that spaCy introduced when producing these entity word frequencies, so there may be some inaccuracies beyond the fact that our dataset itself is incomplete.


Most Common Nouns Appearing in our Blues Collection through the Decades

Our project team grouped the lyrics of every song by the decade in which it was performed, then ran each decade of song lyrics through spaCy's NLP in PyCharm, using Python code to automatically find and count the ten most common nouns used in each decade of blues lyrics. We ran this experiment in hopes of finding interesting correlations between the era in which songs were performed and the popular subjects of blues songs at the time.
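
The grouping step can be sketched like this, assuming each song is available as a (year, lyrics) pair. The sample data and both function names are hypothetical, and the chart helper assumes the Pygal library is installed and leaves its styling at the defaults.

```python
from collections import defaultdict

# Hypothetical (recording year, lyrics) pairs.
SONGS = [
    (1928, "first song text"),
    (1951, "second song text"),
    (1959, "third song text"),
]

def group_by_decade(songs):
    """Map each decade (e.g. 1950) to the joined lyrics recorded in it."""
    by_decade = defaultdict(list)
    for year, lyrics in songs:
        by_decade[year // 10 * 10].append(lyrics)   # 1959 -> 1950
    return {decade: "\n".join(texts) for decade, texts in sorted(by_decade.items())}

def bar_chart(title, counts, path):
    """Render (label, count) pairs as an SVG bar chart, one bar per label.

    Assumes Pygal is installed; it is imported lazily so the grouping
    function above can be used without it.
    """
    import pygal
    chart = pygal.Bar()
    chart.title = title
    for label, count in counts:
        chart.add(label, [count])
    chart.render_to_file(path)

decades = group_by_decade(SONGS)
print(list(decades))
```

Each decade's joined text can then be parsed once and its ten most frequent nouns passed to bar_chart as (noun, count) pairs.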

Keep in mind that these graphs represent findings from non-comprehensive data; more raw source lyrics are needed in certain decades to represent the most common nouns in blues songs more accurately and evenly throughout the decades. The 1920s, 1930s, and 1940s are the three decades we have the least song lyric data from, as you can see from the generally lower noun counts in the graphs we've created for them.