What Do I Need To Know About This Data?
01 May 2020
When I first decided to teach myself some data science concepts by analyzing Phish data and building some predictive models, I had seen enough data online to know that there was a database somewhere that listed all setlists. I looked into it and saw that the two main sources being used were phish.net and phish.in. Phish.net seems more comprehensive and standardized, but it seems like it isn’t open source due to their relationship with the Mockingbird Foundation. Phish.in has some missing shows and quirks in the data but includes track lengths, which will be important to some of the project ideas I have. I didn’t want to spend my first few hours with the dataset looking for all of its flaws, so I jumped into my first two projects: weekday bias and song frequency over time. But before I published anything, I thought it best to look into how messy the data really was.
Missing Shows
Phish.in doesn’t have every show Phish has played with a known setlist, so inherently there will be songs missing from totals. Fortunately, most of the missing shows are in the 1989 - 1992 range, when they were playing tons of shows, so the gaps won’t have as big an impact on data trends as they would if a large chunk of 3.0 shows were missing. Phish.in also includes a bunch of extra shows like festival soundchecks and TV appearances. Since these are usually only a handful of songs, they shouldn’t have a big impact on data trends either. I will need to drop these shows from some of the data sets when I get to some of the modeling ideas I have.
Lost in the Segues
Phish.in is sourced from audience recordings, and the tracks from those recordings are titled by whoever processes the recording for upload. This leads to a lot of inconsistency in track titles. It looks like jcraigk and the people who put together phish.in did a great job of creating song aliases, so differences in spelling, spacing, and capitalization are almost completely fixed. The bigger issue comes from inconsistency in how tracks are broken up. For example, about half of the time an Alumni > Letter to Jimmy Page > Alumni sandwich is listed as one track, and the other half it is broken up. There are also over 20 YEMs that have segues into other songs and back. I didn’t even try to count the number of times that Hold Your Head Up shows up in the list of 604 tracks with segues in the title. It’s a lot. This is a pretty significant issue for projects where I’m comparing the frequency of songs played. I’m not yet sure how I would fix it within my own database while maintaining all the other data like set placement and song duration. And I doubt it will be fixed by anyone else on the project, because splitting up 604 tracks for listening is not going to be worth anyone’s time. For some projects, it might be worth spending some time accounting for these 604 tracks manually before publishing anything.
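As a rough sketch of how the segue problem could be handled for frequency counts, the snippet below splits a track title on a segue marker and counts each component song. Everything here is an assumption for illustration: the rows, the field names (`title`, `duration_s`), the durations, and the idea that segues appear in titles as `>` or `->`. It is not the phish.in schema or API.

```python
import re
from collections import Counter

# Hypothetical rows for illustration only. Field names and durations are
# made up; they are not the actual phish.in schema or real track lengths.
tracks = [
    # One show where the sandwich was uploaded as a single track...
    {"title": "Alumni > Letter to Jimmy Page > Alumni", "duration_s": 540},
    # ...and another where the same sandwich was broken into three tracks.
    {"title": "Alumni", "duration_s": 200},
    {"title": "Letter to Jimmy Page", "duration_s": 140},
    {"title": "Alumni", "duration_s": 200},
    {"title": "Harry Hood > Fireworks Jam", "duration_s": 1500},
]

# Assumed segue notation: ">" or "->" with optional surrounding spaces.
SEGUE_MARKER = re.compile(r"\s*-?>\s*")


def split_title(title):
    """Return the component song names found in a track title."""
    return [part for part in SEGUE_MARKER.split(title.strip()) if part]


def count_songs(rows):
    """Count every component song, whether or not the track was split."""
    counts = Counter()
    for row in rows:
        counts.update(split_title(row["title"]))
    return counts


# Tracks whose title contains more than one song.
segue_rows = [row for row in tracks if len(split_title(row["title"])) > 1]
song_counts = count_songs(tracks)
```

Counting this way makes the combined and broken-up versions of the sandwich contribute the same totals. It does not solve the harder part mentioned above: the duration and set placement still belong to the combined track, so those would need manual attention.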
Long and Short Tracks
Another component of the data is how the inconsistency above impacts track duration. On the high end, eight of the 54 tracks longer than thirty minutes and thirty-one of the 471 tracks between twenty and thirty minutes contain a segue and at least two songs. Some of these are straightforward: Harry Hood > Fireworks Jam would be considered one song by current conventions. Others will need a closer look, like Tweezer > Kung > Tweezer. There are also some tracks with dead space between songs, which artificially inflates their length. On the low end it is a bit more complicated. Because the source is audience recordings, there are a ton of tracks that are cut short and are not full versions of songs. The most extreme is the three-second Reba on 3/11/90. But there are also tons of short versions of songs played as part of segue-fests, including three Simple tracks on 6/17/94, one of them only ten seconds long. I don’t think there’s a simple answer for deciding which of these to drop as outliers. I’ll figure that out when I get to that project.
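Since there is no obvious rule for what to drop, one option is to flag suspicious tracks for manual review rather than removing them. The sketch below is only an illustration under stated assumptions: the rows and field names are hypothetical, the durations are invented except for the three-second Reba and ten-second Simple mentioned above, and the 30-second cutoff for a "short" track is an arbitrary placeholder.

```python
# Hypothetical rows for illustration. Only the 3-second Reba and the
# 10-second Simple come from the post; the other durations are made up.
tracks = [
    {"title": "Harry Hood > Fireworks Jam", "duration_s": 1900},
    {"title": "Tweezer > Kung > Tweezer", "duration_s": 1500},
    {"title": "You Enjoy Myself", "duration_s": 1300},
    {"title": "Reba", "duration_s": 3},
    {"title": "Simple", "duration_s": 10},
]

LONG_CUTOFF_S = 20 * 60   # the post looks at tracks of twenty minutes or more
SHORT_CUTOFF_S = 30       # assumed placeholder, not a value from the post


def review_flags(track):
    """Return a list of reasons a track may need a manual look."""
    flags = []
    # Assumes segues are written with ">" in the track title.
    multi_song = ">" in track["title"]
    if track["duration_s"] >= LONG_CUTOFF_S and multi_song:
        flags.append("long_multi_song")
    if track["duration_s"] < SHORT_CUTOFF_S:
        flags.append("possibly_cut_short")
    return flags


# Pair each track with its flags; nothing is dropped at this stage.
flagged = [(t["title"], review_flags(t)) for t in tracks]
needs_review = [title for title, flags in flagged if flags]
```

Keeping the flags separate from the data leaves the decision about outliers to each project, so a long single-song jam like the hypothetical twenty-plus-minute You Enjoy Myself row passes through untouched.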
Overall, the data is definitely tricky but it doesn’t seem any more complicated than a typical data set I might encounter in the education field, so it will be good practice to clean it up.