Predicting Phish Set Closers

22 Jun 2020 -

As mentioned in a couple of my previous posts, I decided to make a simple predictor for my first machine learning project. I wanted to be able to predict whether the song that Phish just started playing was going to close the set. I figured this would be simple enough to understand, practical (“Should I run to the bathroom before the lines get long during setbreak?”), and could build a foundation for a few more complicated models (like predicting how many more songs they are going to play).

Equate My Life with Sand

Soundcheck

I wanted to keep this model as simple as possible, so I limited it to only three simple inputs: what song is being played, how many songs have already been played, and what time is it? The model could combine those simple inputs in the background into more robust features, but someone at a show won’t be trying to remember what other songs were played, how many songs were longer than twenty minutes, or any other features that could increase the accuracy of the model. With those constraints, I jumped into making the model.

Set 1

After brainstorming a bunch of different ways to combine the data, I ended up with 25 different features for each track played:

  1. The song being played
  2. Whether or not it was the set closer (what we are trying to predict)
  3. The song number for the total show
  4. What set the song is in
  5. The song number for the set it is in
  6. A rolling average of songs per set for the previous 10, 50, and 100 shows. (How many songs were in the first set on average over the last 50 shows)
  7. A fraction of the song number in the set over the rolling averages mentioned above. (If song 9 just started and there have been an average of 10 songs in the first set over the last 50 shows then the model would say that this song was 90% into an average set.)
  8. The percentage of time the specific song opens a set, is the second song in a set, is the song before the set closer, and closes a set.
  9. How many minutes into the set did the song start?
  10. A rolling average of set durations over the last 10, 50, and 100 shows.
  11. The same fraction as in number 7, but using the amount of time into the set over the average set duration (instead of the song number).
  12. An average set placement for the song being played over the last 10, 50, and 100 shows. (If a song started at the 20-minute mark of an 80-minute set, it would have a set placement of 0.25; this number averages the set placement for the particular song over the most recent shows.)

I described many of these data points in my other posts on this topic; the rest are mostly just different combinations of those.
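For anyone curious what that looks like in code, here is a rough sketch of how a couple of the rolling features could be built with pandas. The column names (show_date, set1_song_count, set1_duration_min) are made up for the example, but the idea matches the rolling-average and fraction-of-an-average-set features above.

```python
import pandas as pd

# Hypothetical per-show summary: one row per show with the number of songs
# and total duration (in minutes) of the first set.
shows = pd.DataFrame({
    "show_date": pd.to_datetime(["2019-06-11", "2019-06-12", "2019-06-13"]),
    "set1_song_count": [9, 10, 8],
    "set1_duration_min": [72, 80, 68],
}).sort_values("show_date")

# Rolling averages over the previous N shows (shifted so the current show
# doesn't leak into its own average).
for n in (10, 50, 100):
    shows[f"avg_set1_songs_last_{n}"] = (
        shows["set1_song_count"].shift(1).rolling(n, min_periods=1).mean()
    )
    shows[f"avg_set1_duration_last_{n}"] = (
        shows["set1_duration_min"].shift(1).rolling(n, min_periods=1).mean()
    )

# Fraction-of-an-average-set feature for a track: song number 9 against an
# average of 10 songs per first set over the last 50 shows -> 0.9.
song_number_in_set = 9
frac_of_avg_set = song_number_in_set / shows["avg_set1_songs_last_50"].iloc[-1]
```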

Setbreak

I knew that there was a high chance that Jim would be the set closer for set 1, night 1 of Fenway 2019, so I left a little early to beat the bathroom line.

Set 2

Now that I had all my data, it was time to train a model. I had learned about the LightGBM model for predicting a binary outcome (is it the set closer or not) from the feature engineering mini-course on Kaggle, and it seemed like the right model to use for my first ML project. I lifted all of that code and started making adjustments to fit my data.

As I expected, I had a little trouble updating the code to match my data. I didn’t realize that I hadn’t saved my data with the right extension or with headers, and I had forgotten to update a couple of the column headers. Once I got that cleaned up, the next thing I realized was that the song title and set number had to be converted to numbers (since set E is not a number). For the song title, I went with the count of times the song shows up in the data. This was mainly because there are a ton of songs that have been played only 1-3 times, and since they haven’t been played enough to create a pattern, it makes more sense to treat all of them as a group. (“Will a debut song end the set?” seems like a better question than “Will You Ain’t Goin’ Nowhere or Free Man in Paris end the set?”) Once I used an encoder for this step, I had a little trouble figuring out when to drop the categorical columns in the code and how to make sure I did it correctly for the training, validation, and test data, but fortunately the code I pulled from Kaggle made that easier to figure out than I expected. (My first model was actually 100% accurate, but that was because I left in the column that said whether the song was a set closer, so the model had two columns with the same data and it was a direct mapping. Oops.)
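Something like the CountEncoder from the category_encoders package (which the Kaggle course uses) would handle that encoding step; here is a sketch with made-up column names, and I’m count-encoding the set column too just for illustration:

```python
import pandas as pd
from category_encoders import CountEncoder

# Hypothetical slice of the track data; "song" and "set" are the two
# categorical columns that need to become numbers.
tracks = pd.DataFrame({
    "song": ["Tweezer", "Free Man in Paris", "Cavern", "Tweezer"],
    "set": ["1", "2", "E", "1"],
    "set_closer": [0, 0, 1, 1],
})

# Replace each song with the number of times it appears in the data, so all
# of the rarely played songs end up grouped together near 1-3 instead of
# being hundreds of separate categories.
encoder = CountEncoder(cols=["song", "set"])
encoded = encoder.fit_transform(tracks[["song", "set"]])

# Keep the encoded columns and drop the original text columns.
tracks = tracks.join(encoded.add_suffix("_count")).drop(columns=["song", "set"])
```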

The way that the LightGBM model works is that you take all of your data and split it into three groups: a training set, a validation set, and a testing set. You pass in the training set (which in most cases is about 80 percent of the data) and the model automatically generates a bunch of decision trees, using an iterative process to figure out which order to put the decision points in and what boundary levels to set for each decision. It builds a bunch of different trees and combines their results to produce a probability that each song is the set closer. Then the model makes a final categorization for every track by essentially flagging the highest probabilities as set closers and checking to see if they were right. During this process, the model constantly checks how each new version performs on a smaller set called the validation set, which is about 10 percent of the data. Every time a new decision is added to a tree or a slight adjustment is made to a boundary level, the model checks its accuracy against that 10 percent; if a change makes it more accurate, it keeps it and moves on to a new tweak, and if not, it drops it and tries a different one. Eventually the model stops when it has tried a bunch of new tweaks and none of them make it any better.
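That isn’t my exact code, but a minimal sketch of how a LightGBM binary classifier with a validation set and early stopping typically gets set up looks something like this. It assumes `train` and `valid` DataFrames from the chronological split described below, each with the engineered features plus an `is_set_closer` label column (the names are illustrative):

```python
import lightgbm as lgb

# Assumed inputs: `train` and `valid` DataFrames with the engineered
# features plus an `is_set_closer` label column.
feature_cols = [c for c in train.columns if c != "is_set_closer"]

dtrain = lgb.Dataset(train[feature_cols], label=train["is_set_closer"])
dvalid = lgb.Dataset(valid[feature_cols], label=valid["is_set_closer"])

params = {"objective": "binary", "metric": "auc", "num_leaves": 64, "seed": 7}

# Boosting adds one tree at a time; early stopping quits once the score on
# the validation set stops improving for 20 rounds in a row.
bst = lgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    valid_sets=[dvalid],
    callbacks=[lgb.early_stopping(stopping_rounds=20)],
)

# Probability that each track in the validation set is a set closer.
valid_probs = bst.predict(valid[feature_cols])
```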

After a model has been created with the process above, you pass in some test data to see how accurate the model is against data that wasn’t available while it was being created. That value is the test score you can see on the dashboard. To make the model more interactive, I made it so you can pick any show and have the model be built from just the shows before that one. For example, if you put in a show from the Baker’s Dozen, the modeling shouldn’t include a bunch of Kasvot Voxt songs. So as soon as you pick a date, the model updates by first dropping all the data from after that date, then creating the training, validation, and testing sets: the training data is the first 80 percent of shows (from the first show ever up to the cutoff 20 percent before the chosen date), the validation data comes from the next 10 percent of shows, and the testing data is the most recent 10 percent of shows, since you would want the model to be accurate for the most recent shows.
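That date-based split could look roughly like this sketch, again with a made-up show_date column on the per-track DataFrame:

```python
import pandas as pd

def chronological_split(tracks: pd.DataFrame, chosen_date: str):
    """Split track data 80/10/10 by show date, using only shows before
    the chosen date so nothing from the 'future' leaks into the model."""
    past = tracks[tracks["show_date"] < pd.Timestamp(chosen_date)]
    show_dates = past["show_date"].drop_duplicates().sort_values()

    train_cutoff = show_dates.iloc[int(len(show_dates) * 0.8)]
    valid_cutoff = show_dates.iloc[int(len(show_dates) * 0.9)]

    train = past[past["show_date"] < train_cutoff]
    valid = past[(past["show_date"] >= train_cutoff) & (past["show_date"] < valid_cutoff)]
    test = past[past["show_date"] >= valid_cutoff]
    return train, valid, test
```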

Another common step in modeling is limiting the number of features that you shove into the model. Most of the time, having 25 different features is overkill, not only because it leads to a ton of extra computing that usually only improves the accuracy by a tiny amount, but also because it can lead to what is called “over-fitting,” where the model gets extremely accurate on the data it was trained on but doesn’t generalize well to new shows. To handle this, I used a process that grabs only the most important features and then builds the model from those, and you get to choose how many features in the bottom-right drop-down menu. Those top features will change depending on the show, and you can see their relative importance in the graph on the bottom left. The absolute height of a bar in this graph isn’t very important, but the tallest bar corresponds to the feature that contributes the most information to the probabilities that were generated.
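I’m guessing this is the univariate selection approach from the same Kaggle course, which uses scikit-learn’s SelectKBest; either way, the idea looks something like this sketch (reusing the hypothetical train/valid/test frames and column names from above):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep only the k features with the strongest relationship to the label
# (k is what the bottom-right drop-down would control). Fit on the training
# data only, so the validation and test sets stay untouched.
k = 5
selector = SelectKBest(f_classif, k=k)
selector.fit(train[feature_cols], train["is_set_closer"])

selected_cols = [c for c, keep in zip(feature_cols, selector.get_support()) if keep]
train_sel, valid_sel, test_sel = (
    df[selected_cols + ["is_set_closer"]] for df in (train, valid, test)
)
```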

Finally, only the songs from the show that was chosen are split off into a sliver of data used to create the prediction percentages and calculate the accuracy for that specific show. Ultimately, I don’t think people start wondering if it’s the last song until about 45 minutes into the set, but it was easy to add a drop-down and let you decide how many songs to predict, so you can see the whole setlist or just the songs that start after the 60-minute mark. (This number gets adjusted automatically to make sure there are at least 4 songs in the prediction set, because the prediction algorithm doesn’t work unless there is at least one song that isn’t the closer.) The model then checks the probabilities against whether or not each song was actually a set closer, and that is where the “predict score” comes from.
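Putting that together for a single show might look like the sketch below. The minutes_into_set column and the roc_auc_score metric are my assumptions here (I’m not certain exactly how the dashboard computes its “predict score”), and it assumes the booster was retrained on just the selected columns:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Only predict on songs from the chosen show that start after the cutoff,
# padded back out so at least 4 songs (including at least one non-closer)
# end up in the prediction set.
minutes_in = 60
chosen_date = pd.Timestamp("2019-07-05")  # whatever show was picked
show = tracks[tracks["show_date"] == chosen_date]
candidates = show[show["minutes_into_set"] >= minutes_in]
if len(candidates) < 4:
    candidates = show.tail(4)

# Probabilities from the trained booster, scored against what actually happened.
probs = bst.predict(candidates[selected_cols])
predict_score = roc_auc_score(candidates["is_set_closer"], probs)
```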

Next Steps

Ultimately, I’d like to turn this into a dynamic thing that updates when new shows are played and is always accessible during a show, so you can pull it up before you decide whether you want to beat the line to the Camden ferry. Maybe a Twitter or Reddit bot that can respond to posts with the closing percentages. I’ll have to learn a lot about web scraping and about updating the background database with setlists and song lengths after every show. But I have at least a year to figure that out before the next shows.

There are a few limitations to the model that I’ll likely improve on in the future. The first is some improvements to the input data that I used. One example is adjusting the “percent closes” feature to cover recent shows instead of all time. Coil and Hood have closed sets about 20% of the times they were played overall, but for the last five years or so, both songs have closed the set more like 75% of the times they were played. So I will likely add a feature like “percent closes in the last 100 shows,” which checks what percentage of the time the song closed a set in the last 100 shows instead of over the whole tenure of the song. Another stat I left out was song duration, which could be important. Both Cavern and Slave close sets about the same percentage of the time, but if you are 65 minutes into a set, there’s probably a better chance that the song that’s about twice as long will be the set closer.
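That future feature could be computed with something like this sketch (hypothetical column names again):

```python
import pandas as pd

def pct_closes_recent(tracks: pd.DataFrame, song: str, show_date, n_shows: int = 100) -> float:
    """Fraction of the song's appearances in the previous n_shows shows
    that closed a set (hypothetical column names)."""
    recent_shows = (
        tracks.loc[tracks["show_date"] < show_date, "show_date"]
        .drop_duplicates()
        .sort_values()
        .tail(n_shows)
    )
    plays = tracks[tracks["show_date"].isin(recent_shows) & (tracks["song"] == song)]
    return plays["is_set_closer"].mean() if len(plays) else 0.0
```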

You might wonder why the early songs have probabilities around 1-3% when really they should be zero. That’s because there are a bunch of “shows” in the Phish.in database that were TV appearances or soundchecks, and the model factors those in. There is a similar problem with shows that are just one set, because those first sets tend to be much longer. In a future version, I will likely drop the appearance and soundcheck shows and analyze whether I should drop one-set shows too. Also, encores are messy for various reasons. I might just need to make a separate model that predicts how many encore songs they will play instead of trying to guess whether the first encore song is the last encore song.

It would also probably make sense to build completely separate models for each set in the show, with one model that focuses on whether a song is going to close the first set and another for the second set. Not only would that make the modeling a little more accurate in general, it would avoid issues like the one in the 11-29-19 show, where the model correctly assigns the highest probability in each set to Fire and Walk Away, but the predict score still isn’t 100% because Sand (the next-to-last set 1 song) had a slightly higher chance of being a set closer than Walk Away did.

Finally, one of the main goals of using models like this is to make tons of changes and use different types of models until you find the one that is the most accurate and runs the fastest. I didn’t do that with this model because a) that process would come up with a different set of inputs and model optimizations for every single show and b) this is a fun exercise in trying to force this process onto something that ultimately doesn’t matter. Similar models may help you predict at halftime who is going to win a basketball game or which people are most likely to decide that they can’t vote for a racist for president again. In those cases it could make a big difference to make the model a percentage point more accurate. But, not for this.

As always, I’m open to feedback, questions, and feature requests! My code is up on GitHub and uses Python, Flask, and Dash.