Why Phish Data?

19 Apr 2020

I’m about two weeks into my big dive into teaching myself data science, and everyone says the first big step is to do my own project: create something that interests me, that pushes me to figure out new things when I hit roadblocks, and that gives potential employers some insight into how I think and solve problems. My first idea was Phish stats. I’ve spent the majority of the last two weeks trying to convince myself it was a bad idea and to think of something more practical to start with. I failed at this task and kept finding new reasons why it is the right place to start.

But aren’t there already a lot of Phish stats online, and even a deep learning project that predicts setlists with like 20% accuracy? Yeah, there are, and I’m excited to learn how to code better by trying to replicate some of the things you can already find from zzyzx or phantasytour. I’m also excited to satisfy my own curiosity and answer questions like: Are there songs that are more likely to be played in a certain month or on a certain day of the week (beyond Auld Lang Syne and Friday)? Are there certain songs that are more likely to end up in a show together (other than a Mike’s Groove or Horse>Silent)? Do some Halloween sets have more impact on future setlists than others? I’m also excited to hear questions from other people and try to find interesting ways to answer them.
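To make the day-of-the-week and played-together questions a little more concrete, here is a minimal sketch of how the counting might start. Everything in it is an assumption for illustration: the `SHOWS` rows are made-up placeholders with generic song names (not real setlists), and `plays_by_weekday` and `pair_counts` are hypothetical helper names, not anything from an existing Phish stats site or API.

```python
from collections import Counter
from datetime import date
from itertools import combinations

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up placeholder shows, NOT real setlists: (show date, songs played).
SHOWS = [
    (date(2020, 1, 3), ["Song A", "Song B", "Song C"]),   # a Friday
    (date(2020, 1, 4), ["Song B", "Song C", "Song D"]),   # a Saturday
    (date(2020, 1, 10), ["Song A", "Song C"]),            # a Friday
    (date(2020, 1, 11), ["Song B", "Song D"]),            # a Saturday
]


def plays_by_weekday(shows):
    """For each song, count how many shows on each weekday included it."""
    counts = {}
    for show_date, songs in shows:
        day = WEEKDAYS[show_date.weekday()]  # weekday(): Monday is 0
        for song in set(songs):              # count a song once per show
            counts.setdefault(song, Counter())[day] += 1
    return counts


def pair_counts(shows):
    """Count how many shows each unordered pair of songs appeared in together."""
    pairs = Counter()
    for _, songs in shows:
        # Sorting makes ("Song A", "Song B") and ("Song B", "Song A") one key.
        pairs.update(combinations(sorted(set(songs)), 2))
    return pairs


if __name__ == "__main__":
    for song, days in sorted(plays_by_weekday(SHOWS).items()):
        print(song, dict(days))
    for pair, n in pair_counts(SHOWS).most_common(3):
        print(pair, n)
```

Raw counts like these are only a starting point. To say a song is "more likely" on a given day, the counts would still need to be divided by how many shows happened on that day, and the pair counts compared against how often each song shows up on its own.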

But aren’t Phish setlists pretty random, and wouldn’t trying to create a predictive model be essentially impossible? Yeah, and that’s perfect for what I’m trying to do. If I wanted to figure out how to code a deep learning model like the one mentioned above, or one of the more basic language predictor ones that have popped up in the forums, I probably could do that, move on, and be able to say I did it. But the fact that the setlists are hard to predict will keep me from focusing too much on feature engineering just for the sake of getting a better model. Instead, it will let me focus on interesting statistics, on finding ways those stats interact with each other, and on figuring out how to communicate what I find to others. I also think it would be kinda cool to build a model up from the stats, like a rules-based language model, instead of letting a neural network take all the fun out of it.

But, like, Phish stats aren’t going to be that interesting to that many people. I know. But the Phish world is a special community filled with thoughtful people who care about community, doing things they love, and being in the moment. If this work somehow ends up catching the eye of a fan who is hiring or looking for help on a project, then I could be pretty confident I’d be working with someone whose priorities align with my own.

Who knows, maybe I’ll get my ass handed to me every day, and realize that I screwed up. Then do it two more times.