Data Science #5: Clustering - A Problem and A Solution
Sometimes it is easy to cluster things
My recent 'real life' data science problem was that the way we label show genres wasn't precise enough for the problems I am solving at iSpot.tv. In the same way that a Tic Tac ad may be hilarious to 30% of an audience but unfunny, off-putting, or PTSD triggering for the other 70% (I personally think this one is PTSD triggering), calling a show a 'Drama' was an imprecise, 'human-driven' label rather than a data-driven one.
We identified this problem when we found that 'Drama' modeled typical dramas like Grey's Anatomy, Chicago Med, etc. really well but failed to capture some more niche dramas. Picture a show that appeals to couples in their mid 30s, maybe with 2 little ones, who watch late night television together. That's a very different audience from single women in their 40s or young men in their 20s. People perceive things differently, so my goal was to figure out how and then put those shows into groups.
Without going into deep psychology and doing another PhD, I decided to use a data-driven approach - but what data to drive on? Should we try profiling individual TVs and aggregating those using similarity vectors? Or what about picking shows that share certain features or run similar ads? There are a lot of ways to form groups. I tried a few different data sources as the basis for this project, but ultimately what stuck was creating groups by how well audiences watched the ads. At iSpot.tv, we have access to how long each device watches an ad, and we'll call this 'percent viewed'. I decided that percent viewed would be my metric.
For each Ad, we had a plot like this. But Show Name isn't something that increases and decreases...
What our data looked like going into the matrix, where color = percent viewed
Demonstrating how the X-Y pair was achieved. Here we average the difference, so the X-Y pair value was 4
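As a rough sketch of the X-Y pair calculation, the snippet below averages the percent viewed difference over the ads two shows have in common. The show names, ad names, and numbers are made up for illustration, and taking the absolute difference is an assumption; the original calculation may have been signed.

```python
from itertools import combinations
from statistics import mean

# Hypothetical data: percent_viewed[show][ad] = average percent of the ad watched.
percent_viewed = {
    "Show X": {"ad_1": 80.0, "ad_2": 60.0, "ad_3": 90.0},
    "Show Y": {"ad_1": 75.0, "ad_2": 63.0, "ad_4": 50.0},
    "Show Z": {"ad_2": 40.0, "ad_3": 70.0},
}


def pair_difference(show_a, show_b, data):
    """Mean absolute percent viewed gap over ads that aired on both shows."""
    shared = sorted(set(data[show_a]) & set(data[show_b]))
    if not shared:
        return None  # no ads in common, so the pair has no value
    # Assumption: absolute difference, so gaps in opposite directions don't cancel.
    return mean(abs(data[show_a][ad] - data[show_b][ad]) for ad in shared)


# Build the symmetric show-by-show table.
pair_table = {}
for a, b in combinations(sorted(percent_viewed), 2):
    value = pair_difference(a, b, percent_viewed)
    pair_table[(a, b)] = value
    pair_table[(b, a)] = value

print(pair_table[("Show X", "Show Y")])
```

With these made-up numbers, Show X and Show Y share two ads with gaps of 5 and 3, so their pair value averages out to 4.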
The next part is easy: I set a few requirements about how many matching ads had to exist for reasonable statistics and put a cap on the standard deviation. A massive standard deviation meant the data was highly variable and therefore not well suited to my intention of using the result for predictions.
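A filter along those lines could look something like this. The thresholds (`min_shared_ads=3`, `max_std=10.0`) and the function name are placeholders picked for the example, not the values actually used.

```python
from statistics import mean, stdev


def pair_stats(diffs, min_shared_ads=3, max_std=10.0):
    """Return (mean, std) of per-ad differences for a show pair,
    or None if the pair fails either requirement.

    Both thresholds are hypothetical defaults for illustration.
    """
    # stdev needs at least two points, whatever the configured minimum is.
    if len(diffs) < max(min_shared_ads, 2):
        return None
    spread = stdev(diffs)
    if spread > max_std:
        return None  # too variable to trust for predictions
    return mean(diffs), spread


steady = [4.0, 5.0, 3.0, 4.0]    # consistent gaps across ads: kept
noisy = [1.0, 30.0, 2.0, 45.0]   # wildly varying gaps: dropped
too_few = [4.0, 4.0]             # not enough matching ads: dropped

for name, diffs in [("steady", steady), ("noisy", noisy), ("too_few", too_few)]:
    print(name, pair_stats(diffs))
```

Pairs that come back as `None` simply stay out of the table, so they can never be picked as a 'similar show' later.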
Now from here, I could have tried to make groups with specific labels. For example, I could have come up with 'Medical Dramas' that might have contained shows like Chicago Med and Grey's Anatomy along with House and others. However, it was a truly challenging task to define groups where all members of the group were at least somewhat similar. In fact, what I learned is that people who watch Star may watch Jerry Springer and Empire but people who watch Jerry Springer are unlikely to watch Empire. Maybe unshockingly, shows are like people too - multi-faceted.
With the multi-faceted discovery, I decided it was best just to go on a show by show basis. Now when I want to predict how an ad will do on a show, I look up the shows with the smallest difference in ad percent viewed from that show in the table and determine the specific ad's average performance on those shows. Kind of like how Vanguard uses the S&P 500 to create a stock portfolio that weathers storms better, I like to hedge my ad performance bets - except my method is data driven and based on a large base of historical data.
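In code, that lookup might be sketched like this. The table values, the ad history, and the choice of `k=2` nearest shows are all invented for the example; how many shows to average over is an assumption.

```python
from statistics import mean

# Hypothetical pair table: average percent viewed difference between two shows.
pair_table = {
    ("Show X", "Show Y"): 4.0,
    ("Show X", "Show Z"): 20.0,
    ("Show X", "Show W"): 6.0,
}

# Hypothetical history: percent viewed for one ad on shows where it already aired.
ad_history = {"Show Y": 70.0, "Show Z": 40.0, "Show W": 80.0}


def predict_percent_viewed(target, ad_history, pair_table, k=2):
    """Average the ad's percent viewed on the k shows most similar to target."""
    candidates = []
    for (show_a, show_b), diff in pair_table.items():
        # Accept the pair in either orientation.
        if show_a == target:
            other = show_b
        elif show_b == target:
            other = show_a
        else:
            continue
        if other in ad_history:
            candidates.append((diff, other))
    nearest = sorted(candidates)[:k]  # smallest differences first
    if not nearest:
        return None  # the ad has not aired on any comparable show
    return mean(ad_history[show] for _, show in nearest)


prediction = predict_percent_viewed("Show X", ad_history, pair_table)
print(prediction)
```

Here Show Y and Show W are the closest matches to Show X, so the estimate is the average of the ad's performance on those two shows, and Show Z is ignored.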
I'm now in error testing and implementation of this method (so far very promising!)... so what started as a 'well it should be easy to put these into groups' turned into a 'you have to find the method that works for your problem'. Cheers everyone!