Data Science #5: Clustering - A Problem and A Solution

[Figure: Sometimes it is easy to cluster things]

So clustering is a really open-ended way in data science of saying 'I want to put things in groups that aren't already in groups.' Fair, right? But how you want to put them into groups, and what you're grouping, are the much more difficult questions.

My recent 'real life' data science problem was that how we label show genres wasn't precise enough for the problems I am solving. In the same sense that showing a Tic Tac ad may be hilarious to 30% of an audience but unfunny/off-putting/PTSD triggering for the other 70% (I personally think this is PTSD triggering), calling a show a 'Drama' was an imprecise, 'human-driven' solution instead of a data-driven one.

We identified this problem when we found that the 'Drama' label modeled typical dramas like Grey's Anatomy, Chicago Med, etc. really well but failed to capture some more niche dramas. Picture a show that appeals to couples in their mid 30s who watch late night television together and maybe have 2 little ones. That's a very different audience than single women in their 40s or young men in their 20s. People perceive things differently, so my goal was to figure out how and then put those shows into groups.

Without going into deep psychology and doing another PhD, I decided to use a data-driven approach - but what data to drive on? Should we try profiling individual TVs and aggregating those using similarity vectors? Or what about picking shows that share different features / run similar ads? There are a lot of ways to tackle obtaining groups. I tried a few different data sources for this project, but ultimately what stuck was creating groups by how well audiences watched the ads. We have access to how long each device watches an ad for, and we'll call this 'percent viewed'. I decided that percent viewed would be my metric.

[Figure: For each ad, we had a plot like this. But show name isn't something that increases and decreases...]

The next step was to do the actual clustering. Unfortunately, unsupervised learning methods like k-means are not quite the right answer here. Why? Because k-means groups points by the distance between them, so it needs an x and a y that are both meaningful. The figure above shows what our plots looked like in this case... when looking on an ad-by-ad basis, percent viewed on the y-axis made a lot of sense; it was a number that changed in time. However, show name (or ad name... the two were interchangeable depending on what you wanted to group by) is not a numerical value. Of course you can assign it a numerical value, but let's say we give Chicago Med a 1 and Grey's Anatomy a 2. That means that Chicago Med and Grey's Anatomy are then spatially close on our X-Y plot, which totally fucks up our clustering mean... distance on x is arbitrarily assigned. Gross.
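To make the 'arbitrary distance' complaint concrete, here is a minimal sketch. The percent viewed numbers and both label assignments are made up for illustration; the only point is that the distance between the same two shows changes when nothing but the labeling changes.

```python
import math

# Made-up percent viewed values, for illustration only.
percent_viewed = {"Chicago Med": 80.0, "Grey's Anatomy": 78.0}

def distance(show_a, show_b, encoding):
    """Euclidean distance between two shows plotted as (label, percent viewed)."""
    dx = encoding[show_a] - encoding[show_b]
    dy = percent_viewed[show_a] - percent_viewed[show_b]
    return math.hypot(dx, dy)

# Two equally arbitrary ways to turn show names into numbers.
encoding_a = {"Chicago Med": 1, "Grey's Anatomy": 2}
encoding_b = {"Chicago Med": 1, "Grey's Anatomy": 30}

d_a = distance("Chicago Med", "Grey's Anatomy", encoding_a)
d_b = distance("Chicago Med", "Grey's Anatomy", encoding_b)

# Same shows, same viewing data, very different distances.
print(round(d_a, 2), round(d_b, 2))
```

Any distance-based method fed these points would be grouping shows by my numbering scheme rather than by how audiences behave.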

[Figure: What our data looked like going into the matrix, where color = percent viewed]

So back to the drawing board, right? We could attempt to predefine groups so that the x-axis would have meaningful values, but that's a case where we have to know the answer in order to solve the problem (d'oh). Instead, I decided to pursue something called a pairwise matrix: a 2-dimensional matrix where X = Show Name and Y = Show Name. To get an X-Y pair, I first found every ad that aired on both shows (if an ad was seen multiple times on a show, I averaged its percent viewed). For each of those matching ads I took the difference between show X's percent viewed and show Y's percent viewed, and then averaged those differences to get the X-Y value. A show compared with itself has a difference of 0, so the diagonal of this matrix was all 0s.

[Figure: Demonstrating how the X-Y pair was achieved. Here we average the difference, so the X-Y pair value was 4]

At this point, if we looked at X = 1, so our Chicago Med, we'd find a string of values describing how 'different' Chicago Med's percent viewed was from each other show's. Maybe Grey's Anatomy had 3 matching ads with Chicago Med (see the figure above). Here, the X-Y pair value was 4 percentage points.
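Here is a rough sketch of how a pairwise matrix like this could be built. It is not my production code: the observations are invented so that the Chicago Med / Grey's Anatomy pair comes out to 4, the function names are hypothetical, and I am assuming a signed difference (show X minus show Y).

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Invented (show, ad, percent viewed) observations, purely for illustration.
observations = [
    ("Chicago Med", "ad_a", 78.0),
    ("Chicago Med", "ad_a", 82.0),      # repeated airing, averaged to 80
    ("Chicago Med", "ad_b", 60.0),
    ("Chicago Med", "ad_c", 70.0),
    ("Grey's Anatomy", "ad_a", 75.0),
    ("Grey's Anatomy", "ad_b", 58.0),
    ("Grey's Anatomy", "ad_c", 65.0),
    ("Grey's Anatomy", "ad_d", 50.0),   # never aired on Chicago Med, so ignored
]

def average_by_show_and_ad(rows):
    """Collapse repeated airings into one mean percent viewed per (show, ad)."""
    grouped = defaultdict(list)
    for show, ad, pct in rows:
        grouped[(show, ad)].append(pct)
    per_show = defaultdict(dict)
    for (show, ad), values in grouped.items():
        per_show[show][ad] = mean(values)
    return per_show

def pairwise_matrix(per_show):
    """Mean signed percent viewed difference over the ads two shows share."""
    matrix = {}
    for x, y in product(list(per_show), repeat=2):
        shared = sorted(set(per_show[x]) & set(per_show[y]))
        if shared:
            diffs = [per_show[x][ad] - per_show[y][ad] for ad in shared]
            matrix[(x, y)] = mean(diffs)
    return matrix

per_show = average_by_show_and_ad(observations)
matrix = pairwise_matrix(per_show)

# Matching ads differ by 5, 2 and 5, so the pair value is 4.0.
print(matrix[("Chicago Med", "Grey's Anatomy")])
```

With a signed difference the matrix is antisymmetric (the Grey's Anatomy / Chicago Med entry is -4) and the diagonal is 0, which is a handy sanity check.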

The next part is easy: I set a few requirements on how many matching ads had to exist for reasonable statistics, and put a cap on the standard deviation. A massive standard deviation meant the data was highly variable and therefore not the best for my intention of using the result for predictions.
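As a sketch of what those requirements could look like in code: both thresholds below are hypothetical placeholders, not the values I actually used, and the function name is made up.

```python
from statistics import mean, stdev

MIN_MATCHING_ADS = 3   # hypothetical threshold
MAX_STD_DEV = 5.0      # hypothetical threshold, in percentage points

def summarize_pair(differences):
    """Return the mean difference for a show pair, or None if it fails a check."""
    if len(differences) < MIN_MATCHING_ADS:
        return None    # too few matching ads for reasonable statistics
    if stdev(differences) > MAX_STD_DEV:
        return None    # too variable to trust for predictions
    return mean(differences)

stable = summarize_pair([5.0, 2.0, 5.0])      # kept
too_few = summarize_pair([5.0, 2.0])          # rejected: only 2 matching ads
too_noisy = summarize_pair([30.0, -25.0, 7.0])  # rejected: huge spread
print(stable, too_few, too_noisy)
```

Note that the last pair also averages to 4, which is exactly why the mean alone isn't enough to decide whether two shows are similar.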

Now from here, I could have tried to make groups with specific labels. For example, I could have come up with 'Medical Dramas' that might have contained shows like Chicago Med and Grey's Anatomy along with House and others. However, it was a truly challenging task to define groups where all members of the group were at least somewhat similar. In fact, what I learned is that people who watch Star may watch Jerry Springer and Empire but people who watch Jerry Springer are unlikely to watch Empire. Maybe unshockingly, shows are like people too - multi-faceted.

With the multi-faceted discovery, I decided it was best just to go on a show-by-show basis. Now when I want to predict how an ad will do on a show, I look up the shows with the smallest percent viewed difference from it in the table and average the specific ad's performance on those shows. Kind of like how Vanguard uses the S&P 500 to create a stock portfolio that weathers storms better, I like to hedge my ad performance bets - except my method is data driven and based on a large base of historical data.
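A minimal sketch of that lookup, under a few assumptions of mine: the numbers are invented, 'smallest difference' is taken to mean smallest absolute value, and the choice of k = 2 neighbours and the function name are hypothetical.

```python
from statistics import mean

# Invented row of the pairwise table: mean difference of each show vs. the target show.
differences_from_target = {
    "Grey's Anatomy": 4.0,
    "House": -1.5,
    "Jerry Springer": 12.0,
}

# Invented historical percent viewed for one specific ad on each show.
ad_percent_viewed = {
    "Grey's Anatomy": 72.0,
    "House": 76.0,
    "Jerry Springer": 40.0,
}

def predict_percent_viewed(differences, ad_history, k=2):
    """Average the ad's percent viewed over the k most similar shows that aired it."""
    candidates = [show for show in differences if show in ad_history]
    nearest = sorted(candidates, key=lambda show: abs(differences[show]))[:k]
    if not nearest:
        return None    # the ad has no history on any comparable show
    return mean(ad_history[show] for show in nearest)

prediction = predict_percent_viewed(differences_from_target, ad_percent_viewed)
print(prediction)  # House and Grey's Anatomy are closest, so (76 + 72) / 2
```

Averaging over a handful of similar shows, instead of trusting the single closest one, is the 'hedging' part.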

I'm now in error testing and implementation of this method (so far very promising!)... so what started as a 'well it should be easy to put these into groups' turned into a 'you have to find the method that works for your problem'. Cheers everyone!



Seungjin Baek said...

Hey Lois, I like your blog. Thanks for the good readings!

Lois Keller said...

Thanks Michael!! Much appreciated :D Hope your job hunt is going well!

Anonymous said...

Great Post as always.

Charles McEachern said...

The post ended just as I was getting hungry to see your results!

Lois Keller said...

Charles, ha! I know, right? Soon enough...