Data Science #3: Analytics

I'm seeing a lot of data and statistics in this definition
Analytics is tough to teach because it's a fancy term for problem solving with data. That sounds easy, but if it's easy, why do we need data scientists, especially ones with PhDs? Are you good at pattern recognition? Are you good at noticing things that other people don't notice? Are you good at showing and explaining these patterns and trends to other people? Then congratulations: you have a good mental platform for analytical work.

Let's walk through an example of analytics and how I would approach it. If any of you have other ideas, please leave them in the comments so everyone can get a sense of the breadth of 'correct' solutions to an analytics challenge. Data science is constantly evolving because there is no single 'right' answer: this isn't a formula, it's an interpretation.

Temperature vs Dr. Lo's cupcake sales 
Our example: Dr. Lo has opened a bakery with her PhD! She has recorded how many cupcakes she sells at each outside temperature. She strongly suspects that people don't buy cupcakes when it's cold, so she wants to predict how many cupcakes she will sell based on the temperature outside.

Prediction = machine learning in data science, right? OK, great, the first tool in the machine learning toolbox is linear regression, especially if you have a prior hunch that colder = lower sales and warmer = better sales (linear!).

However, one look at the data shows you that it follows a curve rather than a straight line. You'd be amazed how many people just throw their data into a model without looking at it first! Plot a random sample if you have to! The plot on the right shows you that linear regression would be a silly choice for telling Dr. Lo how her cupcake sales are going to go. With linear regression, the max number of sales predicted is 83 cupcakes at 100 F. That's not right...
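Here is a minimal sketch of why a straight line struggles with this shape. The data below is a synthetic stand-in (a noisy hump near 50 F that I invented for illustration, not the numbers behind the plots), and the names `temps` and `sales` are my own assumptions.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the post's dataset):
# a noisy hump in sales centered near 50 F.
rng = np.random.default_rng(0)
temps = np.linspace(0, 100, 51)
sales = 83 - 0.03 * (temps - 52) ** 2 + rng.normal(0, 5, temps.size)

# A degree-1 polyfit is an ordinary least-squares straight-line fit.
slope, intercept = np.polyfit(temps, sales, 1)
linear_pred = slope * temps + intercept

# A straight line only goes up or down, so its largest prediction
# over any temperature range always sits at one end of that range.
peak_index = int(np.argmax(linear_pred))
print(f"linear fit peaks at {temps[peak_index]:.0f} F")
```

Whatever the noise does, the line can only put its maximum at the coldest or the hottest temperature, never in the middle where the hump is.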

Instead, without even developing a whole predictive model for Dr. Lo, let's keep it simple. Whether you have programming experience or not, you can look at the Temperature vs Dr. Lo's cupcake sales data and see that there is a hump at 40-60 F, right? Let's try a 2nd-order polynomial fit (numpy.polyfit). The curve fit gives a max predicted cupcake number of 83 at a far more appropriate 52.2 F. Much better.
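A rough sketch of that 2nd-order fit might look like the following. Again, the data is synthetic (invented here so the snippet runs on its own), so the printed numbers will not match the 83 cupcakes at 52.2 F from the real plots; `temps`, `sales`, `peak_temp`, and `peak_sales` are names I made up.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the post's dataset).
rng = np.random.default_rng(0)
temps = np.linspace(0, 100, 51)
sales = 83 - 0.03 * (temps - 52) ** 2 + rng.normal(0, 5, temps.size)

# polyfit returns coefficients highest power first: a*x**2 + b*x + c.
a, b, c = np.polyfit(temps, sales, 2)

# For a downward-opening parabola (a < 0) the maximum is at the vertex,
# x = -b / (2a).
peak_temp = -b / (2 * a)
peak_sales = np.polyval([a, b, c], peak_temp)
print(f"predicted peak: {peak_sales:.0f} cupcakes at {peak_temp:.1f} F")
```

The vertex formula only gives a maximum when `a` is negative, so it is worth checking the sign of `a` before trusting the peak.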

Curve fit gets the hump! 

Let's think about why that might happen: people want cupcakes on nice days instead of really cold or hot days (that's totally made up, this is randomly generated data with no real bakeries involved! But pretend!). Your job as a data scientist is also to come up with a reason why things happen based on the data. Why do you need to do this? So you can go to your next step, which for me here would be to sort by season, day of week, and overall weather conditions (rainy, sunny, partly cloudy, etc.) to see if there is another dimension that affects cupcake sales more strongly than temperature does. Caveat: be careful about slicing your data too finely. Too few samples per group is dangerous!
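One way to sketch that next step, with the too-few-samples caveat built in, is below. The records, the `summarize` helper, and the `MIN_SAMPLES` threshold are all hypothetical, made up purely to show the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (weather condition, cupcakes sold). Made-up values.
records = [
    ("sunny", 80), ("sunny", 75), ("sunny", 82), ("sunny", 78),
    ("rainy", 40), ("rainy", 45), ("rainy", 38),
    ("snowy", 10),
]

MIN_SAMPLES = 3  # arbitrary threshold, chosen only for illustration


def summarize(records, min_samples=MIN_SAMPLES):
    """Mean sales per group, skipping groups with too few samples."""
    groups = defaultdict(list)
    for condition, sold in records:
        groups[condition].append(sold)
    return {
        condition: mean(values)
        for condition, values in groups.items()
        if len(values) >= min_samples
    }


summary = summarize(records)
print(summary)
```

The single snowy day is dropped instead of being reported as if one sale told you anything about snowy days in general.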

So with literally 5 minutes of plotting, you could save Dr. Lo thousands of dollars and wasted cupcakes every year. By using the power of analytics (i.e. thinking about the data in context), you can dig deeper and blend in the more traditional 'data science' skills (i.e. machine learning) to create a truly powerful and profitable model.

How do you get better at analytics? A/B testing is a great way to quantify your success. You come up with an idea, you test it, and if it's better than the previous implementation, then ... success! You keep iterating until you hit the precision you're after. Mostly analytics is about combining being a person (people like cupcakes when it is... ) with finding ways to test and implement your notions (should I use confidence intervals or a predictive model?).
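As one possible way to quantify "better than the previous implementation", here is a sketch that puts a bootstrap confidence interval on the difference between two versions. The sales figures are randomly generated, and `bootstrap_diff_ci` is a helper I wrote for this example, not a library function; the percentile bootstrap is just one of several reasonable choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up daily sales under the old approach (A) and the new idea (B).
sales_a = rng.normal(50, 5, 200)
sales_b = rng.normal(55, 5, 200)


def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, rng=rng):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and record the lift.
        resample_a = rng.choice(a, size=a.size, replace=True)
        resample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = resample_b.mean() - resample_a.mean()
    tail = (1 - level) / 2 * 100
    low, high = np.percentile(diffs, [tail, 100 - tail])
    return low, high


low, high = bootstrap_diff_ci(sales_a, sales_b)
print(f"95% CI for the lift: [{low:.1f}, {high:.1f}]")
```

If the whole interval sits above zero, the new idea looks like a real improvement rather than noise; if it straddles zero, keep iterating or collect more data.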

My code to make the plots if you want to try some of your own analysis is at my (messy) github:
