Data Science #3: Analytics
I'm seeing a lot of data and statistics in this definition
Let's walk through an example of analytics and how I would approach it. If you have other ideas, please leave them in the comments so everyone can get a sense of the breadth of 'correct' solutions to an analytics challenge. Data science is constantly evolving because there is no single 'right' answer - this isn't a formula, it's an interpretation.
Temperature vs Dr. Lo's cupcake sales
Prediction = machine learning in data science, right? OK, great, the first tool in the machine learning toolbox is linear regression, especially if you have a prior hunch that colder = lower sales and warmer = better sales (linear!).
However, one look at the data shows you that a curve fits the data better than a straight line does. You'd be amazed how many people just throw their data into a model without looking at it first! Plot a random sample if you have to! The plot on the right shows you that linear regression would be a silly choice for telling Dr. Lo how her cupcake sales are going to go. With linear regression, the max number of sales predicted is 83 cupcakes at 100 F. That's not right...
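Here's a minimal sketch of why the straight line misleads, not my actual plotting code. The data is synthetic and hypothetical: `make_synthetic_sales` and its hump near 50 F are assumptions made up for illustration, so the numbers won't match the ones above. The point is that a line can only put its maximum at one end of the temperature range.

```python
import numpy as np

def make_synthetic_sales(n=200, seed=0):
    """Hypothetical data: sales peak near 50 F and fall off on both sides."""
    rng = np.random.default_rng(seed)
    temp_f = rng.uniform(0, 100, size=n)
    sales = 80 - 0.025 * (temp_f - 50) ** 2 + rng.normal(0, 5, size=n)
    return temp_f, np.clip(sales, 0, None)  # no negative cupcake sales

temp_f, sales = make_synthetic_sales()

# Degree-1 polyfit is ordinary least-squares linear regression.
slope, intercept = np.polyfit(temp_f, sales, deg=1)

# Evaluate the line across the whole temperature range.
grid = np.linspace(0, 100, 101)
linear_pred = slope * grid + intercept

# A straight line is monotonic, so its maximum sits at an endpoint.
temp_at_linear_max = grid[np.argmax(linear_pred)]
print(f"Linear fit predicts peak sales at {temp_at_linear_max:.0f} F")
```

Whatever the slope turns out to be, the predicted "best" temperature is 0 F or 100 F, which is exactly the kind of answer a quick look at the scatter plot would rule out.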
Instead, without even developing a whole predictive model for Dr. Lo, let's keep it simple. Whether you have programming experience or not, you can look at the Temperature vs Dr. Lo's cupcake sales data and see that there is a hump at 40-60 F, right? Let's try a 2nd-order polynomial fit (numpy.polyfit). The curve fit gives a max predicted cupcake number of 83 at a far more appropriate 52.2 F. Much better.
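A rough sketch of the 2nd-order fit with `numpy.polyfit` looks something like this. Again, the data here is synthetic and hypothetical (a made-up hump centered near 50 F), so it won't reproduce 52.2 F exactly. For a parabola a*x^2 + b*x + c that opens downward, the peak sits at x = -b / (2a).

```python
import numpy as np

# Hypothetical data for illustration only: a hump near 50 F plus noise.
rng = np.random.default_rng(42)
temp_f = rng.uniform(0, 100, size=300)
sales = 80 - 0.025 * (temp_f - 50) ** 2 + rng.normal(0, 5, size=300)

# polyfit returns coefficients from the highest degree down: a, b, c.
a, b, c = np.polyfit(temp_f, sales, deg=2)

# With a < 0 the parabola opens downward and the vertex is its maximum.
peak_temp = -b / (2 * a)
peak_sales = np.polyval([a, b, c], peak_temp)

print(f"Quadratic fit peaks at {peak_temp:.1f} F with {peak_sales:.0f} cupcakes")
```

It's worth checking that `a` really is negative before trusting the vertex: if the fitted parabola opens upward, -b / (2a) is the minimum, not the maximum.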
Curve fit gets the hump!
Let's think about why that might happen - people want cupcakes on nice days instead of really cold/hot days (that's totally made up, this is randomly generated data with no real bakeries involved! But pretend!). Your job as a data scientist is also to come up with a reason why things happen based on the data. Why do you need to do this? So you can go to your next step, which for me here would be to split the data by season, day of week, and overall weather conditions (rainy, sunny, partly cloudy, etc.) to see if there is another dimension that affects cupcake sales more strongly. Caveat - be careful about slicing your data too finely - too few samples per group is dangerous!
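That next step, with the sample-size caveat built in, could be sketched like this. Everything here is hypothetical: the `summarize_by` helper, the `MIN_SAMPLES` threshold of 5, and the tiny made-up records are assumptions for illustration, not part of my analysis.

```python
from collections import defaultdict
from statistics import mean

MIN_SAMPLES = 5  # hypothetical cutoff; pick one that suits your data

def summarize_by(records, key, min_samples=MIN_SAMPLES):
    """Group sales by one dimension and flag groups with too few samples."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["sales"])
    return {
        label: {
            "n": len(values),
            "mean_sales": mean(values),
            "reliable": len(values) >= min_samples,
        }
        for label, values in groups.items()
    }

# Made-up records: plenty of sunny days, only two rainy ones.
sunny_sales = [72, 68, 75, 80, 66, 71]
rainy_sales = [40, 55]
records = (
    [{"weather": "sunny", "sales": s} for s in sunny_sales]
    + [{"weather": "rainy", "sales": s} for s in rainy_sales]
)

summary = summarize_by(records, "weather")
for label, stats in summary.items():
    note = "" if stats["reliable"] else " (too few samples!)"
    print(f"{label}: n={stats['n']}, mean={stats['mean_sales']:.1f}{note}")
```

The rainy-day average looks dramatically lower, but with two data points it gets flagged rather than trusted.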
So in literally 5 minutes of plotting, you could save Dr. Lo thousands of dollars and wasted cupcakes every year. By using the power of analytics (i.e. thinking about the data in context), you can dig deeper and blend in the more traditional 'data science' skills (i.e. machine learning) to create a truly powerful and profitable model.
How do you get better at analytics? A/B testing is a great way to quantify your success. You come up with an idea, you test it, and if it's better than the previous implementation, then ... success! You keep iterating until you hit the precision criteria you'd like. Mostly, analytics is about combining being a person (people like cupcakes when it is... ) with identifying ways to implement your notions (should I use confidence intervals or a predictive model?).
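One way to decide whether "better than the previous implementation" is real or just noise is a permutation test. This is only a sketch of one option among many: the `permutation_test` helper and the daily sales numbers are hypothetical and made up for illustration.

```python
import random

def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means (b minus a)."""
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: if A and B are equivalent, any split is as likely.
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(new_b) / len(new_b) - sum(new_a) / len(new_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Made-up daily cupcake sales under the old (A) and new (B) approach.
sales_a = [50, 52, 48, 51, 49, 50, 53, 47]
sales_b = [60, 62, 58, 61, 59, 60, 63, 57]

observed, p_value = permutation_test(sales_a, sales_b)
print(f"B sold {observed:.1f} more cupcakes per day on average (p = {p_value:.4f})")

# Sanity check: comparing a sample with itself should show no effect.
_, p_same = permutation_test(sales_a, sales_a)
```

A small p-value says the gap is unlikely to come from shuffling labels at random; it doesn't tell you why B did better, which is where the 'being a person' part comes back in.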
If you want to try some of your own analysis, my code to make the plots is on my (messy) GitHub: https://github.com/loisks317/Analytics