Data Science #4.1: Machine Learning

Machine Learning? 
So my latest pet peeve over the last few months is how people drop 'machine learning' as a buzzword to sound intelligent. It'll come up in conversation like "Oh yeah, I'd love to do some machine learning on this problem I'm working on" or "I think deep learning is your best solution" [how someone can say that with a straight face, I have no idea]. When you drop machine learning as a buzzword, your audience is going to have one of three responses: (A) Wow! They are using words like AI - they must be really smart! (B) This person probably knows more about this than I do, I'll just BS too, or (C) I think they have no idea what they are talking about.

The goal of this post/series is to get you to (C) - you're not going to be an expert after reading this post, but I'm hoping it will give you a better understanding of what machine learning is, what types there are, why people are so obsessed with it, and how you can start learning too.

What is Machine Learning?

Linear Regression. Now that's not scary. 
Without just straight stealing definitions from Google that you could find yourself, machine learning is a suite of statistical methods that are used together to either 'predict' a result or 'fill in' a missing one based on known parameters. The simplest machine learner is... linear regression. Actually, the simplest would probably be the ol' 4th grade method of drawing a best fit line through a data set. The most advanced would probably be deep learning, which I am not going to touch here.
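To show how un-scary that best fit line really is, here's a minimal sketch of linear regression in plain Python. The data (hours studied vs. test score) and the fit_line helper are made up for illustration; in practice you'd probably reach for a library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, intercept = fit_line(hours, scores)

# The 'predict' part: estimate the score for 6 hours of studying
predicted = slope * 6 + intercept
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

That's the whole idea in miniature: learn a trend from known data, then use it on a value you haven't seen.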

What Types are There? 

How many different types of machine learning algorithms are there? A lot - because it's using statistics to derive a trend (see image below, the green boxes are machine learners). But don't just think statistics = machine learning. For example, confidence intervals are not machine learning.  They have no predictive or 'fill in' element. Keep that in mind when you hear machine learning... it's advanced (and effective) statistics with predictive power.
Thank you scikit-learn - this is the best machine learning 'road map' I've found to date (personally)

OK, so now you want to try to use machine learning to solve your problem (you think) and you're completely overwhelmed by the road map above. The next goal is to think about what type of problem you have. This is where the terminology starts to bog down, so bear with me.

Random Forest, what a great name 
There are essentially two kinds of problems: supervised (labeled) and unsupervised (unlabeled). Labeled data means that you have columns of data with some sort of 'result' attached to each row, whether it's 'did watch', 'didn't dream', etc. Unlabeled data may still sit in nice neat categories like "gender", "favorite song", etc., but there is no 'result' column, so you don't actually know if there is a significant pattern, like that boys prefer cheese to pepperoni. Still no idea what I'm talking about? Check out this example: labeled vs unlabeled.

It's critical you decide what type of data you have on hand and what your goal is. Are you trying to find something cool about how your data clusters into groups (small boys and older women like cheese?) but have no idea what it may be? Go unlabeled, try clustering, and give k-means and mini-batch k-means a shot.
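Here's a rough sketch of what that clustering attempt could look like with scikit-learn's KMeans and MiniBatchKMeans. The ages and pizza numbers are invented for illustration, and picking 3 clusters is an assumption on my part; with real data you usually have to experiment.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Made-up unlabeled data: [age, cheese slices eaten per week].
# Note there is no 'result' column here - that's what makes it unlabeled.
X = np.array([
    [8, 5], [9, 6], [10, 5],     # young kids
    [35, 1], [40, 2], [38, 1],   # adults
    [65, 5], [70, 6], [68, 6],   # older adults
], dtype=float)

# Ask k-means to split the rows into 3 groups (3 is a guess we chose)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print("cluster assigned to each row:", labels)
print("cluster centers:")
print(kmeans.cluster_centers_)

# Mini-batch k-means has the same interface and is meant for bigger data sets
mini = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mini_labels = mini.labels_
```

The cluster numbers themselves are arbitrary; what matters is which rows end up grouped together.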

Are you trying to determine whether patients have cancer (yes/no) based on a list of questions they are asked at their doctor's office? Maybe the way for you to go is classification, and you may find a random forest or SVM algorithm very powerful.
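For a feel of what a yes/no classifier looks like, here's a minimal sketch using a random forest. I don't have questionnaire data to share, so as a stand-in this uses the breast cancer data set that ships with scikit-learn (tumor measurements, not doctor's office questions). The 75/25 split and 100 trees are just reasonable starting values, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Labeled data: each row is a patient, y is the known yes/no style result
X, y = load_breast_cancer(return_X_y=True)

# Hold back 25% of the patients so the model is scored on data it never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Fraction of held-back patients the model classified correctly
accuracy = clf.score(X_test, y_test)
print(f"accuracy on unseen patients: {accuracy:.2f}")
```

Swapping in an SVM would mean changing the model line to sklearn.svm.SVC; the fit and score steps stay the same.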

In my day-to-day job, I'm currently working with a random forest regression model. I have a supervised problem (labeled data, woo!) and I'm trying to predict the future based on a long list of 'features' (i.e. the data columns). I'll talk more about this method in detail later... but that's another post.

Why are people so obsessed with Machine Learning? 

We have so much data now
(thank you PBS for the graphic!)
Machine learning is so powerful it sometimes actually blows my mind. No, it's not a crystal ball and ML experts are not fortune tellers. However, machine learning does take a lot of the burden off of humans (read: prone to error) and works through data at an incredibly fast rate to give you a really impressive result.

Are you Netflix trying to build a recommendation engine to get people to watch more shows? Are you trying to determine if people who have certain symptoms have a disease? Are you working with terabytes of data and trying to get something meaningful from it? Are you trying to determine if people will buy groceries based on what color the box of cereal is? Or things like this - robo pharmacies? Machine learning can give you an answer and even tell you how much to trust it, with things like feature ranking (which inputs mattered most) and the standard deviation of its predictions.
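Here's a small sketch of where the feature ranking and standard deviation can come from, using a random forest regressor in scikit-learn. The data is synthetic and built so that the first column matters most and the third is pure noise. Taking the standard deviation across the individual trees' predictions is one common way to get a spread; treat it as a rough indicator, not a formal confidence interval.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): the target depends
# strongly on column 0, weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Feature ranking: importances sum to 1, bigger means more influential
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]
print("features from most to least important:", ranking)

# Spread of the answer: ask every tree in the forest for its own prediction
new_point = np.array([[0.5, 0.5, 0.5]])
per_tree = np.array([tree.predict(new_point)[0] for tree in model.estimators_])
prediction = per_tree.mean()
spread = per_tree.std()
print(f"prediction: {prediction:.2f} +/- {spread:.2f}")
```

If the ranking puts a column you expected to matter at the bottom, that's usually worth investigating before you trust the model.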

Even for simple problems, machine learning doesn't take that long to implement and can lead to long-lasting benefits down the road. If you want to be a data scientist, this is a tool you need to be comfortable with and have in your toolkit... hence scikit-learn in Python :)

How you can start Learning Machine Learning

First, look back at the road map graphic in section 2... that's where you start. You start with math and problem solving skills - what are you trying to answer, how much data do you have, and what would you consider a successful result? I've always used machine learning for predictive methods, so my models needed to be right to be successful. Maybe yours don't need to be as accurate... +/- 10% would be an improvement, for instance.
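One way to pin down "a successful result" is to write the tolerance into a quick check before you build anything. This is a minimal sketch; the fraction_within_tolerance helper and the numbers are made up for illustration.

```python
def fraction_within_tolerance(actual, predicted, tolerance=0.10):
    """Share of predictions that land within +/- tolerance of the true value."""
    hits = sum(
        abs(p - a) <= tolerance * abs(a)
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


# Made-up true values and model predictions
actual = [100, 200, 50, 80]
predicted = [95, 230, 52, 80]

share = fraction_within_tolerance(actual, predicted)
print(f"{share:.0%} of predictions are within +/- 10%")
```

Deciding up front what share counts as 'good enough' for your problem makes it much easier to tell whether a model is actually an improvement.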

Your next step is to use resources like this blog, Google, Stack Overflow, and your brain power. I still have trouble finding answers to questions I have when using ML methods... you will too. Ask questions and try implementing. In the next week, I am going to write a walkthrough for Random Forest. Stay tuned, and thanks for reading, all :)
