Data Science #4.1: Machine Learning
Machine Learning? |
The goal of this post/series is to get you to (C). You're not going to be an expert after reading this post, but I'm hoping it will give you a better understanding of what machine learning is, what types there are, why people are so obsessed with it, and how you can start learning too.
What is Machine Learning
Linear Regression. Now that's not scary. |
What Types are There?
How many different types of machine learning algorithms are there? A lot - because at its core, machine learning uses statistics to derive a trend from data (see image below; the green boxes are machine learners). But don't just think statistics = machine learning. For example, confidence intervals are not machine learning: they have no predictive or 'fill in' element. Keep that in mind when you hear machine learning... it's advanced (and effective) statistics with predictive power.
Thank you scikit-learn - this is the best machine learning 'road map' I've found to date (personally) |
OK, so now you want to try to use machine learning to solve your problem (you think) and you're completely overwhelmed by the road map above. The next goal is to think about what type of problem you have. This is where the terminology starts to bog down, so bear with me.
Random Forest, what a great name |
It's critical you decide what type of data you have on hand and what your goal is. Are you trying to find something cool about how your data clusters into groups (small boys and older women like cheese?) but have no idea what it may be? Go unlabeled (unsupervised): try clustering, and give k-means and mini-batch k-means a shot.
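To make the clustering idea concrete, here is a minimal sketch using scikit-learn's KMeans and MiniBatchKMeans. The data is synthetic (generated with make_blobs), and the choice of 3 clusters is an assumption for illustration; with real data you usually don't know the right number up front.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data: 300 points scattered around 3 centers.
# We throw away the true labels to mimic an unsupervised problem.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Standard k-means: uses the full data set on every iteration.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mini-batch k-means: updates centers from small random batches,
# which trades a little accuracy for speed on large data sets.
mini = MiniBatchKMeans(n_clusters=3, n_init=10, batch_size=100, random_state=0).fit(X)

print("k-means cluster for the first 10 points:", kmeans.labels_[:10])
print("mini-batch cluster centers:")
print(mini.cluster_centers_)
```

The two methods number their clusters arbitrarily, so cluster 0 from one may be cluster 2 from the other. Compare the centers, not the label numbers.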
Are you trying to determine whether patients have cancer (yes/no) based on a list of questions they are asked at their doctor's office? Maybe the way for you to go is classification, and you may find a random forest or SVM (support vector machine) algorithm very powerful.
In my job day to day, I'm currently working with a random forest regression model. I have a supervised problem (labeled data, woo!) and I'm trying to predict the future based on a long list of 'features' (i.e., the data columns). I'll talk more about this method in detail later... but that's another post.
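Here is a rough sketch of what that kind of yes/no classification could look like in scikit-learn. It uses the breast cancer data set that ships with the library as a stand-in (those features are lab measurements, not answers to doctor's office questions), and the model settings are just reasonable starting points, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Labeled data: each row is a patient, y is 0 or 1 (malignant / benign).
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the patients so we can test on data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random forest: many decision trees voting together.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
forest_accuracy = forest.score(X_test, y_test)

# SVM: sensitive to feature scale, so standardize the columns first.
svm = make_pipeline(StandardScaler(), SVC())
svm.fit(X_train, y_train)
svm_accuracy = svm.score(X_test, y_test)

print(f"random forest accuracy: {forest_accuracy:.3f}")
print(f"SVM accuracy:           {svm_accuracy:.3f}")
```

Accuracy is the simplest score to read, but for a medical yes/no problem you would also want to look at how many true cases the model misses.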
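As a taste of what a random forest regression looks like, here is a minimal sketch on synthetic data from make_regression. This is not my work model; the number of features, the noise level, and the number of trees are all made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic supervised problem: 8 feature columns, one numeric target.
X, y = make_regression(
    n_samples=500, n_features=8, n_informative=4, noise=10.0, random_state=0
)

# Train on 75% of the rows, keep 25% aside to check the predictions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Predict the held-out targets and report R^2 (1.0 would be a perfect fit).
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)

print("first 5 predictions:", predictions[:5])
print(f"R^2 on held-out data: {r2:.3f}")
```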
Why are people so obsessed with Machine Learning?
We have so much data now (thank you PBS for the graphic!) |
Are you Netflix trying to build a recommendation engine to get people to watch more shows? Are you trying to determine if people who have certain symptoms have a disease? Are you working with terabytes of data and trying to get something meaningful from it? Are you trying to determine if people will buy groceries based on what color the box of cereal is? Or things like this - robo pharmacies? Machine Learning can give you an answer and even tell you how good it is with things like feature ranking and standard deviation built in.
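For a concrete look at feature ranking, here is a small sketch with a random forest in scikit-learn. The feature_importances_ attribute is built in; the standard deviation is computed by hand across the individual trees, which is one common way to gauge how stable each ranking is. The data and the feature names (feature_0, feature_1, ...) are synthetic placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 6 columns, only 2 of which actually drive the target.
X, y = make_regression(
    n_samples=500, n_features=6, n_informative=2, noise=5.0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical column names

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importance of each feature, averaged over all trees (sums to 1).
importances = forest.feature_importances_

# Spread of that importance from tree to tree: a big spread means
# the trees disagree about how much the feature matters.
spread = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

# Print the features from most to least important.
ranking = np.argsort(importances)[::-1]
for i in ranking:
    print(f"{names[i]}: {importances[i]:.3f} +/- {spread[i]:.3f}")
```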
Even for simple problems, Machine Learning doesn't take that long to implement and can lead to long-term benefits down the road. If you want to be a data scientist, this is a tool you need to be comfortable with and have in your toolkit... hence scikit-learn in Python :)
How you can start Learning Machine Learning
First, look back at the road map graphic in section 2 ('What Types are There?')... that's where you start. You start with math and problem solving skills - what are you trying to answer, how much data do you have, and what would you consider a successful result? I've always used Machine Learning for predictive methods, so my models needed to be right to be successful. Maybe yours don't need to be as accurate... getting within +/- 10% would be an improvement, for instance.
Your next step is to use resources like this blog, Google, Stack Overflow, and your brain power. I still have trouble finding answers to questions I have when using ML methods.... you will too. Ask questions and try implementing. In the next week, I am going to write a walkthrough for Random Forest. Stay tuned, and thanks for reading all :)
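One way to pin down "a successful result" before you build anything is to write the check as code. Below is a small sketch that measures what share of predictions land within 10% of the true value. The function name within_tolerance and the numbers are made up for illustration, and it assumes none of the true values are zero.

```python
import numpy as np

def within_tolerance(y_true, y_pred, tolerance=0.10):
    """Return the fraction of predictions within +/- tolerance of the truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Relative error per prediction; assumes no true value is zero.
    relative_error = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.mean(relative_error <= tolerance))

# Illustrative numbers only: relative errors are 5%, 15%, 4%, and 1.25%.
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [105.0, 170.0, 52.0, 79.0]

share = within_tolerance(actual, predicted)
print(f"{share:.0%} of predictions are within 10%")  # 3 of 4 -> 75%
```

Deciding on a number like this up front makes it much easier to tell whether a new model is actually an improvement.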