Data Science #4.2: Intro Machine Learning Problem - Random Forest
It can be overwhelming to get your feet wet with Machine Learning, between determining what algorithm you might want to try and even finding a data set to play with. Hopefully this post will make you feel better about giving ML a shot in Python and maybe even applying it to your own problems.
Imagine if buying a Titanic ticket had come with a 'Chance of Survival'
Things you'll need:
Titanic Data Set
Python
scikit-learn and the SciPy stack installed (no idea what I'm talking about? See https://www.scipy.org/install.html)
Some time, patience, and a drive to predict the future
Let's get cracking!
Goal: Could we have determined Titanic passenger survival based on information known at the time of departure?
First things first, make sure you have the data set in an accessible location and let's open it up into a pandas dataframe.
import pandas as pd
df = pd.read_csv('Titanic.csv')
Sweet! Now let's check out what's in it:
df.columns.values
Out[1]: array(['Unnamed: 0', 'Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode'], dtype=object)
Those are the variables we're going to use to try to predict who survives and who doesn't. Well, except for the 'Survived' column - that's our solution. The 'Unnamed: 0' column is just the index from the file, which pandas read in without knowing what to call it. Open up this data set and take a peek - there are a lot of gaps and holes. We're going to keep things simple (note: this is not always the best answer!!) and remove all rows in our data frame that have NaNs.
# first, let's clean up our data set
df = df.dropna()
Great! We reduced our data set from 1312 rows to 756 rows (yikes!). You can interpolate or impute to fill some of these NaNs, but let's start easy, knowing we're sacrificing accuracy. The next step is to reduce the number of columns: we set aside our 'Survived' column and collapse the duplicates 'Sex' and 'SexCode' down to just 'SexCode', which is a numerical value. We also drop the 'Name' column because it's a text column and (in this case) will not enter our algorithm well (plus, a little intuition: should name length matter? Maybe... but that'd be weird). The leftover 'Unnamed: 0' index column goes too.
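As an aside on filling NaNs instead of dropping them, here is a minimal sketch of one simple option: replacing missing ages with the median age. This is not what the post does; the tiny `toy` frame and its values are made up for illustration and only borrow the column names from the data set.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that reuses the post's column names (values are invented).
toy = pd.DataFrame({
    "PClass": ["1st", "2nd", "3rd", "3rd"],
    "Age": [29.0, np.nan, 40.0, np.nan],
    "SexCode": [1, 0, 0, 1],
    "Survived": [1, 0, 0, 1],
})

# Option used in the post: drop every row that contains a NaN.
dropped = toy.dropna()

# Alternative: keep every row and fill missing ages with the median age.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

print(len(dropped), "rows after dropna,", len(filled), "rows after filling")
```

Filling keeps more rows to train on, but every filled age is a guess, so it is worth comparing both approaches on the test set rather than assuming one is better.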
# then let's put the solution off to the side
# and drop the labeled 'male/female' column in favor of the
# sex code one (female = 1, male = 0)
SolutionCol = df['Survived']
df = df.drop(['Survived', 'Sex', 'Name', 'Unnamed: 0'], axis=1)
Unfortunately data cleaning is just part of the job. We also have to change the PClass column from '1st, 2nd, etc.' to '1, 2, etc.'
# then we have to fix PClass from 1st, 2nd, etc. to just 1, 2, 3
# once again, we need numerical values
df['PClass'] = df['PClass'].map(lambda x: str(x)[0])
Now we're down to 756 rows and 3 columns. But before we build our model, it's imperative that we split our data into train and test sets. The idea: how can you tell someone how good your model is if you trained it on all the data? If the model is genuinely predictive, you should be able to train on some of the data and predict the rest. However, it's a fine line... holding out too much for testing means that you have less to train on. A good place to start is 3/4 train, 1/4 test.
# now split up train-test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df, SolutionCol, test_size=0.25, random_state=1)
The random_state argument lets you reproduce this exact split again if you so choose. To really check whether our model is consistent, we'll try different states. Now... let's actually set up the model using scikit-learn in Python. We're going to use a classifier here because we want a yes/no answer - will a person survive or not? We don't need a regressor - a "60% dead" person isn't actually informative for us. I chose Random Forest because it's one of my favorites, but maybe we'll play around with this a bit more with another classifier in the future.
# set up the model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
It's that easy, really. I mean, the devil is in the details, but the initial set up is that ridiculously easy - you just built a high powered statistical model. BAM!
Next step? Score it! How good is our model! Let's use that test data set!
# score the model
# how accurate was the model overall?
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, predictions)
# what was its precision?
from sklearn.metrics import precision_score
precision = precision_score(y_test, predictions)
# what are the TN, FP, FN, TP counts? (true negative, false positive, etc.)
from sklearn.metrics import confusion_matrix
confusionMatrix = confusion_matrix(y_test, predictions)
Results????
accuracy is: 79%
precision is: 0.83606557377
confusion matrix:
[[99 10]
 [29 51]]
scikit-learn puts the actual classes in rows and the predicted classes in columns, so with 0 = died and 1 = survived the confusion matrix should be read, left to right and top to bottom, as "Correctly Predicted False, Incorrectly True, Incorrectly False, Correctly True". We're doing pretty well for a first run! However, I'd personally never submit this to Kaggle... we can tune our features to slowly increase our accuracy and improve our model. But this is a good start :)
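To make the reading order concrete, here is a small sketch that recomputes the scores from the four cells of the first confusion matrix. It assumes scikit-learn's layout for 0/1 labels (rows are actual classes, columns are predicted classes, so the cells are TN, FP, FN, TP). Recall is not reported in the post; it is included only as one more number the same four cells give you.

```python
# First confusion matrix from the post, in scikit-learn's layout:
# rows = actual class (0 = died, 1 = survived), columns = predicted class.
matrix = [[99, 10],
          [29, 51]]

(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all passengers predicted correctly
precision = tp / (tp + fp)     # of those predicted to survive, how many did
recall = tp / (tp + fn)        # of those who survived, how many we caught

print("test rows:", total)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

The precision computed this way, 51 / 61, matches the 0.836... reported above, which is a quick way to confirm which corner of the matrix holds the true positives.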
Speaking of which - what mattered the most?
# what were the most important features?
list(zip(df.columns.values, clf.feature_importances_))

[('PClass', 0.18), ('Age', 0.44), ('SexCode', 0.38)]
Age was the biggest determiner of survival in the Titanic accident, followed by male/female, and then your passenger class. Fascinating, right? Based on my 4th grade readings about the Titanic, I would have thought it was all the rich who survived...
Just as a quick test, remember how we set the random state = 1 for our train/test random splitting? Let's set it to 2 and see how the accuracy changes:
accuracy is: 75%
precision is: 0.686567164179
confusion matrix:
[[97 21]
 [25 46]]
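Two random states are a small sample, so one way to get a better feel for the spread is to repeat the split over several seeds. The sketch below is only an illustration: `accuracy_over_splits` is a hypothetical helper, and the data is a synthetic stand-in from `make_classification` (same shape as the cleaned set, 756 rows and 3 features) so that it runs without Titanic.csv. The forest's own `random_state` is fixed here, which the post's code does not do, so that only the split changes between runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_over_splits(X, y, seeds):
    """Hypothetical helper: refit on a fresh 75/25 split for each seed."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        # Fix the forest's seed so the split is the only thing that varies.
        clf = RandomForestClassifier(random_state=0)
        clf.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, clf.predict(X_test)))
    return scores


# Synthetic stand-in data, NOT the Titanic set.
X, y = make_classification(n_samples=756, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

scores = accuracy_over_splits(X, y, seeds=range(1, 6))
print("min:", min(scores), "max:", max(scores))
```

With the real data you would pass `df` and `SolutionCol` in place of `X` and `y`. If the gap between the minimum and maximum is several points wide, a single train/test split is not enough to judge the model.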
So we have some work to do on this model. Let's call this a blog post, but work on tuning our model next time.
Want to see the script? Don't Hesitate to Create A New Branch :)
and the next in this series is here: http://www.innocentheroine.com/2017/01/data-science-43-improving-random-forest.html