
Data Science #4.2: Intro Machine Learning Problem - Random Forest


It can be overwhelming to get your feet wet with Machine Learning, between deciding which algorithm to try and even finding a data set to play with. Hopefully this post will make you feel better about giving ML a shot in Python, and maybe even about applying it to your own problems.

Imagine if buying a Titanic ticket had come with a 'Chance of Survival'
I'm going to go ahead and use a publicly available data set - The Titanic Survivors - that's popular because of the ongoing Kaggle competition (which is a GREAT place to start if you're looking for problems to work on while applying for jobs, etc.).

Things you'll need:

Titanic Data Set

Python

scikit-learn and the SciPy stack installed (no idea what I'm talking about? https://www.scipy.org/install.html)

Some time, patience, and a drive to predict the future

Let's get cracking!

Goal: Could we have determined Titanic passenger survival based on information known at the time of departure? 


First things first, make sure you have the data set in an accessible location and let's open it up into a pandas dataframe.

import pandas as pd

df = pd.read_csv('Titanic.csv')

Sweet! Now let's check out what's in it:

df.columns.values
Out[1]: array(['Unnamed: 0', 'Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode'], dtype=object)

Those are the variables we're going to use to predict who survives and who doesn't. Well, except for the 'Survived' column - that's our solution. The 'Unnamed: 0' column is just where pandas read the file in and didn't know what to call the index. Open up this data set and take a peek - there are a lot of gaps and holes. We're going to keep things simple (note: this is not always the best answer!!) and remove all rows in our data frame that contain NaNs.

# first, let's clean up our data set
df = df.dropna()

Great! We reduced our data set from 1,312 rows to 756 rows (yikes!). You could interpolate or impute to fill some of these NaNs instead, but let's start easy, knowing we're sacrificing accuracy.
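
For the curious, here's roughly what the imputation route could look like. This is a hedged sketch, not what we do in this post, and it assumes 'Age' is where most of the NaNs live in this file:

# a sketch of the alternative: fill missing ages with the median age
# instead of discarding those rows (assumption: 'Age' holds most NaNs)
df['Age'] = df['Age'].fillna(df['Age'].median())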

Next step is to reduce the number of columns: set the 'Survived' column off to the side (that's our solution) and collapse the duplicate 'Sex' and 'SexCode' columns down to just 'SexCode', which is already a numerical value. We also drop the 'Name' column because it's a text column and (in this case) won't enter our algorithm well (plus, a little intuition: should name length matter? Maybe... but that'd be weird).

# then let's put the solution off to the side
# and drop the labeled 'male/female' column in favor of the
# sex code one (female = 1, male = 0)
SolutionCol = df['Survived']
df = df.drop(['Survived', 'Sex', 'Name', 'Unnamed: 0'], axis=1)

Unfortunately, data cleaning is just part of the job. We also have to change the PClass column from '1st, 2nd, etc.' to '1, 2, etc.' - once again, we need numerical values.

# then we have to fix PClass from 1st, 2nd, etc. to just 1, 2, 3
# once again, we need numerical values
df['PClass'] = df['PClass'].map(lambda x: str(x)[0]).astype(int)
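
Mapping '1st' down to 1 treats class as an ordered number, which is reasonable for ticket classes. If you ever have categories with no natural order, a common alternative (not used in this post) is one-hot encoding; a quick sketch with pandas:

# hypothetical alternative: one-hot encode the raw PClass column instead,
# creating binary columns like PClass_1st, PClass_2nd, PClass_3rd
df = pd.get_dummies(df, columns=['PClass'])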

Now we're down to 756 rows and 3 columns. But before we build our model, it's imperative that we split our data into train/test sets. Think about it: how can you tell someone how good your model is if you trained on all the data? If the model is genuinely predictive, you should be able to train on some of the data and predict the rest. It's a fine line, though... holding out too much for testing means you have less to train on. A good place to start is 3/4 train, 1/4 test.

# now split up train-test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
     df, SolutionCol, test_size=0.25, random_state=1)

The random_state variable allows you to recreate this exact split again if you so choose. To verify whether our model is consistent, we'll try different states later. Now... let's actually set up the model using scikit-learn. We're going to use a classifier here because we want a yes/no answer - will a person survive or not? We don't need a regressor; a '60% dead' person isn't actually informative for us. I chose Random Forest because it's one of my favorites, but maybe we'll play around with another classifier in the future.

# setup the model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
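
Two quick asides. First, the forest itself is randomized (it bootstraps rows and samples features), so repeated fits can give slightly different scores unless you pin the classifier's own random_state. Second, if you actually do want a survival probability rather than a hard yes/no, classifiers expose predict_proba. A small sketch, with n_estimators=100 as an untuned assumption:

# optional: pin the forest's internal randomness so repeat runs match
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)

# per-passenger probabilities; columns are [P(died), P(survived)]
probabilities = clf.predict_proba(X_test)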

It's that easy, really. I mean, the devil is in the details, but the initial setup is that ridiculously easy - you just built a high-powered statistical model. BAM!

Next step? Score it! How good is our model? Let's use that test data set!

# score the model
# how accurate was the model overall
from sklearn.metrics import accuracy_score
acc=accuracy_score(y_test, predictions)

# what was its precision
from sklearn.metrics import precision_score
precision=precision_score(y_test, predictions)

# what is the TP, FP, FN, TN rate? (true positive, false positive, etc.) 
from sklearn.metrics import confusion_matrix
confusionMatrix=confusion_matrix(y_test, predictions)

Results????

accuracy is: 79%
precision is: 0.83606557377
confusion matrix: [[99 10]
 [29 51]]

The confusion matrix should be read as [[true negatives, false positives], [false negatives, true positives]]: scikit-learn puts the actual classes on the rows and the predicted classes on the columns, in label order (0 = died, 1 = survived). So 99 passengers were correctly predicted to die, 10 were predicted to survive but died, 29 were predicted to die but survived, and 51 were correctly predicted to survive. We're doing pretty well for a first run! However, I'd personally never submit this to Kaggle... we can tune our features to slowly increase our accuracy and improve our model. But this is a good start :)
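
If you'd rather not memorize that layout, here's a small sketch (not part of the original script) that unpacks the counts into named variables:

# unpack the confusion matrix into named counts;
# for labels 0/1, sklearn orders it [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)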

Back to those features, though - what mattered the most?

# what were the most important features?
list(zip(df.columns.values, clf.feature_importances_))

[('PClass', 0.18), ('Age', 0.44), ('SexCode', 0.38)]

Age was the biggest determiner of survival in the Titanic accident, followed by sex, and then fare class. Fascinating, right? Based on my 4th grade readings about the Titanic, I would have thought it was all the rich who survived...
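
With only three features the ranking is easy to eyeball, but on wider data sets it helps to sort them; a quick sketch:

# sort features by importance, biggest first
sorted(zip(df.columns.values, clf.feature_importances_),
       key=lambda pair: pair[1], reverse=True)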

Just as a quick test, remember how we set random_state=1 for our train/test split? Let's set it to 2 and see how the accuracy changes:


accuracy is: 75%
precision is: 0.686567164179
confusion matrix: [[97 21]
 [25 46]]

Uh oh: accuracy changed a bit, and precision changed quite a bit. In other words, our model is sensitive to which train/test split it gets, favoring different answers depending on the split. In this second run we had twice as many false positives - passengers predicted to survive who actually died. Depending on your intended outcome this trade-off can be good or bad, but swings this large from just changing the random state aren't great.
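
One standard remedy for this split-to-split jitter - not something we do in this post, but worth having in your back pocket - is k-fold cross-validation, which trains and scores the model on several different splits and averages the results. A minimal sketch using scikit-learn:

from sklearn.model_selection import cross_val_score

# score on 5 different splits; the mean is a steadier estimate and
# the spread shows how split-sensitive the model really is
scores = cross_val_score(RandomForestClassifier(random_state=1), df, SolutionCol, cv=5)
print(scores.mean(), scores.std())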

So we have some work to do on this model. Let's call this a blog post, but work on tuning our model next time.

Want to see the script? Don't Hesitate to Create A New Branch :)

and the next in this series is here: http://www.innocentheroine.com/2017/01/data-science-43-improving-random-forest.html
