Data Science #4.3: Improving the Random Forest

Continuing on from our initial Random Forest Model predicting Titanic survivors - you feel good! You feel like you made a pretty great model in < 50 lines of code. I mean, 75% accurate? Awesome!
However, your boss comes back and says no one is going to buy 75% accurate Titanic survival estimators... they want at least 90%. Ouch.

Now comes the hard part - model tuning to get marginal accuracy gains. With random forests in particular, the first thing you should consider when you go back to the drawing board is: is there a way to get more features (columns) into the RF model? Right now our Titanic model has a measly three - age, gender, and class. That's nothing. However, we're going to come back to this in a bit... let's say your boss wants a model built off just those three things, nothing else. Ugh.

Option 2: Add more data. Here, we can't really 'generate' more data using a Monte Carlo simulation, or go back in time and put more passengers on the Titanic. However...

Remember all those NaNs I threw out the first round? What if... we didn't throw them out? What if we made a 'guess' at the missing values by using the median value of that feature? So for age, let's say age spanned from 0 to 87 and the median was 34... it wouldn't be a bad guess to just put in 34 whenever we didn't have an age... or would it?

This is called imputation - filling missing values with a 'most likely' result. Let's try four cases - no imputation, imputation with the median, imputation with the mean, and imputation with the most frequent value of Age - and see how it comes out. We will put it in a loop and test each one 100 times for significance. The code snippet below shows how Imputer works, where strategy can be 'mean', 'median', or 'most_frequent'. We apply pd.to_numeric to the test and train sets because, without dropping the NaNs, there's a weird * in one of the columns (data cleaning, UGH!). Calling fit_transform applies the imputation.

  from sklearn.preprocessing import Imputer
  imp = Imputer(missing_values='NaN', strategy='median', axis=0)
  X_train = X_train.apply(pd.to_numeric, errors='coerce')
  X_test = X_test.apply(pd.to_numeric, errors='coerce')
  X_train = imp.fit_transform(X_train)  # learn the fill value on train, then fill
  X_test = imp.transform(X_test)        # reuse the same fill value on test
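The full comparison loop isn't shown above, so here's a minimal sketch of what "try each strategy, many trials" can look like. Note this uses SimpleImputer (the replacement for the old Imputer in modern scikit-learn) and a synthetic stand-in for the Titanic data - the exact columns, trial count, and survival rule here are illustrative assumptions, not the post's actual data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # modern replacement for Imputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the Titanic features: [age, sex code, class]
n = 400
age = rng.uniform(0, 87, n)
X = np.column_stack([age, rng.integers(0, 2, n), rng.integers(1, 4, n)]).astype(float)
y = ((X[:, 1] == 1) | (X[:, 0] < 16)).astype(int)  # toy 'women and children' rule
X[rng.random(n) < 0.2, 0] = np.nan                  # knock out ~20% of the ages

results = {}
for strategy in ['median', 'mean', 'most_frequent']:
    accs = []
    for trial in range(10):  # the post runs 100 trials; 10 keeps this sketch quick
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=trial)
        imp = SimpleImputer(strategy=strategy)
        X_tr_i = imp.fit_transform(X_tr)  # learn the fill value on train only...
        X_te_i = imp.transform(X_te)      # ...and reuse it on test
        clf = RandomForestClassifier(n_estimators=50, random_state=trial)
        clf.fit(X_tr_i, y_tr)
        accs.append(accuracy_score(y_te, clf.predict(X_te_i)))
    results[strategy] = (np.median(accs), np.std(accs))

for strategy, (med, sd) in results.items():
    print(f'{strategy}: {med:.3f} +/- {sd:.3f}')
```

The median-plus-standard-deviation summary per strategy is what feeds the plot titles below.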

Cool! So what do some of our results look like? (Caution! Boring graphs! But you need to be able to read boring graphs as a data scientist!)

Results of Imputation
What do these graphs mean? Blue line = precision outcomes for each trial; red line = accuracy. The title shows the median over the 100 samples, with the standard deviation following the +/- sign. A quick summary of what we learn from these plots: there is no meaningful statistical difference between any method of imputation and no imputation at all. Using the mean Age is usually more precise and accurate, but it's prone to extremely poor estimates, making it less reliable than just removing all the NaNs (and we have no way of knowing whether a model run is good or bad when making predictions until after the ship sinks...).

Using imputation? It doesn't work for us here - there's too much variation in Age, and it isn't centered around the mean, median, or most frequent value, which makes imputation a poor choice for our problem. Bummer.

Remember how I said earlier that the best way to improve Random Forest models is to add more features? Let's give that a shot. Since we have limited data, we have to work with what we have - and that requires creativity. Ready?

1st New Column: Let's see if Name Length actually matters...

df['NameLength']=df['Name'].map(lambda x: len(x))

2nd New Column: Let's see if the ages tend to fall into sets of 10 (so < 10 = 0, 10-20 = 1, 20-30 = 2, etc.). The meaningful interpretation would be that survival is determined more by life stage than by money or specific age.

df['AgeDecade']=df['Age'].map(lambda x: int(x/10.))
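As a side note, the same binning can be done without the lambda - a quick sketch using floor division, which gives the same bins as int(x/10.) for non-negative ages:

```python
import pandas as pd

age = pd.Series([4.0, 17.0, 34.0, 63.0])
# Floor division gives the same bins as int(x/10.) for non-negative ages
age_decade = (age // 10).astype(int)
print(age_decade.tolist())  # → [0, 1, 3, 6]
```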

3rd New Column: Look at the Name data. Most of the names have some sort of identifier, like "Miss" or "Mr".

0                           Allen, Miss Elisabeth Walton
1                            Allison, Miss Helen Loraine
2                    Allison, Mr Hudson Joshua Creighton
3          Allison, Mrs Hudson JC (Bessie Waldo Daniels)
4                          Allison, Master Hudson Trevor

Now this is a bit tricky. We're going to do some regex-style parsing, pulling substrings out by certain markers. We can't simply search for "Miss", "Mr", or "Ms" because, as you can see above, there are stray ones like Master or Colonel or Madame, ad infinitum. So instead, we want to pull the substring that has a ', ' at the beginning and a single blank space at the end.

char1=', '; char2=' ' # based on how the excel table sets up the names
df['prefix']=df['Name'].map(lambda mystr: mystr[mystr.find(char1)+1 : \
                                                mystr.find(char2, mystr.find(char1)+2)])
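To sanity-check the slice bounds, here's the expression run on a single sample name (self-contained; `name` is just one row lifted from the table above):

```python
name = 'Allison, Mr Hudson Joshua Creighton'
char1, char2 = ', ', ' '
i = name.find(char1)                              # index of the comma
prefix = name[i + 1 : name.find(char2, i + 2)]    # slice up to the *second* space
print(repr(prefix))                               # → ' Mr' (leading space and all)
```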

Looks complicated, but it's making sure we're getting the right second ' ' instead of the first one after the comma. Unfortunately, our work isn't done yet. We can't put strings into the RF model (D'OH), so we need to use LabelEncoder, which assigns a unique number to each string.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['prefixNumber'] = le.fit_transform(df['prefix'])

But that's not too bad, now our prefixNumber column looks like:

0       15
1       15
2       17
3       18
4       14
5       17
6       15
7       17
8       18
9       17

But wait, why are the numbers so high (15, 17, 18?!) - there shouldn't be that many prefixes. A second look (always trust your instincts!!) reveals that some of the names have no prefix, so the slice returns part of their name instead.

In [105]: df['prefix'].unique()
array([' Miss', ' Mr', ' Mrs', ' Master', ' Colonel', ' Dr', ' Major',
       ' Captain', ' Madame', ' Sir', ' Lady', ' Jonkheer', ' the', ' Col',
       ' Ms', '', ' Mlle', ' Rev', 'Jacobsohn', ' Ernst', ' Thomas',
       ' Hilda', ' Jenny', ' Oscar', ' Nils', ' Eino', ' Albert', ' W',
       ' Richard', ' Nikolai', ' Simon', ' William', 'Seman'], dtype=object)

However, if you do a count on each of these (using collections.Counter), the legitimate prefixes like 'Lady' and 'Madame' unfortunately only show up once - just like the mistaken 'William'. We have a choice: exclude the Ladys and Madames along with the mistaken names, or just leave them all in. I opted to leave them, because at 18 mistaken names total, that's less than 1% of our data. Small potatoes.
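The count itself is a one-liner - here's a sketch with a toy list standing in for the real df['prefix'] column (the values are illustrative, not the full dataset):

```python
from collections import Counter

# Toy list standing in for df['prefix'] (leading spaces kept, as the slicing produces)
prefixes = [' Miss', ' Miss', ' Mr', ' Mr', ' Mrs', ' Master', ' Lady', 'William']
counts = Counter(prefixes)

# A pure count-based filter flags everything that appears once...
rare = [p for p, c in counts.items() if c == 1]
# ...which catches the mistaken 'William' but also the legitimate ' Lady' -
# exactly the trade-off described above.
print(rare)
```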

So let's do a run with random state = 1 and see what our outcome is (last list is feature ranking again).

accuracy is: 77%
precision is: 0.787878787879
confusion matrix: [[95 14]
 [28 52]]
[('PClass', 0.12466108587253713), ('Age', 0.25900670307675605),
 ('SexCode', 0.1780132637402112), ('NameLength', 0.2806207705520482), 
('AgeDecade', 0.066477109664273837), ('prefixNumber', 0.091221067094173616)]

So we find NameLength matters more than makes any of us comfortable (0.28), prefixNumber kind of matters (0.09), but AgeDecade matters even less (0.06). In fact, NameLength is our most important variable (whaaaa?).

So, last step in this blog post - removing poorly performing features. We added features, we ranked them, and we found the worst (AgeDecade). So after all that hard work, let's try removing it.
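Mechanically, dropping a feature just means selecting a reduced column set before refitting - a minimal sketch, with a two-row toy frame standing in for the real feature matrix (column names from above, values made up):

```python
import pandas as pd

# Toy frame standing in for the real feature matrix (column names from the post)
df = pd.DataFrame({
    'PClass': [1, 3], 'Age': [29.0, 2.0], 'SexCode': [1, 0],
    'NameLength': [28, 27], 'AgeDecade': [2, 0], 'prefixNumber': [15, 14],
})
features = ['PClass', 'Age', 'SexCode', 'NameLength', 'AgeDecade', 'prefixNumber']
features.remove('AgeDecade')  # drop the worst-ranked feature
X = df[features]              # refit the forest on this reduced matrix
print(list(X.columns))
```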

accuracy is: 79%
precision is: 0.815384615385
confusion matrix: [[97 12]
 [27 53]]
[('PClass', 0.12320539747721723), ('Age', 0.315289385945629), 
('SexCode', 0.15586239021703155), ('NameLength', 0.24434927387294786), 
('prefixNumber', 0.16129355248717442)]

Hey! It got better! What if... we removed PClass?

accuracy is: 76%
precision is: 0.78125
confusion matrix: [[95 14]
 [30 50]]
[('Age', 0.38860506245587562), ('SexCode', 0.1806853871062975), 
('NameLength', 0.3081195913340159), ('prefixNumber', 0.12258995910381092)]

Oh no! It got worse (in this random-state case). We're walking a fine line here: we want as many features as possible, but we also don't want poor performers dragging our model down, both in runtime and potentially in accuracy.

We're still not at 90% accuracy though :( Let's call this a blog post and continue working on it next week.

Code is always available on Github:

This is v2, to keep v1 alive for readers of the first post.
