Colinearity in Random Forests - Does it Matter?

colinearity_in_RF

Colinearity -- why do we care?

Colinearity is when your features match up with each other too well... for example, precipitation with cloudiness or fiber with stool consistency. However, sometimes in machine learning models, colinearity is a bad thing. It can over bias a model towards like features without appropriately weighting towards the not-like features. So for example, in multi-feature regression, this can lead to problems
However, we're going to look at why colinearity doesn't matter for Random Forests. If you want an explanation of mathematically why it doesn't line up, check here: https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning/168631. We're going to show the evidence for this though -- why doesn't it matter as far as results in a Random Forest Regression model.
Let's get cracking!
In [3]:
# standard import list 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# read in data, we are using motgage data available from: 
# https://catalog.data.gov/dataset/state-of-new-york-mortgage-agency-sonyma-loans-purchased-beginning-2004
df=pd.read_csv('State_of_New_York_Mortgage_Agency.csv')

Sweet, we have our data

Now let's do some feature fixing. The data file we loaded in has a bunch of text based features (i.e Property type) which we need to modify. We're going to reduce the number of features from 14 -> 12 and do some text replacement.
In [45]:
dfmod=df[['Original Loan Amount', 'Purchase Year', 'Original Loan To Value', 'SONYMA DPAL/CCAL Amount', 'Number of Units', \
'Household Size ', 'Property Type', 'County', 'Housing Type', 'Bond Series', 'Original Term']]

# turn off warnings on the slice operation we do below. 
# This is a unique factorize problem because it returns a tuple, sigh
# https://stackoverflow.com/questions/45080400/dealing-with-pandas-settingwithcopywarning-without-indexer
pd.options.mode.chained_assignment = None 

# factorize changes features from like 'condo' and 'house' to numeric (1, 2, etc.) so our model can handle it
stacked = dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']].stack()
dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']] = pd.Series(stacked.factorize()[0], \
                                                                              index=stacked.index).unstack()

# use regex replace to fix some of the columns that have partial numeric, partial text values
dfmod=dfmod.replace('[\$,]', '', regex=True)
dfmod=dfmod.replace('[\%,]', '', regex=True)
dfmod=dfmod.replace('Family', '', regex=True)

# need to convert to float
dfmod=dfmod.astype(float)

Let's see what the wild data's natural feature correlations are

In [47]:
# test for correlations 
corrDF=dfmod.corr()
corrDF
Out[47]:
Original Loan Amount Purchase Year Original Loan To Value SONYMA DPAL/CCAL Amount Number of Units Household Size Property Type County Housing Type Bond Series Original Term
Original Loan Amount 1.000000 0.337831 -0.056902 0.662054 0.112723 0.238369 0.101085 0.232890 0.124133 0.325947 0.184459
Purchase Year 0.337831 1.000000 -0.152347 -0.062682 -0.005365 0.073343 0.149763 0.105100 0.113273 0.922574 0.058512
Original Loan To Value -0.056902 -0.152347 1.000000 -0.091755 0.028671 -0.033281 -0.294189 -0.189966 -0.294904 -0.167013 -0.002098
SONYMA DPAL/CCAL Amount 0.662054 -0.062682 -0.091755 1.000000 0.050282 0.202176 0.025931 0.160047 0.152185 -0.078872 0.199689
Number of Units 0.112723 -0.005365 0.028671 0.050282 1.000000 -0.004223 -0.003898 -0.027416 0.013539 -0.006489 -0.002910
Household Size 0.238369 0.073343 -0.033281 0.202176 -0.004223 1.000000 -0.043792 0.107912 0.085088 0.073733 0.076269
Property Type 0.101085 0.149763 -0.294189 0.025931 -0.003898 -0.043792 1.000000 0.197686 0.224163 0.158472 0.030414
County 0.232890 0.105100 -0.189966 0.160047 -0.027416 0.107912 0.197686 1.000000 0.174262 0.115525 0.050321
Housing Type 0.124133 0.113273 -0.294904 0.152185 0.013539 0.085088 0.224163 0.174262 1.000000 0.134284 0.046703
Bond Series 0.325947 0.922574 -0.167013 -0.078872 -0.006489 0.073733 0.158472 0.115525 0.134284 1.000000 0.054699
Original Term 0.184459 0.058512 -0.002098 0.199689 -0.002910 0.076269 0.030414 0.050321 0.046703 0.054699 1.000000
The natural strongest correlation lies between 'Original Loan Amount' and 'SONYMA DPAL/CCAL Amount' at 0.66. This actually isn't that strong of a correlation

So let's make some fake data that strongly correlates

we're going to create 'Grandmas Loan Agency' which will have an extremely strong correlation with 'SONYMA DPAL/CCAL Amount', a 0.993 correlation to be precise!
In [48]:
randoms = np.linspace(0.9, 1.1, len(dfmod))
dfmod['Grandmas Loan Agency']=dfmod['SONYMA DPAL/CCAL Amount']*randoms
corrDF=dfmod.corr()
corrDF
Out[48]:
Original Loan Amount Purchase Year Original Loan To Value SONYMA DPAL/CCAL Amount Number of Units Household Size Property Type County Housing Type Bond Series Original Term Grandmas Loan Agency
Original Loan Amount 1.000000 0.337831 -0.056902 0.662054 0.112723 0.238369 0.101085 0.232890 0.124133 0.325947 0.184459 0.683658
Purchase Year 0.337831 1.000000 -0.152347 -0.062682 -0.005365 0.073343 0.149763 0.105100 0.113273 0.922574 0.058512 0.030616
Original Loan To Value -0.056902 -0.152347 1.000000 -0.091755 0.028671 -0.033281 -0.294189 -0.189966 -0.294904 -0.167013 -0.002098 -0.105809
SONYMA DPAL/CCAL Amount 0.662054 -0.062682 -0.091755 1.000000 0.050282 0.202176 0.025931 0.160047 0.152185 -0.078872 0.199689 0.993475
Number of Units 0.112723 -0.005365 0.028671 0.050282 1.000000 -0.004223 -0.003898 -0.027416 0.013539 -0.006489 -0.002910 0.048050
Household Size 0.238369 0.073343 -0.033281 0.202176 -0.004223 1.000000 -0.043792 0.107912 0.085088 0.073733 0.076269 0.207818
Property Type 0.101085 0.149763 -0.294189 0.025931 -0.003898 -0.043792 1.000000 0.197686 0.224163 0.158472 0.030414 0.038477
County 0.232890 0.105100 -0.189966 0.160047 -0.027416 0.107912 0.197686 1.000000 0.174262 0.115525 0.050321 0.167676
Housing Type 0.124133 0.113273 -0.294904 0.152185 0.013539 0.085088 0.224163 0.174262 1.000000 0.134284 0.046703 0.167975
Bond Series 0.325947 0.922574 -0.167013 -0.078872 -0.006489 0.073733 0.158472 0.115525 0.134284 1.000000 0.054699 0.009896
Original Term 0.184459 0.058512 -0.002098 0.199689 -0.002910 0.076269 0.030414 0.050321 0.046703 0.054699 1.000000 0.207590
Grandmas Loan Agency 0.683658 0.030616 -0.105809 0.993475 0.048050 0.207818 0.038477 0.167676 0.167975 0.009896 0.207590 1.000000

Let's build a random forest model now

make a function for ease of use
In [97]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split

def ourModel(data, result):  
    # inputs
    # data = pandas data frame (x)
    # results = column of desired result (y)
    
    
    # split the test - train set 
    X_train, X_test, y_train, y_test = train_test_split(
    data , result , test_size=0.25, random_state=1)

    # setup the model
    clf = RandomForestRegressor(n_estimators=100, n_jobs=4, oob_score =True)
    clf.fit(X_train, y_train)
    predictions=clf.predict(X_test)
    print('r2: ' + str(metrics.r2_score(predictions, y_test)))
    print('mse: '+ str(metrics.mean_squared_error(predictions, y_test)))
    
    # feature importance
    importances=clf.feature_importances_
    indices = np.argsort(importances)
    fp=zip(data.columns.values[indices], importances[indices])
    
    return(fp)
    

Run the model

one for the non-fake data and one for the fake data included set we are going to use r^2 as our scoring metric. If we were being more precise we could use a variation of MSE or percent accuracy. Regression can be tough to score, but r^2 is a good high level scoring metric for us.
In [98]:
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA')
fp_fake=ourModel(data_fake, result)

data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA")
fp_nofake=ourModel(data_nofake, result)
WITH MADE UP DATA
r2: 0.878783708526
mse: 646230597.915

WITHOUT MADE UP DATA
r2: 0.878950601915
mse: 642225808.702
Wow, almost identical. Using our not made up data, we do marginally better. Whew.

Just for fun, let's engineer a few features and see if we can improve our model a bit more.

In [99]:
# let's add some financial data
# our purchase years range from 2004 to 2016. Let's get the average mortage rate in those years
# googled and found at : http://www.freddiemac.com/pmms/pmms30.html
years=np.linspace(2004., 2016., 13)
mort30=np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65])
dfmod['mort']=[mort30[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]

#
# what else can we add? Maybe how many houses were purchased in that year 
# from https://www.statista.com/statistics/219963/number-of-us-house-sales/
housesBought=np.array([1203, 1283, 1051, 776, 485, 375, 323, 306, 368, 429, 437, 501, 560])*1000.
dfmod['housesBought']=[housesBought[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]

#  and just because we don't want all our new data depending on year, let's do one about
#  expected wealth by family size in NY
#  source: https://www.justice.gov/ust/eo/bapcpa/20130501/bci_data/median_income_table.htm
#  assume > 4 is = 4

# let's make a "wealthy vs Poor" category
dfmod['income']=[0 if dfmod['Household Size '].iloc[x] < 3 or dfmod['Household Size '].iloc[x]\
    > 4 else 1 for x in range(len(dfmod['Household Size '])) ]

#
# run the model again
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)

print('WITH MADE UP DATA, Round 2!')
fp_fake=ourModel(data_fake, result)

data_fake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA, Round 2!")
fp_nofake=ourModel(data_nofake, result)

# feature ranking for importances
print('\nFeatures in order of importance for fake:')
print(list(fp_fake))

print('\nFeatures in order of importance for NOT fake:')
print(list(fp_nofake))
WITH MADE UP DATA, Round 2!
r2: 0.878659195663
mse: 646008692.669

WITHOUT MADE UP DATA, Round 2!
r2: 0.879640201002
mse: 638372654.494

Features in order of importance for fake:
[('Original Term', 0.0012119048726738403), ('income', 0.0018557408905802973), ('housesBought', 0.0034072650332360498), ('Number of Units', 0.0045150274104016012), ('Housing Type', 0.005621309275188436), ('Household Size ', 0.0084531725849136714), ('Purchase Year', 0.010176507998518845), ('Bond Series', 0.017687347870597322), ('mort', 0.017756245823941391), ('Property Type', 0.021857603293443457), ('Original Loan To Value', 0.060640951153870012), ('County', 0.060833137290442242), ('SONYMA DPAL/CCAL Amount', 0.08119494524013042), ('Grandmas Loan Agency', 0.7047888412620622)]

Features in order of importance for NOT fake:
[('Original Term', 0.0012515854598453282), ('income', 0.002121745253540943), ('Number of Units', 0.0050243316634641629), ('housesBought', 0.0052015698975304784), ('Housing Type', 0.0060353754419427055), ('Household Size ', 0.0087523972898092498), ('Purchase Year', 0.018687222981189733), ('mort', 0.021287665266029609), ('Bond Series', 0.021917663768452722), ('Property Type', 0.022181409510731089), ('Original Loan To Value', 0.060030462895934916), ('County', 0.063047889030636628), ('SONYMA DPAL/CCAL Amount', 0.7644606815408922)]

The best model overall: Not made up data with additional features (lowest MSE)

Why? Because it combines more features, some relevant (i.e. mortgage interest rate) without data based in reality.
However, notice that between the fake and not-fake data, the heaviest weight it still on Grandma's Loan Agency and SONYMA. Even though it splits the weight in the fake data inclusion, it still is heavily basing it on one feature.
What does this mean?

Random Forest takes into account colinearity in tree splitting -- it's trying to get us the best result

Summary

  • We found that our best model was one that used many features and didn't have fake data
  • Colinearity didn't matter - the RF model was making the splits based on what worked and not necessarily trying to balance features
  • Always check to be sure!!
Clone this notebook at: https://github.com/loisks317/TheDataVolcano/tree/master/Colinearity to play with it more yourself. Try different random states and engineering your own features to see if you can do better!

Lo

No comments:

Instagram