Collinearity in Random Forests - Does it Matter?
Collinearity -- why do we care?¶
Collinearity is when two or more of your features track each other too closely -- for example, precipitation and cloudiness, or fiber intake and stool consistency. In some machine learning models, collinearity is a bad thing: it can bias the model toward a cluster of similar features without appropriately weighting the dissimilar ones. In multi-feature linear regression, for example, this can lead to unstable coefficient estimates. Here, though, we're going to look at why collinearity doesn't matter for Random Forests. If you want a mathematical explanation of why, check here: https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning/168631. We're going to show the empirical evidence -- why it doesn't matter for the results of a Random Forest regression model.
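To make that regression claim concrete, here's a minimal sketch on toy synthetic data (not the mortgage data we load below) showing how two nearly identical features destabilize linear regression coefficients even though the overall fit stays fine:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)   # nearly a copy of x1 -> highly collinear
y = 3 * x1 + rng.normal(scale=0.1, size=500)

# refit on two different random subsamples: the total effect (b1 + b2) stays near 3,
# but the individual coefficients can bounce around because the model can't tell x1 from x2
for seed in (1, 2):
    idx = np.random.RandomState(seed).choice(500, 400, replace=False)
    X = np.column_stack([x1[idx], x2[idx]])
    print(LinearRegression().fit(X, y[idx]).coef_)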
Let's get cracking!
In [3]:
# standard import list
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
# read in data, we are using motgage data available from:
# https://catalog.data.gov/dataset/state-of-new-york-mortgage-agency-sonyma-loans-purchased-beginning-2004
df=pd.read_csv('State_of_New_York_Mortgage_Agency.csv')
Sweet, we have our data¶
Now let's do some feature fixing. The data file we loaded has a bunch of text-based features (e.g., Property Type) that we need to modify. We're going to pare the columns down to 11 of the originals and do some text replacement so everything ends up numeric.
In [45]:
# keep the columns we want to model on
dfmod = df[['Original Loan Amount', 'Purchase Year', 'Original Loan To Value', 'SONYMA DPAL/CCAL Amount',
            'Number of Units', 'Household Size ', 'Property Type', 'County', 'Housing Type', 'Bond Series',
            'Original Term']]
# turn off warnings on the slice operation we do below.
# This is a unique factorize problem because it returns a tuple, sigh
# https://stackoverflow.com/questions/45080400/dealing-with-pandas-settingwithcopywarning-without-indexer
pd.options.mode.chained_assignment = None
# factorize maps text categories like 'Condo' and 'House' to numeric codes (0, 1, ...) so our model can handle them
stacked = dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']].stack()
dfmod[['Property Type', 'County', 'Housing Type', 'Bond Series']] = pd.Series(stacked.factorize()[0],
                                                                              index=stacked.index).unstack()
# use regex replacement to strip '$', '%', and the word 'Family' from columns
# that mix numbers and text
dfmod = dfmod.replace(r'[\$,]', '', regex=True)
dfmod = dfmod.replace(r'[\%,]', '', regex=True)
dfmod = dfmod.replace('Family', '', regex=True)
# convert everything to float
dfmod = dfmod.astype(float)
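If you haven't seen factorize before, here's a tiny standalone example of what it does (toy values, not from our dataset):

import pandas as pd

codes, uniques = pd.factorize(pd.Series(['Condo', 'House', 'Condo', 'Co-op']))
print(codes)    # [0 1 0 2] -- each distinct label gets an integer code
print(uniques)  # the labels, in the order they were first seen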
Let's see what the wild data's natural feature correlations are¶
In [47]:
# test for correlations
corrDF=dfmod.corr()
corrDF
Out[47]:
The strongest natural correlation is between 'Original Loan Amount' and 'SONYMA DPAL/CCAL Amount' at 0.66, which actually isn't all that strong a correlation.
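If you'd rather eyeball it than scan the table, here's a quick heatmap sketch of corrDF, using the matplotlib import from the first cell:

# visualize the correlation matrix as a heatmap
fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(corrDF.values, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corrDF.columns)))
ax.set_yticks(range(len(corrDF.columns)))
ax.set_xticklabels(corrDF.columns, rotation=90)
ax.set_yticklabels(corrDF.columns)
fig.colorbar(im)
plt.show()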
So let's make some fake data that strongly correlates¶
We're going to create 'Grandmas Loan Agency', which will have an extremely strong correlation with 'SONYMA DPAL/CCAL Amount' -- 0.993, to be precise!
In [48]:
# scaling factors that ramp evenly from 0.9 to 1.1 (a deterministic ramp, despite the name 'randoms'),
# so the new column is nearly a rescaled copy of the SONYMA amount
randoms = np.linspace(0.9, 1.1, len(dfmod))
dfmod['Grandmas Loan Agency']=dfmod['SONYMA DPAL/CCAL Amount']*randoms
corrDF=dfmod.corr()
corrDF
Out[48]:
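As a sanity check (and in case you want to dial the correlation up or down), here's a small sketch that verifies the pairwise correlation directly and shows an alternative using genuinely random scale factors -- the wider the range, the weaker the correlation:

# check the pairwise correlation of the fake feature directly
print(dfmod[['SONYMA DPAL/CCAL Amount', 'Grandmas Loan Agency']].corr())

# alternative: genuinely random scale factors; widen the range to weaken the correlation
noisy = dfmod['SONYMA DPAL/CCAL Amount'] * np.random.uniform(0.9, 1.1, len(dfmod))
print(dfmod['SONYMA DPAL/CCAL Amount'].corr(noisy))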
Let's build a random forest model now¶
Make a function for ease of use.
In [97]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split

def ourModel(data, result):
    # inputs
    #   data   = pandas DataFrame of features (X)
    #   result = column of the desired result (y)
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data, result, test_size=0.25, random_state=1)
    # set up and fit the model
    clf = RandomForestRegressor(n_estimators=100, n_jobs=4, oob_score=True)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # note: sklearn's metrics expect (y_true, y_pred) in that order
    print('r2: ' + str(metrics.r2_score(y_test, predictions)))
    print('mse: ' + str(metrics.mean_squared_error(y_test, predictions)))
    # feature importances, sorted from most to least important
    importances = clf.feature_importances_
    indices = np.argsort(importances)[::-1]
    fp = list(zip(data.columns.values[indices], importances[indices]))
    return fp
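Since we pass oob_score=True, the fitted forest also carries an out-of-bag R² we could report as an extra sanity check. A hypothetical two-line addition inside ourModel, right after the fit:

# after clf.fit(...), the out-of-bag R-squared is available on the fitted forest
print('oob r2: ' + str(clf.oob_score_))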
Run the model¶
One run for the data without the fake feature and one for the set that includes it. We're going to use r^2 as our scoring metric; if we were being more rigorous we could use a variation of MSE or percent accuracy. Regression can be tough to score, but r^2 is a good high-level scoring metric for us.
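For intuition about the two numbers the function prints, a toy illustration with made-up values (not from our data):

from sklearn import metrics

y_true = [100000., 150000., 200000.]
y_pred = [110000., 140000., 205000.]
print(metrics.r2_score(y_true, y_pred))            # closer to 1 is better
print(metrics.mean_squared_error(y_true, y_pred))  # lower is better, in squared loan dollars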
In [98]:
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)
print('WITH MADE UP DATA')
fp_fake=ourModel(data_fake, result)
data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA")
fp_nofake=ourModel(data_nofake, result)
Wow, almost identical. Using the data without the made-up feature, we do marginally better. Whew.
Just for fun, let's engineer a few features and see if we can improve our model a bit more.¶
In [99]:
# let's add some financial data
# our purchase years range from 2004 to 2016. Let's get the average 30-year mortgage rate in those years
# googled and found at: http://www.freddiemac.com/pmms/pmms30.html
years=np.linspace(2004., 2016., 13)
mort30=np.array([5.84, 5.87, 6.41, 6.34, 6.03, 5.04, 4.69, 4.45, 3.66, 3.98, 4.17, 3.85, 3.65])
# look up the rate that matches each loan's purchase year
dfmod['mort']=[mort30[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]
#
# what else can we add? Maybe how many houses were purchased in that year
# from https://www.statista.com/statistics/219963/number-of-us-house-sales/
housesBought=np.array([1203, 1283, 1051, 776, 485, 375, 323, 306, 368, 429, 437, 501, 560])*1000.
dfmod['housesBought']=[housesBought[years==dfmod['Purchase Year'].iloc[item]][0] for item in range(len(dfmod['Purchase Year']))]
# and just because we don't want all our new data depending on year, let's do one about
# expected wealth by family size in NY
# source: https://www.justice.gov/ust/eo/bapcpa/20130501/bci_data/median_income_table.htm
# assume > 4 is = 4
# make a rough "wealthy vs. poor" flag: 1 for household sizes of 3 or 4, 0 otherwise
dfmod['income']=[0 if dfmod['Household Size '].iloc[x] < 3 or dfmod['Household Size '].iloc[x]\
                 > 4 else 1 for x in range(len(dfmod['Household Size '])) ]
#
# run the model again
dfmod=dfmod.dropna()
result=dfmod['Original Loan Amount']
data_fake=dfmod.drop(['Original Loan Amount'], axis=1)
print('WITH MADE UP DATA, Round 2!')
fp_fake=ourModel(data_fake, result)
data_nofake=dfmod.drop(['Original Loan Amount', 'Grandmas Loan Agency'], axis=1)
print("\nWITHOUT MADE UP DATA, Round 2!")
fp_nofake=ourModel(data_nofake, result)
# feature ranking for importances
print('\nFeatures in order of importance for fake:')
print(list(fp_fake))
print('\nFeatures in order of importance for NOT fake:')
print(list(fp_nofake))
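As an aside, those year-based lookups can be written more idiomatically (and faster) with a pandas map. A sketch of the same mortgage-rate lookup, reusing the years and mort30 arrays from above:

# build a {year: rate} dictionary and map it onto the Purchase Year column
rate_by_year = dict(zip(years, mort30))
dfmod['mort'] = dfmod['Purchase Year'].map(rate_by_year)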
The best model overall: Not made up data with additional features (lowest MSE)¶
Why? Because it combines more features, some of them relevant (e.g. the mortgage interest rate), without relying on made-up data. However, notice that in both the fake and not-fake runs, the heaviest weight is still on Grandma's Loan Agency and SONYMA. Even though the weight gets split between the two when the fake feature is included, the model is still leaning heavily on that one underlying signal.
What does this mean?
Random Forest takes collinearity into account in its tree splitting -- it's trying to get us the best result¶
Summary¶
- We found that our best model was one that used many features and didn't have fake data
- Collinearity didn't matter -- the RF model was making its splits based on what worked, not necessarily trying to balance features
- Always check to be sure!! (one way to do that is sketched below)
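One hedged way to "check to be sure" is to look at how importance gets shared between the collinear columns using scikit-learn's permutation importance. A minimal sketch, reusing the data_fake and result variables from above and re-fitting a fresh forest rather than the one inside ourModel:

from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# re-fit a forest on the feature set that includes 'Grandmas Loan Agency'
X_train, X_test, y_train, y_test = train_test_split(data_fake, result, test_size=0.25, random_state=1)
rf = RandomForestRegressor(n_estimators=100, n_jobs=4).fit(X_train, y_train)

# permutation importance shuffles one column at a time and measures the score drop;
# collinear features can mask each other here, which is exactly what we want to inspect
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=1)
for name, score in sorted(zip(data_fake.columns, perm.importances_mean), key=lambda t: -t[1]):
    print(name, round(score, 4))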