Random Forest — Method and Application (Python)

Advantages of RF:

Little time is needed for tuning (the default parameters are usually good enough)
Robust to outliers and correlated variables
Handles continuous variables by segmenting them at split points

Method:

Create a bootstrapped dataset (sample rows with replacement)
Build a decision tree on the bootstrapped dataset
But only consider a random subset of variables at each split
i.e. at each split, draw a fresh random subset of the variables as candidates
and pick the best split among them (variables used in earlier splits can be drawn again)
Repeat the steps above to get, say, 100 trees
Prediction:
Assign an observation to the class that receives the most votes across the 100 trees
How to validate? OOB (out-of-bag) samples, i.e. the rows left out of a tree's bootstrapped dataset
i. Run each OOB observation through the trees that did not see it and check whether the majority vote classifies it correctly
ii. The proportion of OOB samples that are incorrectly classified is the OOB error
Now that we know how to validate an RF, we can use this to choose the number of variables to consider at each split in step 2: try different numbers of variables and compare the OOB errors
Note: typically start with sqrt(number of variables) and try a few values above and below it.
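
The tuning loop described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the data comes from make_classification as a synthetic stand-in (an assumption), and the names candidates, oob_errors and best_m are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 16 variables, binary outcome
X, y = make_classification(n_samples=500, n_features=16, n_informative=6,
                           random_state=0)

# Start at sqrt(number of variables) and try a few values around it
start = int(np.sqrt(X.shape[1]))  # sqrt(16) = 4
candidates = [start - 2, start - 1, start, start + 1, start + 2]

oob_errors = {}
for m in candidates:
    rf_m = RandomForestClassifier(n_estimators=100, max_features=m,
                                  oob_score=True, random_state=0)
    rf_m.fit(X, y)
    # oob_score_ is the OOB accuracy, so the OOB error is its complement
    oob_errors[m] = 1 - rf_m.oob_score_

# Keep the number of variables with the lowest OOB error
best_m = min(oob_errors, key=oob_errors.get)
print(oob_errors)
print("best max_features:", best_m)
```

No separate validation set is needed here, because every row is scored only by the trees that did not see it during training.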

Steps for RF Application:

a. Build Random Forest:
1. create dummy variables for the categorical variables
2. split into train and test sets
3. train the model
4. get the accuracy and confusion matrix
5. check for overfitting
  - (by comparing the OOB error rate on train with the error rate on test)
6. if the OOB error and the test error are similar, the model is not overfitting
7. (for fraud) build the ROC curve and look for a possible cut-off point
  - maximize 1 - class 1 error (false negative rate) - class 0 error (false positive rate)
b. Plot variable importance for insights into each variable
1. Plot PDPs (Partial Dependence Plots)
  - for insights into each level of each variable

a. Build Random Forest:

# useful packages
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# dummy variables for the categorical ones (data is the raw pandas DataFrame)
data_dummy = pd.get_dummies(data, drop_first=True)
np.random.seed(1234)  # seed so the split and the forest are reproducible

# split into train and test so we can check for overfitting
train, test = train_test_split(data_dummy, test_size=0.34)
  
#build the model
rf = RandomForestClassifier(n_estimators=100, max_features=3, oob_score=True)
rf.fit(train.drop('outcome_column', axis=1), train['outcome_column'])
  
# print OOB accuracy and confusion matrix
# (oob_decision_function_[:, 1] is the OOB probability of class 1; rounding it gives the predicted class)
print("OOB accuracy is", rf.oob_score_, "\n",
      "OOB Confusion Matrix", "\n",
      pd.DataFrame(confusion_matrix(train['outcome_column'],
                                    rf.oob_decision_function_[:, 1].round(),
                                    labels=[0, 1])))

#and let's print test accuracy and confusion matrix
print(
"Test accuracy is", rf.score(test.drop('outcome_column', axis=1),
                             test['outcome_column']), 
"\n", 
"Test Set Confusion Matrix", 
"\n",
pd.DataFrame(confusion_matrix(test['outcome_column'],
                              rf.predict(test.drop('outcome_column', axis=1)),
                              labels=[0, 1]))
)

If the OOB accuracy (measured on train) and the test accuracy are similar, the model is not overfitting.
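
Step 7 of the outline (looking for a cut-off point on the ROC curve) could look roughly like the sketch below. It assumes an imbalanced synthetic dataset as a stand-in for fraud data, and the names rf_fraud, j, cutoff and pred are illustrative. Treating class 1 error as the false negative rate and class 0 error as the false positive rate, 1 - class 1 error - class 0 error equals the true positive rate minus the false positive rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for fraud data (assumption): about 10% positives
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0, stratify=y)

rf_fraud = RandomForestClassifier(n_estimators=100, random_state=0)
rf_fraud.fit(X_train, y_train)

# Predicted probability of class 1 on the test set
proba = rf_fraud.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)

# 1 - class 1 error - class 0 error == tpr - fpr
j = tpr - fpr
best = int(np.argmax(j))
cutoff = thresholds[best]
print("cut-off:", cutoff, "TPR:", tpr[best], "FPR:", fpr[best])

# Classify with the chosen cut-off instead of the default 0.5
pred = (proba >= cutoff).astype(int)
```

Whether this particular criterion is the right one depends on the relative cost of missed fraud versus false alarms; it weights both error rates equally.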

Question 1: If the response variable is continuous, how do we define accuracy?

Answer: Change the accuracy standard

If the prediction is within 25% of the actual value, we count it as correct.
E.g. if a given person's salary is 100K,
we consider the prediction correct if it is within 25K of that value.
We can treat this as a sort of accuracy when the label is continuous.

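# here rf is assumed to be a regression forest (e.g. RandomForestRegressor) fit on a continuous outcome
pred = rf.predict(test.drop('outcome_column', axis=1))
accuracy_25pct = ((pred / test['outcome_column'] - 1).abs() < .25).mean()
print("We are within 25% of the actual outcome in ",
      round(accuracy_25pct * 100), "% of the cases", sep="")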
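
To make the arithmetic concrete, here is a small worked example of the within-25% rule on four made-up salaries. The helper name within_pct_accuracy and the numbers are hypothetical, and the rule assumes no actual value is zero.

```python
import numpy as np

def within_pct_accuracy(y_true, y_pred, tol=0.25):
    """Share of predictions whose relative error is below tol."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel_err = np.abs(y_pred / y_true - 1)
    return float((rel_err < tol).mean())

# Hypothetical salaries and predictions
actual = [100_000, 50_000, 80_000, 200_000]
predicted = [120_000, 70_000, 81_000, 140_000]

# Relative errors: 20% (ok), 40% (miss), 1.25% (ok), 30% (miss)
print(within_pct_accuracy(actual, predicted))  # 0.5
```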

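Question 2: How do we know the model is actually learning something?

Answer: Compare the model's accuracy with how spread out the outcome is, e.g. by looking at its deciles. If the outcome varies widely across deciles and the predictions are still close to the actual values, the model is learning. (If so, the insights generated from the RF will be fairly reliable, and at least directionally true.)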

#deciles
print(np.percentile(train['outcome_column'], np.arange(0, 100, 10)))
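
One possible way to turn this check into a number, offered as a sketch rather than as the method these notes prescribe, is to compare the forest with a naive baseline that always predicts the training median. The synthetic data and the names rf_reg, within_25pct, model_acc and baseline_acc are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), shifted so every outcome is positive
X, y = make_regression(n_samples=1000, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)
y = y - y.min() + 100
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0)

# Deciles (including min and max) show how spread out the outcome is
deciles = np.percentile(y_train, np.arange(0, 101, 10))
print(deciles)

def within_25pct(y_true, y_pred):
    """Share of predictions within 25% of the actual value."""
    return float((np.abs(y_pred / y_true - 1) < 0.25).mean())

rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
rf_reg.fit(X_train, y_train)
model_acc = within_25pct(y_test, rf_reg.predict(X_test))

# Naive baseline: always predict the median of the training outcome
baseline_pred = np.full_like(y_test, np.median(y_train))
baseline_acc = within_25pct(y_test, baseline_pred)

print("model:", model_acc, "baseline:", baseline_acc)
```

If the model's within-25% accuracy is clearly above the baseline's, that suggests it has picked up real structure rather than just the typical value of the outcome.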

b. Variable Importance (check each variable)

Note:

Variable importance is useful for deciding whether a rebuild is needed.
Is the most important variable also the least actionable (e.g. total pages visited by the user)?
- if so, drop that variable from the data and refit the RF
Continuous variables tend to rank high in RF importance, so if a categorical variable sits at the top of the importance list, it is a really important variable.
If two variables are likely correlated, check with the Pearson correlation. Once a tree has split on one of them, the other adds little new information, so RF is fairly robust to correlated variables, which is one reason it is so popular.

# Var Imp
# rf is the random forest model we previously built
feat_importances = pd.Series(rf.feature_importances_,
                             index=train.drop('outcome_column', axis=1).columns)
feat_importances.sort_values().plot(kind='barh')
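
The rebuild mentioned in the note (drop a top variable that is not actionable, then refit) might look like the sketch below. The data is synthetic, the column names var0 to var5 and the helper fit_and_rank are made up, and the top variable is simply assumed to be the non-actionable one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption) with hypothetical column names
X_arr, y = make_classification(n_samples=500, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"var{i}" for i in range(6)])

def fit_and_rank(features, target):
    """Fit a forest and return it with its importances, largest first."""
    model = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   random_state=0)
    model.fit(features, target)
    ranked = pd.Series(model.feature_importances_,
                       index=features.columns).sort_values(ascending=False)
    return model, ranked

rf_full, imp_full = fit_and_rank(X, y)
top_var = imp_full.index[0]  # assume this variable is not actionable

# Drop it and refit to see which variables matter among the rest
rf_reduced, imp_reduced = fit_and_rank(X.drop(columns=[top_var]), y)

print("dropped:", top_var)
print(imp_reduced)
print("OOB accuracy before/after:", rf_full.oob_score_, rf_reduced.oob_score_)
```

Comparing the OOB accuracy before and after the drop gives a rough sense of how much predictive power is traded away for more actionable insights.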

# PDP for column1, which has 3 levels ('level1', 'level2', 'level3')
from pdpbox import pdp, info_plots
pdp_iso = pdp.pdp_isolate( model=rf, 
                          dataset=train.drop(['outcome_column'], axis=1),      
                          model_features=list(train.drop(['outcome_column'], axis=1)), 
                          feature=['level1', 'level2', 'level3'], 
                          num_grid_points=50)
pdp_dataset = pd.Series(pdp_iso.pdp, index=pdp_iso.display_columns)
pdp_dataset.sort_values(ascending=False).plot(kind='bar', title='column1')
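
For intuition about what a partial dependence plot computes, here is a hand-rolled sketch that does not rely on pdpbox: fix one variable at a grid value for every row, average the predicted probability, and repeat for each grid value. The synthetic data and the function name partial_dependence_1d are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rf_pdp = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average predicted P(class 1) with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

# Grid of 10 percentiles of the first feature
grid = np.percentile(X[:, 0], np.arange(5, 100, 10))
pdp_values = partial_dependence_1d(rf_pdp, X, 0, grid)

for g, p in zip(grid, pdp_values):
    print(round(float(g), 2), round(float(p), 3))
```

For a categorical variable encoded as dummies, the same idea applies with one pass per level: set that level's dummy column to 1 and the sibling dummy columns to 0.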

# Pearson Corr
from scipy.stats import pearsonr
print("Correlation between A and B is:", 
      round(pearsonr(data.A, data.B)[0],2))
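
When there are many variables, checking pairs one at a time gets tedious. A sketch of scanning all pairs at once with pandas is shown below; the toy DataFrame, the 0.8 threshold and the names pairs and high are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): B is nearly a multiple of A, C is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.1, size=200),
    "C": rng.normal(size=200),
})

# Pearson correlation matrix for every pair of columns
corr = df.corr(method="pearson")

# Keep each pair once by masking everything except the upper triangle
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Flag strongly correlated pairs (threshold is an arbitrary choice)
high = pairs[pairs.abs() > 0.8]
print(high)
```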