Look at the ranking of important variables. If the top one is the least actionable variable, meaning the company cannot change it, delete it and rebuild the RF.
Check whether the top variables are continuous or categorical.
Continuous variables tend to show up at the top of RF variable importance plots, because impurity-based importance favors features with many possible split points.
If a categorical variable shows up at the top, it usually means it is really important.
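A minimal sketch of the "rank, drop the non-actionable top variable, rebuild" loop. The data is synthetic and every column name (`customer_age`, `income`, `is_mobile`) and the helper `ranked_importance` are hypothetical stand-ins, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data (column names are assumptions)
rng = np.random.default_rng(0)
n = 2000
train = pd.DataFrame({
    "customer_age": rng.integers(18, 70, n),   # something the company cannot change
    "income": rng.normal(60_000, 15_000, n),
    "is_mobile": rng.integers(0, 2, n),
})
logit = 0.08 * (train["customer_age"] - 40) + 0.8 * train["is_mobile"]
train["outcome"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def ranked_importance(df, target="outcome"):
    """Fit an RF and return impurity-based importances, largest first."""
    X = df.drop(columns=[target])
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X, df[target])
    return pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

full_rank = ranked_importance(train)
print(full_rank)

# If the top variable is not actionable, drop it and rebuild the RF
dropped = full_rank.index[0]
reduced_rank = ranked_importance(train.drop(columns=[dropped]))
print(reduced_rank)
```

Comparing `full_rank` with `reduced_rank` shows which variables start to matter once the dominant one is gone.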
2. PDP plot:
For categorical features with multiple levels:
Always remember there is a base level that is not plotted.
If all levels are strongly positive, all of them have higher outcome values than the base level, which means the base level has the lowest outcome value.
For binary features:
The plot is usually straightforward.
For continuous features:
Check the trend and pick a cut point.
E.g., people with more than 70k income (feature) tend to have a higher success rate (outcome value).
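To make the continuous case concrete, here is a rough sketch that computes a one-dimensional partial dependence curve by hand: force the feature to each grid value, then average the predicted probability. The data is synthetic, and `income`, `is_mobile`, and the helper `partial_dependence_1d` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1500
income = rng.uniform(20_000, 120_000, n)   # hypothetical continuous feature
is_mobile = rng.integers(0, 2, n)          # hypothetical binary feature
X = np.column_stack([income, is_mobile])
# Synthetic outcome: success is more likely above 70k income
y = (rng.random(n) < np.where(income > 70_000, 0.6, 0.2)).astype(int)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=20, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, col, grid):
    """Average predicted P(class 1) when column `col` is set to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value          # overwrite the feature for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

grid = np.linspace(30_000, 110_000, 9)
pdp = partial_dependence_1d(rf, X, col=0, grid=grid)
for g, v in zip(grid, pdp):
    print(f"income={g:>9,.0f}  avg P(success)={v:.3f}")
```

The printed curve should jump around the 70k mark, which is the kind of cut point to look for when reading a PDP for a continuous feature.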
3. Build a simple DT to check 2 or 3 important segments
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source
# Shallow tree; class 1 is weighted 10x relative to class 0
X_train = train.drop(['outcome'], axis=1)
tree = DecisionTreeClassifier(max_depth=2, class_weight={0: 1, 1: 10}, min_impurity_decrease=0.001)
tree.fit(X_train, train['outcome'])
# Visualize it: export to a .dot file, then render that same file
export_graphviz(tree, out_file="tree_conversion.dot", feature_names=X_train.columns, proportion=True, rotate=True)
s = Source.from_file("tree_conversion.dot")
s.view()
Each node shows 4 values:
The split rule
Gini index:
Represents the impurity of the node; 0.5 is the worst possible value for a binary outcome
0 means perfect classification, the best possible value
It is the probability that a randomly chosen sample in the node would be incorrectly labeled if it were labeled at random according to the distribution of classes in the node
The sample-weighted average Gini impurity decreases as we move down the tree
Samples:
Proportion of all samples that fall in that node (because proportion=True), the higher the better
A high value means the node is important because it captures many people
Value:
Proportion of class 0 and class 1 events in the node; with class_weight set, these proportions are computed on the weighted samples
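As a worked illustration of the Gini numbers in a node, the sketch below computes the impurity as 1 minus the sum of squared class proportions, and the sample-weighted average over the children of a split. The split proportions are made up for the example.

```python
def gini_impurity(class_proportions):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions)

def weighted_gini(children):
    """Sample-weighted average impurity; children is a list of (n_samples, class_proportions)."""
    total = sum(n for n, _ in children)
    return sum(n / total * gini_impurity(props) for n, props in children)

print(gini_impurity([0.5, 0.5]))   # worst case for two classes
print(gini_impurity([1.0, 0.0]))   # pure node

# Hypothetical split of a 50/50 parent with 100 samples into two children
parent = gini_impurity([0.5, 0.5])
after_split = weighted_gini([(60, [0.8, 0.2]), (40, [0.05, 0.95])])
print(parent, after_split)         # the weighted average drops after the split
```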
If a variable appears in the tree at every level,
it probably carries information about the other features as well.
Note: often one variable is far more important than the others. This can happen because it is highly correlated with the other variables, so try to get to the bottom of the relationship between the most important variable and the others. Or try removing that feature and see which variables start to matter.
Plot the important feature against the outcome to investigate how that feature influences the outcome.
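One simple way to look at feature vs. outcome before plotting is to bin the feature and compare the outcome rate per bin. This is only a sketch on synthetic data; the `income` column and the 70k effect are assumptions chosen to mirror the earlier example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 3000
df = pd.DataFrame({"income": rng.uniform(20_000, 120_000, n)})   # hypothetical feature
df["outcome"] = (rng.random(n) < np.where(df["income"] > 70_000, 0.6, 0.2)).astype(int)

# Cut the feature into 5 equal-sized bins and compute the outcome rate in each
df["income_bin"] = pd.qcut(df["income"], q=5)
rate_by_bin = df.groupby("income_bin", observed=True)["outcome"].agg(["mean", "size"])
print(rate_by_bin)
```

Calling `rate_by_bin["mean"].plot(kind="bar")` would turn the same table into a plot, assuming matplotlib is installed.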