Scikit-Learn#


Random Forest#

Description#

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Each tree of a random forest is trained on a randomised subset of the original training set.
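
A minimal sketch of that subsetting, mirroring scikit-learn's default bootstrap behaviour (drawing as many samples as the training set contains, with replacement; the training-set size here is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
n_samples = 100  # hypothetical training-set size

# Bootstrap sample: draw n_samples indices with replacement;
# each tree is then fitted on X[indices], y[indices]
indices = rng.integers(0, n_samples, size=n_samples)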

How does it work#

At each node of a tree, the algorithm tests every possible split and keeps the one that yields the highest information gain. A split takes one feature and one threshold, and separates the samples whose value for that feature is below or equal to the threshold from the samples whose value is above it.
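
As a minimal sketch (the data, feature index and threshold are hypothetical), a single candidate split on a NumPy array looks like this:

import numpy as np

X = np.array([[1.0, 3.0], [2.0, 1.0], [3.5, 2.0],
              [4.0, 5.0], [5.5, 0.5], [6.0, 4.0]])

feature, threshold = 0, 3.5  # hypothetical candidate split

# Samples with a value <= threshold go left, the others go right
left_mask = X[:, feature] <= threshold
X_left, X_right = X[left_mask], X[~left_mask]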

Several methods exist to measure the information gain; one that is often used is the Gini impurity. The Gini impurity of a node is calculated with the following formula:

\[i(\tau) = 1 - \rho_{1}^2 - \rho_{0}^2\]

with \(\rho_{k} = \frac{n_{k}}{n}\), where

\(k\) = class index (0 or 1),

\(n_{k}\) = number of samples of class \(k\) in the node,

\(n\) = total number of samples in the node,

\(\tau\) = the node.
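
A direct translation of this formula into Python, checked against the class counts of the tree example further below:

import numpy as np

def gini_impurity(counts):
    # i(tau) = 1 - sum_k rho_k^2, with rho_k = n_k / n
    rho = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(rho ** 2)

print(gini_impurity([50, 50]))  # 0.5  (perfectly mixed node)
print(gini_impurity([20, 0]))   # 0.0  (pure node)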

The information gain of a split is then calculated like this:

First calculate the Gini impurity of the two child nodes. Then subtract from the parent's impurity the two child impurities, each weighted by the fraction of samples in the corresponding child node:

\[\Delta i(\tau) = i(\tau) - \frac{n_{\tau_{left}}}{n_{\tau}} \, i(\tau_{left}) - \frac{n_{\tau_{right}}}{n_{\tau}} \, i(\tau_{right})\]
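
The same calculation in Python, reusing gini_impurity from the sketch above (the class counts here are hypothetical):

def impurity_decrease(parent, left, right):
    # Delta i(tau): parent impurity minus the weighted child impurities
    n = sum(parent)
    return (gini_impurity(parent)
            - sum(left) / n * gini_impurity(left)
            - sum(right) / n * gini_impurity(right))

print(impurity_decrease([40, 60], [30, 10], [10, 50]))  # ~0.163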

For a classifier with only two classes, the Gini impurity of a node follows this function:

[Figure: Gini impurity of a node as a function of the class proportion]
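
Concretely, writing \(\rho\) for the proportion of class 1 (so \(\rho_{1} = \rho\) and \(\rho_{0} = 1 - \rho\)), the formula above reduces to:

\[i(\tau) = 1 - \rho^2 - (1 - \rho)^2 = 2\rho(1 - \rho)\]

which is 0 for a pure node (\(\rho \in \{0, 1\}\)) and reaches its maximum of 0.5 at \(\rho = 0.5\).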

Simple tree example:

digraph Tree {
node [style="filled, rounded", color="black"] ;
edge [fontname=helvetica] ;
0 [label="τ 0\ngini = 0.5\nsamples = 100\nvalue = [50, 50]", fillcolor="#eef7fd"] ;
1 [label="τ left\ngini = 0.0\nsamples = 20\nvalue = [20, 0]", fillcolor="#eef7fd"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="τ right\ngini = 0.46875\nsamples = 80\nvalue = [30, 50]", fillcolor="#eef7fd"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
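
Plugging this tree's numbers into the gain formula above gives the gain of the root split:

\[\Delta i(\tau_{0}) = 0.5 - \frac{20}{100} \times 0.0 - \frac{80}{100} \times 0.46875 = 0.125\]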

How to use#

In Scikit-Learn the random forest classifier can be imported like this:

from sklearn.ensemble import RandomForestClassifier

Usage example:

from sklearn.datasets import make_classification
# Get dataset example with 1000 samples and 4 features
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
# Create a random forest model
randomForest = RandomForestClassifier(max_depth=2, n_estimators=100, criterion="gini", random_state=0)
# Train the model with dataset
randomForest.fit(X, y)
# Print importances of features and predict new sample
print(randomForest.feature_importances_)
print(randomForest.predict([[0, 0, 0, 0]]))

This will produce the following outputs:

[0.14205973 0.76664038 0.0282433  0.06305659]
[1]
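
The fitted classifier also provides predict_proba, which returns class probabilities averaged over the trees:

# Class probability estimates for the same sample,
# averaged over the 100 trees of the forest
print(randomForest.predict_proba([[0, 0, 0, 0]]))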

Accuracy Checking#

Checking the accuracy of the model on the training set:

from sklearn.metrics import accuracy_score

# Validate the model on the data it was trained on
y_predict = randomForest.predict(X)
result = accuracy_score(y, y_predict)
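
Note that accuracy measured on the training set is optimistic, since the forest has already seen these samples. A minimal sketch of a more honest check with a held-out split (reusing X, y and the imports from the examples above):

from sklearn.model_selection import train_test_split

# Hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(max_depth=2, n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))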

Trees Visualisation#

Visualising a tree from the random forest inside JupyterLab:

from subprocess import call

from sklearn.tree import export_graphviz
from IPython.display import Image

# Extract a single tree from the fitted forest
estimator = randomForest.estimators_[0]

# Export it as a dot file (X from make_classification has no column
# names, so the feature names are given explicitly)
export_graphviz(estimator, out_file='tree.dot',
                feature_names=[f"feature {i}" for i in range(4)],
                class_names=["class A", "class B"],  # Need to pass a list of class names
                rounded=True, proportion=False,
                precision=2, filled=True)

# Convert to png using a system command (requires Graphviz)
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in JupyterLab
Image(filename='tree.png')
[Image: example tree exported from the random forest]
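
As an alternative that avoids the Graphviz system dependency, scikit-learn also ships sklearn.tree.plot_tree, which renders a tree directly with matplotlib. A minimal sketch for the model fitted above:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot the first tree of the forest
fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(randomForest.estimators_[0],
          feature_names=[f"feature {i}" for i in range(4)],
          class_names=["class A", "class B"],
          filled=True, rounded=True, precision=2, ax=ax)
plt.show()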
