Oct 19, 2021 3 min read

What is Isolation Forest?

Short Answer

Isolation Forest (iForest) is a machine learning algorithm for anomaly detection. Instances, which have an average shorter path length in the trained isolation forest, are classified as anomalous points.

Long Answer

The isolation Forest algorithm is a very effective and intuitive anomaly detection method, which was first proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. (F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of the IEEE International Conference on Data Mining, pages 413–422, 2008.)
The philosophy behind iForest is that anomalous data points are few and exotic. That makes them isolated from the normal points.
From a mathematical point of view, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root.

Algorithm

Let \(\boldsymbol{X} = \{x_1, \ldots, x_n\}\) be a set of \(d\)-dimensional points, and \(\boldsymbol{X}'\) is a subsample from \(\boldsymbol{X}\). An Isolation Tree (iTree) is defined as a data structure with the following properties:
1. For each node \(\boldsymbol{T}\) in the Tree, \(\boldsymbol{T}\) is either an external-node with no child, or an internal-node with one "test" and exactly two daughter nodes (\(\boldsymbol{T}_ {l}\) and \(\boldsymbol{T}_ {r}\)).
2. A test at node \(\boldsymbol{T}\) consists of an attribute \(\boldsymbol{q}\) and a split value \(\boldsymbol{p}\) such that the test \(\boldsymbol{q}<\boldsymbol{p}\) determines the traversal of a data point to either (\(\boldsymbol{T}_ {l}\) or \(\boldsymbol{T}_ {r}\)).
In order to build an iTree, the algorithm recursively divides \(\boldsymbol{X}'\) by randomly selecting an attribute \(\boldsymbol{q}\) and a split value \(\boldsymbol{p}\), until either
1. The node has only one instance, or
2. All data at the node have the same values.
When the iTree is fully grown, each point in \(\boldsymbol{X}\) is isolated at one of the external nodes. Intuitively, the anomalous points are those with the smaller path length in the tree, where the path length \(h(\boldsymbol{x}_ {i})\) of point \(\boldsymbol{x}_ {i} \in \boldsymbol{X}\) is defined as the number of edges \(\boldsymbol{x}_ {i} \) traverses from the root node to get to an external node.

Anomaly Detection with Isolation Forest

Anomaly detection with iForest is a process composed of two main stages:
1. Build iTrees on a training dataset.
2. An anomaly score is calculated for each instance.
3. Determine the anormalous points whose anomaly scores are greater than a predefined threshold.

Anomaly Score

The structure of an iTree is substantially equivalent to that of a binary search tree (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.
As a consequence, the estimation of average \(h(\boldsymbol{x})\) for external node ternimations is the same as that of the unsuccessful searches in BST, that is

\[c(m)= \begin{cases}2 H(m-1)-\frac{2(m-1)}{n} & \text { for } m>2 \\ 1 & \text { for } m=2 \\ 0 & \text { otherwise }\end{cases}\]

where \(n\) is the testing data size, m is the size of the sample set and \(H\) is the harmonic number, which can be estimated by \(H(i) = ln(i) + \gamma\), where \(\gamma = 0.5772156629\) is the Euler-Mascheroni constant.

The value of \(c(m)\) above represents the average of \(h(\boldsymbol{x})\) from a collection of iTrees. It is interesting to note that for any given instance \(\boldsymbol{x}\):
- If \(s\) is close to 1, then \(\boldsymbol{x}\) is very likely to be an anomaly.
- If \(s\) is smaller than 0.5, then \(\boldsymbol{x}\) is likely to be a normal value.
- If for a given sample all instances are assigned an anomaly score of around 0.5, then it is safe to assume that the sample doesn't have any anomaly.
Python code for iForest:
from sklearn.ensemble import IsolationForest
clf = IsolationForest(random_sate=0).fit(X_train)
clf.predict(X_test)