What is Isolation Forest?

Short Answer

  • Isolation Forest (iForest) is a machine learning algorithm for anomaly detection. Instances that have a shorter average path length in the trained isolation forest are classified as anomalous points.

Long Answer

  • The Isolation Forest algorithm is an effective and intuitive anomaly detection method, first proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. (F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of the IEEE International Conference on Data Mining, pages 413–422, 2008.)

  • The philosophy behind iForest is that anomalous data points are few and different, which makes them easier to isolate than the normal points.

  • From a mathematical point of view, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root.

Algorithm
  • Let \(\boldsymbol{X} = \{x_1, \ldots, x_n\}\) be a set of \(d\)-dimensional points, and let \(\boldsymbol{X}'\) be a subsample of \(\boldsymbol{X}\). An Isolation Tree (iTree) is defined as a data structure with the following properties:

    1. For each node \(\boldsymbol{T}\) in the tree, \(\boldsymbol{T}\) is either an external node with no children, or an internal node with one "test" and exactly two daughter nodes (\(\boldsymbol{T}_ {l}\) and \(\boldsymbol{T}_ {r}\)).
    2. A test at node \(\boldsymbol{T}\) consists of an attribute \(\boldsymbol{q}\) and a split value \(\boldsymbol{p}\) such that the test \(\boldsymbol{q}<\boldsymbol{p}\) determines whether a data point traverses to \(\boldsymbol{T}_ {l}\) or to \(\boldsymbol{T}_ {r}\).
  • In order to build an iTree, the algorithm recursively divides \(\boldsymbol{X}'\) by randomly selecting an attribute \(\boldsymbol{q}\) and a split value \(\boldsymbol{p}\), until either

    1. The node has only one instance, or
    2. All data at the node have the same values.
  • When the iTree is fully grown, each point in \(\boldsymbol{X}'\) is isolated at one of the external nodes. Intuitively, the anomalous points are those with shorter path lengths in the tree, where the path length \(h(\boldsymbol{x}_ {i})\) of point \(\boldsymbol{x}_ {i} \in \boldsymbol{X}\) is defined as the number of edges \(\boldsymbol{x}_ {i}\) traverses from the root node to reach an external node.
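The recursive construction above can be sketched in a few lines of Python. This is an illustrative toy implementation, not the library version: `build_itree` and `path_length` are hypothetical names, and a depth cap is added as a practical safeguard.

```python
import random

def build_itree(points, depth=0, max_depth=50):
    """Recursively split on a random attribute q and random split value p."""
    # External node: a single point is isolated, or the depth cap is hit
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    q = random.randrange(len(points[0]))       # randomly selected attribute
    lo = min(p[q] for p in points)
    hi = max(p[q] for p in points)
    if lo == hi:                               # all values identical: stop
        return {"size": len(points)}
    split = random.uniform(lo, hi)             # randomly selected split value
    return {"q": q, "p": split,
            "left":  build_itree([p for p in points if p[q] < split],
                                 depth + 1, max_depth),
            "right": build_itree([p for p in points if p[q] >= split],
                                 depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """Number of edges from the root to the external node that isolates x."""
    if "size" in node:                         # reached an external node
        return depth
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)
```

In practice an outlier tends to reach an external node after fewer splits than a point inside a dense cluster, which is exactly the intuition the algorithm exploits.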

Anomaly Detection with Isolation Forest
  • Anomaly detection with iForest is a process composed of two main stages:
    1. In the training stage, iTrees are built on subsamples of the training dataset.
    2. In the evaluation stage, an anomaly score is calculated for each instance, and instances whose anomaly scores are greater than a predefined threshold are flagged as anomalous.
Anomaly Score
  • The structure of an iTree is substantially equivalent to that of a binary search tree (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.

  • As a consequence, the estimate of the average \(h(\boldsymbol{x})\) over external-node terminations is the same as that of the unsuccessful searches in a BST, that is

\[c(m)= \begin{cases}2 H(m-1)-\frac{2(m-1)}{m} & \text { for } m>2 \\ 1 & \text { for } m=2 \\ 0 & \text { otherwise }\end{cases}\]

where \(m\) is the size of the sample set and \(H\) is the harmonic number, which can be estimated by \(H(i) \approx \ln(i) + \gamma\), where \(\gamma \approx 0.5772156649\) is the Euler-Mascheroni constant.
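A direct transcription of \(c(m)\) with the harmonic-number estimate, as a minimal sketch (the function name `c` mirrors the notation; the logarithmic approximation of \(H\) is the one given above):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(m):
    """Average path length of an unsuccessful BST search over m points."""
    if m > 2:
        h = math.log(m - 1) + EULER_GAMMA   # estimate of H(m - 1)
        return 2 * h - 2 * (m - 1) / m
    if m == 2:
        return 1.0
    return 0.0
```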

  • The value of \(c(m)\) above is used to normalize the expected path length \(E(h(\boldsymbol{x}))\), obtained by averaging \(h(\boldsymbol{x})\) over a collection of iTrees, yielding the anomaly score

\[s(\boldsymbol{x}, m)=2^{-\frac{E(h(\boldsymbol{x}))}{c(m)}}\]

  • It is interesting to note that for any given instance \(\boldsymbol{x}\):

    • If \(s\) is close to 1, then \(\boldsymbol{x}\) is very likely to be an anomaly.
    • If \(s\) is smaller than 0.5, then \(\boldsymbol{x}\) is likely to be a normal value.
    • If all instances in a given sample are assigned an anomaly score of around 0.5, then it is safe to assume that the sample doesn't have any distinct anomalies.
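The anomaly score \(s(\boldsymbol{x}, m)=2^{-E(h(\boldsymbol{x}))/c(m)}\) can be computed directly once the average path length is known. A self-contained sketch (function names are illustrative; the value of \(\gamma\) is the Euler-Mascheroni constant):

```python
import math

def c(m):
    """Normalization term: average unsuccessful BST search length for m points."""
    if m > 2:
        return 2 * (math.log(m - 1) + 0.5772156649) - 2 * (m - 1) / m
    return 1.0 if m == 2 else 0.0

def anomaly_score(avg_path_length, m):
    """s(x, m) = 2 ** (-E[h(x)] / c(m)); higher means more anomalous."""
    return 2 ** (-avg_path_length / c(m))
```

Note that an average path length exactly equal to \(c(m)\) gives a score of exactly 0.5, and the score approaches 1 as the path length approaches 0, which matches the three cases listed above.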
  • Python code for iForest:

    from sklearn.ensemble import IsolationForest

    # Fit on training data, then label test points: 1 = normal, -1 = anomaly
    clf = IsolationForest(random_state=0).fit(X_train)
    clf.predict(X_test)
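Assuming scikit-learn and NumPy are available, a fuller end-to-end sketch on synthetic data may help; the dataset shapes and the outlier at (8, 8) are illustrative assumptions, and `n_estimators=100` is simply the library default made explicit.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(0, 1, size=(200, 2))           # mostly normal points
X_test = np.vstack([rng.normal(0, 1, size=(5, 2)),  # normal test points
                    [[8.0, 8.0]]])                  # one obvious outlier

clf = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
labels = clf.predict(X_test)        # +1 = normal, -1 = anomaly
scores = clf.score_samples(X_test)  # lower score = more anomalous
```

Here `score_samples` returns the negated anomaly score, so the most anomalous point has the lowest value, and `predict` thresholds it according to the `contamination` parameter.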