Notebook Six | Repository

Nearest Neighbors and Outlier Detection

Andrea Leone
University of Trento
January 2022


NearestCentroid Classifier


A nearest centroid classifier assigns to each observation the label of the class whose training-sample mean (centroid) is closest to it. Euclidean distance is the default metric used to compute the distance between instances.

score board — NearestCentroid

pipeline         accuracy  precision recall     cm_d

en_core_web_lg   .63661971 .63104837 .62317293  150 204  98
en_core_web_lg   .64603174 .63519903 .63491569  152 167  88  without outliers (pm=LOF)
en_core_web_lg   .67982456 .66279152 .65754219  105 148  57  without outliers (pm=IF)

en_core_web_trf  .41043723 .41442327 .40269239  135  90  66
en_core_web_trf  .40771812 .40518234 .40022292   88 106  49  without outliers (pm=LOF)
en_core_web_trf  .43167701 .42130935 .42467788   54  58  27  without outliers (pm=IF)
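A minimal sketch of the classifier described above, using scikit-learn's NearestCentroid on a synthetic three-class dataset (the dataset and parameters here are illustrative, not the notebook's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid
from sklearn.metrics import accuracy_score

# Toy stand-in for the document's embedding features
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Euclidean distance is the default metric
clf = NearestCentroid(metric="euclidean")
clf.fit(X_train, y_train)

# One centroid per class; prediction picks the nearest one
acc = accuracy_score(y_test, clf.predict(X_test))
```

After fitting, `clf.centroids_` holds one row per class, and each test point is labeled by the closest row.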


K-Neighbors Classifier


The KNN classifier is a non-parametric method that predicts the class of an input point by a plurality vote among its k nearest neighbors; it can also output per-class membership probabilities. The neighbor weights and the search algorithm are tunable hyperparameters.

score board — KNeighborsClassifier

pipeline         accuracy  precision recall      cm_d

en_core_web_lg   .67887323 .66974877 .66746913   180 198 104
en_core_web_lg   .71587301 .70719417 .70272845   181 174  96  without outliers (pm=LOF)
en_core_web_lg   .72807017 .70811300 .70502846   118 154  60  without outliers (pm=IF)

en_core_web_trf  .53878702 .51926733 .51939140   153 173  56
en_core_web_trf  .50167785 .47172080 .47153926   101 157  41  without outliers (pm=LOF)
en_core_web_trf  .51863354 .49376640 .49577381    65  77  25  without outliers (pm=IF)
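The voting behaviour above can be sketched with scikit-learn's KNeighborsClassifier; the weights and algorithm settings shown are the tunable knobs mentioned in the text, and the dataset is a synthetic stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights="distance" makes closer neighbors count more in the vote;
# algorithm="auto" lets sklearn pick ball_tree / kd_tree / brute force
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", algorithm="auto")
knn.fit(X_train, y_train)

proba = knn.predict_proba(X_test)  # per-class membership probabilities
pred = knn.predict(X_test)         # plurality vote of the 5 neighbors
```

Each row of `proba` sums to one, reflecting the fraction of (weighted) neighbor votes per class.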


Local Outlier Factor


The LOF algorithm is used as an unsupervised outlier detector.
The anomaly score of each sample is called the Local Outlier Factor: it measures the local deviation of the density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to its surrounding neighborhood. More precisely, locality is given by the k-nearest neighbors, whose distances are used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.
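A minimal sketch of LOF as an unsupervised detector, on synthetic 2-D data with two planted outliers (the data and n_neighbors value are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_inliers = rng.normal(loc=0.0, scale=0.5, size=(100, 2))   # dense cluster
X_outliers = np.array([[4.0, 4.0], [-4.0, 3.5]])            # far from the cluster
X = np.vstack([X_inliers, X_outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # -1 for outliers, 1 for inliers
scores = lof.negative_outlier_factor_  # lower = locally sparser = more anomalous
```

The planted points get markedly lower `negative_outlier_factor_` scores than the cluster, which is exactly the density comparison described above.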


Isolation Forest


The IsolationForest algorithm returns the anomaly score for each sample.
It isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Since recursive partitioning can be represented by a tree structure, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
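The path-length idea can be sketched with scikit-learn's IsolationForest on the same kind of synthetic data with planted outliers (dataset and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))   # dense cluster
X_outliers = np.array([[4.0, 4.0], [-4.0, 3.5]])            # easy to isolate
X = np.vstack([X_inliers, X_outliers])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = iso.predict(X)            # -1 anomaly, 1 normal
scores = iso.decision_function(X)  # lower = shorter average path = more anomalous
```

The planted outliers sit alone in large empty regions, so random splits isolate them in few steps and their decision-function scores fall well below the cluster's.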