Recall that in unsupervised learning we don’t have a “teacher” \(Y\), i.e., we don’t have a target variable that we want to model/predict. We have already learnt one of the most important unsupervised data mining tasks, clustering, where we are interested in finding groups (clusters) of observations that are more similar or related to each other than to observations in other groups. In the following we will discuss two other unsupervised learning tasks that are also very important in the realm of data mining: Outlier Detection and Principal Component Analysis (PCA).
We have learnt that in clustering the analyst is looking for patterns that emerge as clusters of observations that are somehow more similar or related to each other than to observations in other clusters. Since clusters are formed by collections of similar observations, they can be interpreted as the usual patterns in the data. However, not all observations fit these patterns well. Observations that deviate markedly from other observations and do not fit any cluster in the data well are called outliers. Unlike clusters, outliers are interpreted as unusual patterns. Detecting these unusual patterns is important in many applications, because outliers are candidates to be anomalies. Unlike supervised anomaly detection, in which the type of anomaly to be detected is known in advance, unsupervised outlier detection targets outliers as potentially unknown anomalies yet to be discovered. For instance, they can be a new type of fraud or network intrusion not identified before, for which there are no labeled observations to train a classifier.
In the scientific literature, the term outlier is defined in slightly different ways, but with the same meaning, e.g.:
“An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” (Barnett and Lewis 1994)
“An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins 1980)
The intuition is that “normal” observations follow a “generating mechanism”, e.g. some given statistical process, while “abnormal” observations deviate from this mechanism. From a statistical perspective, the general idea is that the normal generating mechanism corresponds to a certain statistical distribution. This distribution is fit to the data under the assumption that all observations have been generated by it, and outliers are then detected as the observations that have a low probability of having been generated by the resulting distribution.
Parametric statistical approaches for unsupervised outlier detection fit parametric distributions by estimating the parameters of these distributions from the given data. For instance, assuming that the normal observations follow a univariate Gaussian distribution, the parameters (mean and standard deviation) of this distribution can be estimated from the collection of observations, and observations lying in the tails of the fitted distribution are deemed outliers. A very common rule of thumb, known as the “3\(\sigma\) rule”, is that observations deviating more than three standard deviations from the mean of a normal distribution may be considered outliers.
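To make this concrete, here is a minimal sketch of the 3\(\sigma\) rule in R; the variable names are illustrative and the data are simulated for the example only:
# Minimal sketch of the 3-sigma rule (illustrative names, simulated data):
# fit a Gaussian by estimating its mean and standard deviation from the sample,
# then flag observations lying more than 3 standard deviations away from the mean.
set.seed(123)
x <- c(rnorm(n = 100, mean = 10, sd = 1), 20)  # 100 "normal" observations plus one far away
z <- abs(x - mean(x)) / sd(x)                  # standardised deviation of each observation
which(z > 3)                                   # indices of observations flagged as outliers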
It is worth noticing that, in the unsupervised setting, when an observation is detected as an outlier because it deviates from the usual pattern, it doesn’t necessarily mean that the outlier has not been generated by the same mechanism as the other observations. It only means that it is unlikely that this outlier has been generated by the same mechanism, or in other words, it is likely that it has been generated by a different mechanism; “how likely” is a measure or score of outlyingness. Observations with high outlier scores are not necessarily anomalies (frauds, computer intrusions, genetic mutations, etc.), they are suspicious though and, therefore, they may deserve further investigation.
The parametric statistical approach for outlier detection mentioned above, although intuitive, has a number of limitations. Most notably: (i) methods following this approach assume a certain distribution (e.g. Gaussian), and this assumption may be violated in practical applications; (ii) the parameters of the distribution will be affected by the outliers themselves, including any genuine anomalies that cannot be removed prior to parameter fitting as they are not known in advance (there are no labels); and (iii) methods following this approach are usually limited to problems with a few variables (low dimensional data sets), often only one or two.
For this reason, in the following we will focus on modern, non-parametric methods for outlier detection that can handle large and high-dimensional databases without making any assumption about the generating data distribution.
KNN Outlier is a very simple and often effective method that assigns to each observation, as its degree of outlyingness, the distance from that observation to its \(k\)th nearest neighbour. The intuition is that observations inside clusters (so-called inliers) will have small distances to their nearest neighbours, and therefore a low outlier score, while observations that do not fall within clusters will have larger distances to their nearest neighbours, and therefore a high outlier score. As an example, let’s consider the data set in Fig. 1, where there are four clusters of varied densities. Notice that some observations have been generated in the tails of the corresponding cluster distributions and, therefore, do not fit the clusters well. They are outliers.
library(ggplot2)
set.seed(0)
x11 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 1 (x1 coordinate)
x21 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 1 (x2 coordinate)
x12 <- rnorm(n = 100, mean = 20, sd = 1) # Cluster 2 (x1 coordinate)
x22 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 2 (x2 coordinate)
x13 <- rnorm(n = 100, mean = 15, sd = 3) # Cluster 3 (x1 coordinate)
x23 <- rnorm(n = 100, mean = 25, sd = 3) # Cluster 3 (x2 coordinate)
x14 <- rnorm(n = 50, mean = 25, sd = 1) # Cluster 4 (x1 coordinate)
x24 <- rnorm(n = 50, mean = 25, sd = 1) # Cluster 4 (x2 coordinate)
dat <- data.frame(x1 = c(x11,x12,x13,x14), x2 = c(x21,x22,x23,x24))
( g0a <- ggplot() + geom_point(data=dat, mapping=aes(x=x1, y=x2), shape = 19) )
Fig. 1. Illustrative data set: 4 clusters following multi-normal distributions
The KNN Outlier scores can be easily and efficiently computed using, for example, function kNNdist() from package dbscan. For every observation in the data set, this function returns the distance of the observation to its 1st, 2nd, …, and \(k\)th nearest neighbours. For instance, kNNdist(x=dat, k = 4) will return a matrix with \(k = 4\) columns and the same number of rows as the data set dat. Each row corresponds to an observation and the \(l\)th column contains the distance of that observation to its \(l\)th nearest neighbour. Therefore, the piece of code kNNdist(x=dat, k = 4)[i,] returns the distances of the \(i\)th observation to its 1st, 2nd, …, and \(k\)th nearest neighbours, whereas kNNdist(x=dat, k = 4)[,k] returns the distance of every observation to its \(k\)th nearest neighbour, which are the KNN Outlier scores:
library(dbscan)
k <- 4 # KNN parameter
KNN_Outlier <- kNNdist(x=dat, k = k)[,k] # KNN distance (outlier score) computation
The following code sorts the observations according to their KNN Outlier scores and displays the top 20 outliers along with their scores:
top_n <- 20 # No. of top outliers to be displayed
rank_KNN_Outlier <- order(x=KNN_Outlier, decreasing = TRUE) # Sorting (descending)
KNN_Result <- data.frame(ID = rank_KNN_Outlier, score = KNN_Outlier[rank_KNN_Outlier])
head(KNN_Result, top_n)
## ID score
## 1 237 4.423726
## 2 208 4.177243
## 3 228 3.411198
## 4 240 2.709398
## 5 218 2.518054
## 6 243 2.464709
## 7 262 2.380678
## 8 251 2.303902
## 9 299 2.127866
## 10 203 2.092628
## 11 233 2.079155
## 12 281 2.078306
## 13 229 2.071261
## 14 191 2.023404
## 15 267 2.012530
## 16 231 1.967966
## 17 291 1.956565
## 18 256 1.845032
## 19 248 1.832962
## 20 300 1.831324
We can see that the most outlying observation is observation 237, with outlier score = 4.42, followed by observation 208 with outlier score = 4.17, and so on. Since this data set is 2D, we can visualise the top 20 outliers (highlighted in red) along with their IDs in Fig. 2.
g <- g0a +
geom_point(data=dat[rank_KNN_Outlier[1:top_n],], mapping=aes(x=x1,y=x2), shape=19, color="red", size=2) +
geom_text(data=dat[rank_KNN_Outlier[1:top_n],],
mapping=aes(x=(x1-0.5), y=x2, label=rank_KNN_Outlier[1:top_n]), size=2.5)
g
Fig. 2. KNN Outlier (\(k = 4\)): red points are the top 20 outliers, with their IDs
Fig. 2 shows that the most obviously outlying observations from a visual perspective (observations 237 and 208, called global outliers) have indeed been detected as the top 2, and most of the others in the top 20 list do not lie within clusters; instead, they are in the vicinity of clusters (the so-called local outliers). This is in line with our expectations. However, notice that: (a) at least two of the top 20 outliers clearly fall within a cluster (observations 300 and 256), which means that they are actually inliers (false positives); and (b) except for observation 191, all the local outliers ranked within the top 20 are in the vicinity of the large, sparse cluster. Local outliers in the vicinity of the other clusters have been missed in the top 20 list. The reason is that KNN Outlier is a global method, in the sense that the degree of outlyingness is computed taking the whole data set as reference. As a consequence, since the large cluster is sparser than the others, its inliers may actually have outlier scores that are larger than those of the local outliers in the vicinity of the smaller, denser clusters. This can be visualised in Fig. 3, where a circle has been drawn centered at each observation, with radius proportional to the outlier score of that observation.
g0b <- ggplot() + geom_point(data=dat, mapping=aes(x=x1, y=x2), shape = 19, size = 0.1)
g2 <- g0b +
geom_point(data=dat, mapping=aes(x=x1, y=x2, size = KNN_Outlier), shape = 1, color = "red") +
scale_size_continuous(range = c(0.1, 20))
g2
Fig. 3. KNN Outlier (\(k = 4\)): circle radii are proportional to outlier scores
Fig. 3 shows that, as a global outlier detection method, KNN may have difficulty detecting local outliers when there are clusters of varied densities in the data.
Exercise 1: There is a variant of the KNN Outlier method, the so-called Weighted KNN Outlier, which computes the outlier score of an observation as the average distance from the observation to its \(k\) nearest neighbours, rather than the distance to its \(k\)th nearest neighbour. This way, the outlier score of an observation depends on \(k\) other observations, rather than only one. Using function kNNdist(), modify the code above to compute the Weighted KNN Outlier scores for the same data set in Fig. 1, and reproduce Figures 2 and 3 from these scores.
Exercise 2: Consider the Page Blocks data set, which is available from the UCI Data Repository (the data repository of the University of California, Irvine). Originally, this is a classification data set, where “the problem consists of classifying all the blocks of the page layout of a document that has been detected by a segmentation process”. The types of blocks in document pages relate to an important step in document analysis, namely how to separate text from pictures or graphics. Since the data is highly imbalanced, with one of the classes (class 1) largely dominating the rest (it represents nearly 90% of the observations), this data set has been used in the scientific literature for performance assessment of outlier detection algorithms (Campos et al. 2016). In this context, class 1 (“text”) is usually taken as the inliers (4913 observations), whereas classes 2, 3, 4, and 5 together (“horizontal line”, “graphic”, “vertical line”, “picture”) are considered outliers (560 observations). Semantically, if the block content is text, it is labeled as an inlier, otherwise it is labeled as an outlier. In total there are 5473 observations, of which 560 (10.23%) are outliers. The following code reads the data set, separates the predictors (variables 1 to 10) from the class labels (variable 11), and transforms the class labels so that inliers are labeled as 0 whereas outliers are labeled as 1. You are asked to: (a) run KNN Outlier for different values of the parameter \(k\) from 1 to 50, using only the 10 predictor variables (PB_Predictors in the code below); for each value of \(k\), take the outlier scores and apply a binary transformation so that the 560 top-ranked observations are labeled as outliers, while the others are labeled as inliers according to the unsupervised KNN Outlier algorithm; (b) compute the precision@n measure, i.e., the fraction of the top \(n = 560\) ranked observations that are true outliers according to the ground truth class labels (those for which PB_class in the code below takes value 1); a sketch of this computation is given right after the code below. Ideally, if all the 560 true outliers are ranked above the inliers according to the outlier scores, precision@n will be 1; in the other extreme, if the top 560 observations according to the outlier scores are all inliers according to the ground truth, precision@n will be zero. Plot precision@n as a function of \(k\); (c) notice that the KNN distances can be dominated by variables ranging within wider intervals, so it may be a good idea to normalise the variables before any processing. For instance, summary(PageBlocks) shows that variable V3 ranges within \([7, 143993]\), whereas V5 ranges within \([0.052, 1]\); this way, the latter will have very little (if any) influence on the results. Repeat items (a) and (b) above after rescaling the predictors as PB_Predictors <- scale(PageBlocks[,1:10]). Do the precision@n values improve after normalisation?
PageBlocks <- read.table("page-blocks.csv", header=FALSE, sep="", dec=".")
str(PageBlocks) # 10 Predictors (V1 to V10) and class labels (V11)
## 'data.frame': 5473 obs. of 11 variables:
## $ V1 : int 5 6 6 5 6 5 6 5 5 5 ...
## $ V2 : int 7 7 18 7 3 8 4 6 5 7 ...
## $ V3 : int 35 42 108 35 18 40 24 30 25 35 ...
## $ V4 : num 1.4 1.17 3 1.4 0.5 ...
## $ V5 : num 0.4 0.429 0.287 0.371 0.5 0.55 0.417 0.333 0.4 0.486 ...
## $ V6 : num 0.657 0.881 0.741 0.743 0.944 1 0.708 0.333 0.52 0.914 ...
## $ V7 : num 2.33 3.6 4.43 4.33 2.25 2.44 2.5 10 10 8.5 ...
## $ V8 : int 14 18 31 13 9 22 10 10 10 17 ...
## $ V9 : int 23 37 80 26 17 40 17 10 13 32 ...
## $ V10: int 6 5 7 3 4 9 4 1 1 2 ...
## $ V11: int 1 1 1 1 1 1 1 1 1 1 ...
PB_Predictors <- PageBlocks[,1:10] # 10 Predictors (V1 to V10)
PB_class <- PageBlocks[,11] # Class labels (V11)
PB_class <- ifelse(PB_class == 1,0,1) # Inliers (class "1") = 0, Outliers (classes "2", "3", "4", "5") = 1
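For item (b) of Exercise 2, the precision@n measure itself can be computed with a short helper function. The following is a minimal sketch; the names precision_at_n, score and n_out are illustrative (they do not come from any package), and the commented usage line assumes the KNN Outlier scores are computed as shown earlier:
precision_at_n <- function(score, labels, n_out) {
  top_ids <- order(score, decreasing = TRUE)[1:n_out] # top-ranked observations by outlier score
  sum(labels[top_ids] == 1) / n_out                   # fraction of those that are true outliers
}
# Example usage for a single value of k (the exercise asks for k = 1, ..., 50):
# score <- kNNdist(x = PB_Predictors, k = 10)[,10]
# precision_at_n(score, PB_class, n_out = 560)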
Local Outlier Factor (LOF) is a more sophisticated algorithm designed to overcome the main limitation of KNN Outlier discussed above: it is able to detect local outliers even in the presence of clusters with varied densities. LOF is a local method because it takes only a neighbourhood around each observation, rather than the whole data set, as the reference for measuring outlyingness. The neighbourhood is defined precisely by the observation’s \(k\) nearest neighbours. LOF uses the distances from the observation to its \(k\) nearest neighbours to estimate the density around the observation, and it estimates the densities around the neighbours, as well as around the neighbours of the neighbours, in an analogous way. The intuition is that, for an observation inside a cluster (i.e., an inlier), the density around the observation is very similar to the densities around its neighbours, just as the density around each neighbour is similar to the densities around that neighbour’s own neighbours. For an outlying observation, in particular a local outlier whose neighbours tend to be part of a cluster, there is a clear asymmetry: the density around the observation is much lower than the densities around its neighbours (big difference), whereas the density around each neighbour is similar to the densities around its own neighbours (small difference).
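Before using the package implementation, this intuition can be made concrete with a simplified sketch. The following is not the exact LOF definition (which is based on reachability distances rather than raw kNN distances); it just estimates the density around each observation as the inverse of its mean kNN distance and compares it with the densities around the observation’s neighbours:
# Simplified density-ratio sketch of the LOF idea (not the exact LOF definition).
library(dbscan)
nn   <- kNN(x = dat, k = 4)           # ids and distances of the 4 nearest neighbours
dens <- 1 / rowMeans(nn$dist)         # crude local density estimate for each observation
ratio <- sapply(seq_len(nrow(dat)), function(i) mean(dens[nn$id[i,]]) / dens[i])
head(sort(ratio, decreasing = TRUE))  # inliers score around 1, outliers well above 1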
An interesting property of LOF scores is that, unlike KNN Outlier scores, which are only meaningful in relative terms, LOF scores are also interpretable in absolute terms. In particular, LOF scores of inliers inside a cluster following a uniform distribution are expected to be around 1. LOF scores smaller than 1 suggest inliers lying deep in denser regions inside a cluster. LOF scores larger than 1 suggest observations outside clusters: values slightly larger than 1 typically correspond to observations in the vicinity of clusters, larger values to local outliers, and values significantly larger than 1 (typically above 2 or 3) to global outliers.
LOF is also available in package dbscan. The following code uses function lof() to compute the LOF scores for the data set in Fig. 1:
k <- 4 # LOF parameter
LOF_Outlier <- lof(x=dat, k = k) # LOF (outlier score) computation
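Because LOF scores are interpretable in absolute terms, it can be instructive to inspect their overall distribution before ranking; the cut points below are just an illustrative choice:
summary(LOF_Outlier)                                         # most inliers should score close to 1
table(cut(LOF_Outlier, breaks = c(0, 1, 1.25, 1.5, 2, Inf))) # only a few observations clearly above 2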
The following code sorts the observations according to their LOF Outlier scores and displays the top 20 outliers along with their scores:
top_n <- 20 # No. of top outliers to be displayed
rank_LOF_Outlier <- order(x=LOF_Outlier, decreasing = TRUE) # Sorting (descending)
LOF_Result <- data.frame(ID = rank_LOF_Outlier, score = LOF_Outlier[rank_LOF_Outlier])
head(LOF_Result, top_n)
## ID score
## 1 237 2.916624
## 2 191 2.908333
## 3 53 2.615192
## 4 130 2.539595
## 5 63 2.439944
## 6 10 2.402330
## 7 145 2.289441
## 8 54 2.091639
## 9 59 1.998936
## 10 333 1.992135
## 11 228 1.944133
## 12 104 1.840832
## 13 77 1.801824
## 14 188 1.766500
## 15 208 1.753433
## 16 168 1.726824
## 17 41 1.663926
## 18 331 1.662891
## 19 6 1.619528
## 20 15 1.616930
We can see that the most outlying observation is observation 237, with outlier score = 2.91, followed by observation 191 with outlier score = 2.90, and so on. Since this data set is 2D, we can visualise the top 20 outliers (highlighted in red) along with their IDs in Fig. 4.
g3 <- g0a +
geom_point(data=dat[rank_LOF_Outlier[1:top_n],],mapping=aes(x=x1, y=x2),shape=19,color="red",size=2) +
geom_text(data=dat[rank_LOF_Outlier[1:top_n],],
mapping=aes(x=(x1-0.5), y=x2, label=rank_LOF_Outlier[1:top_n]), size=2.5)
g3
Fig. 4. LOF (\(k = 4\)): red points are the top 20 outliers, with their IDs
Fig. 4 shows that, while the two most outlying observations according to the KNN Outlier (237 and 208) are still detected in the top 20 list by LOF, observation 208 is now ranked 15th, below local outliers such as 191, 53 and others. Moreover, the observations in the vicinity of the larger, sparse cluster are no longer detected as top outliers, because their density profiles relative to the other observations in the cluster are not unusual. Instead, a number of local outliers relative to the other clusters have been ranked at the top.
Fig. 4 suggests that LOF can detect local outliers, but it may sometimes overemphasize these outliers. A more complete picture of the LOF scores for different observations can be seen in Fig. 5, which is analogous to Fig. 3 (mutatis mutandis).
g4 <- g0b +
geom_point(data=dat, mapping=aes(x=x1, y=x2, size = LOF_Outlier), shape = 1, color = "red") +
scale_size_continuous(range = c(0.1, 20))
g4
Fig. 5. LOF (\(k = 4\)): circle radii are proportional to outlier score
Exercise 3: Repeat Exercise 2 above (Page Blocks data) just replacing KNN Outlier with LOF.
Global-Local Outlier Scores from Hierarchies (GLOSH) (Campello et al. 2015) is an outlier score that can be computed as a byproduct of the HDBSCAN* density-based clustering hierarchy, which we have studied in the previous module. The method is based on a ratio involving the density around each observation and the density of the densest point in its “closest” cluster, from a density-based connectivity perspective. For local outliers, the closest cluster in the hierarchy tends to be a density-based cluster near the observation, so GLOSH behaves locally. For a global outlier, the closest “cluster” tends to be the whole data set (the root of the hierarchy), and the method behaves globally. The score falls within the range \([0,1]\). Values close to zero indicate clear inliers, whereas values close to 1 suggest clear outliers. Global outliers are expected to be ranked above local outliers, which in turn are expected to be ranked above inliers.
GLOSH is also available in package dbscan. The following code uses function glosh() to compute the GLOSH scores for the data in Fig. 1:
MinPts <- 4 # GLOSH parameter: same as MinPts in hdbscan(), analogous to k in lof()
GLOSH_Outlier <- glosh(x=as.matrix(dat), MinPts) # GLOSH (outlier score) computation
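Since GLOSH is a byproduct of the HDBSCAN* hierarchy, the same scores are also returned by hdbscan() as part of its output (its outlier_scores component). The following cross-check is optional and should agree with glosh() when the same MinPts is used:
hdb <- hdbscan(x = as.matrix(dat), minPts = MinPts)  # HDBSCAN* with the same MinPts
all.equal(hdb$outlier_scores, GLOSH_Outlier)         # expected to be TRUE (same GLOSH scores)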
The following code sorts the observations according to their GLOSH scores and displays the top 20 outliers along with their scores:
top_n <- 20 # No. of top outliers to be displayed
rank_GLOSH_Outlier <- order(x=GLOSH_Outlier, decreasing = TRUE) # Sorting (descending)
GLOSH_Result <- data.frame(ID = rank_GLOSH_Outlier, score = GLOSH_Outlier[rank_GLOSH_Outlier])
head(GLOSH_Result, top_n)
## ID score
## 1 237 0.9640308
## 2 191 0.9224226
## 3 208 0.9136309
## 4 228 0.8899238
## 5 166 0.8804751
## 6 188 0.8795091
## 7 145 0.8757610
## 8 10 0.8744444
## 9 240 0.8580376
## 10 63 0.8504281
## 11 251 0.8446609
## 12 53 0.8408036
## 13 262 0.8375568
## 14 243 0.8345867
## 15 267 0.8277364
## 16 229 0.8186041
## 17 59 0.8173426
## 18 281 0.8159289
## 19 233 0.8147168
## 20 289 0.8048307
We can see that the most outlying observation is observation 237, with outlier score = 0.96, followed by observation 191 with outlier score = 0.92, then observation 208 with score 0.91, and so on. Since this data set is 2D, we can visualise the top 20 outliers (highlighted in red) along with their IDs in Fig. 6.
g5 <- g0a +
geom_point(data=dat[rank_GLOSH_Outlier[1:top_n],],mapping=aes(x=x1,y=x2),shape=19,color="red",size=2) +
geom_text(data=dat[rank_GLOSH_Outlier[1:top_n],],
mapping=aes(x=(x1-0.5), y=x2, label=rank_GLOSH_Outlier[1:top_n]), size=2.5)
g5
Fig. 6. GLOSH (\(MinPts = 4\)): red points are the top 20 outliers, with their IDs
A more complete picture of the GLOSH scores for different observations can be seen in Fig. 7, which is analogous to Figs. 3 and 5.
g6 <- g0b +
geom_point(data=dat, mapping=aes(x=x1, y=x2, size = GLOSH_Outlier), shape = 1, color = "red") +
scale_size_continuous(range = c(0.1, 15))
g6
Fig. 7. GLOSH (\(MinPts = 4\)): circle radii are proportional to outlier score
Since the GLOSH scores fall within the interval \([0,1]\), the contrast between inliers and outliers in Fig. 7 can be improved by rescaling the scores to span the full interval \([0,1]\) and then squaring or cubing them, which shrinks values close to 0 much more than values close to 1; the circle radii are then made proportional to the transformed scores. The result (using cubed scores) is visualised in Fig. 8.
GLOSH_Outlier_Transf <- ( (GLOSH_Outlier - min(GLOSH_Outlier)) / (max(GLOSH_Outlier) - min(GLOSH_Outlier)) )^3
g7 <- g0b +
geom_point(data=dat, mapping=aes(x=x1, y=x2, size = GLOSH_Outlier_Transf), shape = 1, color = "red") +
scale_size_continuous(range = c(0.1, 15))
g7
Fig. 8. GLOSH (\(MinPts = 4\)): circle radii rescaled and transformed
Fig. 9 shows the same visualisation, but now for the GLOSH scores computed using a larger value of \(MinPts\). It can be seen that \(MinPts\) works as a smoothing factor, as it does for HDBSCAN*.
MinPts <- 15 # GLOSH parameter
GLOSH_Outlier <- glosh(x=as.matrix(dat), MinPts) # GLOSH (outlier score) computation
GLOSH_Outlier_Transf <- ( (GLOSH_Outlier - min(GLOSH_Outlier)) / (max(GLOSH_Outlier) - min(GLOSH_Outlier)) )^3
g9 <- g0b +
geom_point(data=dat, mapping=aes(x=x1, y=x2, size = GLOSH_Outlier_Transf), shape = 1, color = "red") +
scale_size_continuous(range = c(0.1, 15))
g9
Fig. 9. GLOSH (\(MinPts = 15\)): circle radii rescaled and transformed
Another interesting visualisation is displayed in Fig. 10, where the top 50 outliers are identified by their outlier ranks (“1” means the most outlying, “2” the 2nd most outlying, etc.), rather than their IDs.
top_n <- 50 # No. of top outliers to be displayed
rank_GLOSH_Outlier <- order(x=GLOSH_Outlier, decreasing = TRUE) # Sorting (descending)
g8 <- g0a +
geom_point(data=dat[rank_GLOSH_Outlier[1:top_n],],mapping=aes(x=x1,y=x2),shape=19,color="red",size=2) +
geom_text(data=dat[rank_GLOSH_Outlier[1:top_n],],
mapping=aes(x=(x1-0.5), y=x2, label=c(1:top_n)), size=2.5)
g8
Fig. 10. GLOSH (\(MinPts = 15\)): red points are the top 50 outliers, with their outlier ranks
Exercise 4: Repeat Exercise 2 above (Page Blocks data) just replacing KNN Outlier with GLOSH. Note: use \(MinPts = 2, \cdots, 50\), given that function glosh() from the dbscan package is not defined for \(MinPts = 1\).
Principal component analysis (PCA) is a classic multivariate analysis technique that is used for different purposes in the context of data mining, most notably dimensionality reduction and data visualisation. Details on the mathematical formulation and statistical interpretations of PCA are discussed in other subjects of this course. Here, we focus on the intuition behind PCA and its use in the context of data mining. The reader is referred to Chapter 10 of (James et al. 2013) for further details (see Recommended Reading).
The basic idea behind PCA is very simple: learn, in an unsupervised way, a rotation of the coordinate system such that most of the variability in the data is captured by a few coordinates, so that the data can be approximately described by those coordinates alone. For the sake of illustration, let’s consider a 2-dimensional example:
set.seed(1)
x1 <- runif(n = 100, min = 0, max = 50)
x2 <- x1 + rnorm(n = 100, mean = 0, sd = 1)
dat <- data.frame(x=x1, y=x2)
(g_PCA <- ggplot() + geom_point(data=dat, mapping=aes(x=x, y=y), shape = 19))
Fig. 11. Two dimensional data: highly correlated variables
In the scatter plot of Fig. 11 we can see that the two variables, x1 and x2, are highly correlated. Indeed, since x2 is just x1 with some additive noise, it is intuitive that a single variable should suffice to capture the most important aspect of the data, namely that, apart from the noise, the two variables lie along the line x2 = x1. We can use function prcomp() from the stats package in base R to compute PCA for this data set:
PCA <- prcomp(dat, scale = TRUE)
PCA$rotation
## PC1 PC2
## x 0.7071068 -0.7071068
## y 0.7071068 0.7071068
The rotation component of the prcomp() output stores the resulting (rotated) coordinate system. Each column represents a vector along the corresponding coordinate direction, and the columns of rotation are therefore orthogonal. Notice that, since the original data space is 2-dimensional, we have only 2 principal components (coordinates PC1 and PC2). We can visualise these principal components as a rotated coordinate system, where the 1st principal component is represented in blue and the 2nd principal component in red (both rescaled to allow visualisation):
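The plotting code behind Fig. 12 is not shown here; a minimal sketch that overlays the two principal directions on the scatter plot could look as follows. The segment length is an arbitrary choice for visualisation only, and since PCA was computed on standardised variables the overlay on the raw data is only approximate (the two variables happen to have nearly the same spread in this example):
ctr <- colMeans(dat)                              # centre of the data
len <- 20                                         # arbitrary half-length, for visualisation only
pc1 <- PCA$rotation[,1]; pc2 <- PCA$rotation[,2]  # principal directions (unit vectors)
g_PCA +
  annotate("segment", x = ctr["x"] - len*pc1["x"], y = ctr["y"] - len*pc1["y"],
           xend = ctr["x"] + len*pc1["x"], yend = ctr["y"] + len*pc1["y"], color = "blue") +
  annotate("segment", x = ctr["x"] - len*pc2["x"], y = ctr["y"] - len*pc2["y"],
           xend = ctr["x"] + len*pc2["x"], yend = ctr["y"] + len*pc2["y"], color = "red")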
Fig. 12. Two dimensional data and PCA coordinates
The 1st component, shown in blue in Fig. 12, is the direction along which the variance of the data is maximum. The 2nd component is the direction along which the variance is maximum subject to being orthogonal to the 1st component (and so on, if there were more than two components). In this case we have only two components, so there is only one direction orthogonal to the 1st component (apart from the sign, positive or negative, which does not affect the analysis), and this determines the 2nd component, displayed in red.
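A quick numerical check of these properties (shown only as a sketch):
round(t(PCA$rotation) %*% PCA$rotation, digits = 10)  # 2x2 identity: the directions are orthonormal
apply(PCA$x, 2, var)                                  # variance of the data along PC1 and PC2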
This ordering of the principal components, by the variance of the data along the corresponding directions, is quantified by the component sdev, which stores the standard deviation of the data along each of the principal components returned by prcomp(). Squaring it, sdev^2, gives the respective variances, and dividing these by their total sum gives the Proportion of Variance Explained (PVE) by each component:
(PVE <- (PCA$sdev^2)/sum(PCA$sdev^2))
## [1] 0.998795182 0.001204818
We can see that the 1st component explains about 99.88% of the variance, whereas the 2nd component explains the remaining 0.12%. Intuitively, we should therefore expect that, if we disregard the 2nd component, representing the data by the 1st component (PC1) only should incur minimal loss. Interestingly, it can be shown that the 1st component is the line (1-dimensional hyperplane) such that the sum of the squared differences between the original data and the data as projected onto that line is minimal (notice that this is related to, but not the same as, the least squares regression of x2 as a function of x1 or vice versa). In general, for an \(n\)-dimensional data set, the first \(m < n\) principal components define an \(m\)-dimensional hyperplane such that the sum of the squared differences between the original data and the data as projected onto that hyperplane is minimal.
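As an illustration of this minimal-loss property, the (standardised) data can be reconstructed from the 1st component only and compared with the original; a minimal sketch:
recon_1D <- PCA$x[, 1, drop = FALSE] %*% t(PCA$rotation[, 1, drop = FALSE]) # rank-1 reconstruction from PC1
orig_std <- scale(dat)        # the data as seen by prcomp(..., scale = TRUE)
mean((orig_std - recon_1D)^2) # small mean squared residual: PC1 captures ~99.9% of the variance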
There are at least two important applications of PCA in data mining. The first obvious application is dimensionality reduction as a preprocessing step prior to the application of other (possibly supervised) algorithms that can benefit from operating on a lower dimensional space, for a variety of reasons that relate to the different challenges associated with high dimensional spaces.
Another important application of PCA in data mining is visualisation. For \(n\)-dimensional data sets where \(n>3\), the data cannot be directly visualised in the original coordinates space. If the first 2 or 3 components can accurately represent the data, containing most of the PVE, then the corresponding 2D or 3D plot can be used as a reasonable, approximate visualisation of the data.
As an example, let’s consider once again the Page Blocks data set from the UCI Data Repository, as well as its representation as a binary outlier detection problem (inlier = “0” vs outlier = “1”).
PageBlocks <- read.table("page-blocks.csv", header=FALSE, sep="", dec=".")
PB_Predictors <- scale(PageBlocks[,1:10]) # 10 Predictors (V1 to V10) normalised
PB_class <- PageBlocks[,11] # Class labels (V11)
PB_class <- ifelse(PB_class == 1,0,1) # Inliers (class "1") = 0, Outliers (classes "2", "3", "4", "5") = 1
By applying PCA to the 10-dimensional space of the predictors, we can see that the first 2 components alone explain almost 60% of the variance:
PCA_PageBlocks <- prcomp(PB_Predictors, scale = TRUE)
(PVE <- (PCA_PageBlocks$sdev^2)/sum(PCA_PageBlocks$sdev^2))
## [1] 0.417521176 0.161574969 0.146737436 0.096284858 0.064793304
## [6] 0.042132200 0.034274913 0.026323278 0.008870243 0.001487621
It is interesting to visualise the data in these two principal components, PC1 and PC2, with the ground truth outliers (according to the class labels) highlighted in blue:
PCA_2D_dat <- as.data.frame(PCA_PageBlocks$x[,1:2])
g_PCA <- ggplot() + geom_point(data=PCA_2D_dat, mapping=aes(x=PC1, y=PC2), shape = 19)
(g_ground_truth <- g_PCA +
geom_point(data=PCA_2D_dat[PB_class==1,], mapping=aes(x=PC1,y=PC2), shape=19, color="blue", size=2))
Fig. 13. Page Blocks data set (2 principal components): ground truth outliers are highlighted in blue
Fig. 13 suggests that the data, at least as approximated by the first 2 principal components, follows a “comet-like” shape, consisting of a dense cluster with a tail of scattered outliers. We can notice that a large fraction of these scattered outliers in the tail are indeed outliers according to the class labels (in blue). However, many observations that are labeled as outliers according to the class labels actually fall deep inside the cluster, and as such they are unlikely to be detected as outliers by an unsupervised outlier detection method if this is a trustworthy representation of the full 10-dimensional data space. This probably explains why unsupervised outlier detection methods such as KNN Outlier, LOF and GLOSH cannot reach precision@n values much higher than 0.5 for this data set.
It is an interesting exercise to visualise what the observations ranked as top outliers by these unsupervised methods look like in the PC1 vs PC2 space. Figure 14 shows the result for LOF with \(k = 50\).
k <- 50
LOF_Outlier <- lof(x=PB_Predictors, k = k)
rank_LOF_Outlier <- order(x=LOF_Outlier, decreasing = TRUE)
top_n <- 560
(g_LOF_top_n <- g_PCA +
geom_point(data=PCA_2D_dat[rank_LOF_Outlier[1:top_n],],mapping=aes(x=PC1,y=PC2),shape=19,color="red1",size=2))
Fig. 14. Page Blocks data set (2 principal components): LOF top 560 outliers highlighted in red
Notice in Fig. 14 that most of the visually outlying observations in the “comet tail” have indeed been ranked within the top 560 according to the LOF scores, but since there aren’t 560 observations in the tail (which is the number of outliers according to the ground truth), part of the LOF top 560 candidates inevitably fall inside the cluster. It is then interesting to visualise only those observations that have been given a LOF score significantly higher than 1 (clear outliers according to LOF). Fig. 15 shows in red the observations with LOF scores higher than 2, whereas Fig. 16 shows those with LOF scores higher than 3. As expected, we can see that most of the visually outlying observations in the “comet tail” have LOF scores greater than 2, and those further away at the top right corner have even higher LOF scores, greater than 3.
(g_LOF_geq_2 <- g_PCA +
geom_point(data=PCA_2D_dat[LOF_Outlier>=2,], mapping=aes(x=PC1,y=PC2), shape=19, color="red2", size=2))
Fig. 15. Page Blocks data set (2 principal components): LOF scores >= 2 highlighted in red
(g_LOF_geq_3 <- g_PCA +
geom_point(data=PCA_2D_dat[LOF_Outlier>=3,], mapping=aes(x=PC1,y=PC2), shape=19, color="red3", size=2))
Fig. 16. Page Blocks data set (2 principal components): LOF scores >= 3 highlighted in red
Exercise 5: Repeat the experiment above, producing plots analogous to those in Figs. 14 to 16, but with GLOSH rather than LOF. Use GLOSH Scores higher than 0.9 and 0.95 as a counterpart for LOF Scores higher than 2 and 3, respectively.
Exercise 6: Apart from their comparison with Fig. 13, which wouldn’t be possible in a practical application of unsupervised outlier detection as the class labels in Fig. 13 would not be available, Figs. 14 to 16 themselves suggest that \(k = 50\) is a reasonable choice for the LOF parameter in the Page Blocks data set. Repeat the experiment above, producing plots analogous to those in Figs. 14 to 16, but with \(k = 4\). Does \(k = 4\) seem to be a reasonable choice for LOF in this data set, especially when compared with \(k = 50\)? Does this suggest that we can use PCA as an auxiliary Exploratory Data Analysis tool that can help with setting parameters in an unsupervised learning scenario?
Utilising R/RStudio, reproduce the experimental procedures and results described in Sections 10.4 and 10.6.1 of (James et al. 2013), “Lab 1: Principal Components Analysis” and “Lab 3: PCA on the NCI60 Data”. The latter involves the cancer cell line microarray data, NCI60, from the ISLR package.
Barnett, V., and T. Lewis. 1994. Outliers in Statistical Data. 3rd ed. Wiley.
Campello, R.J.G.B., D. Moulavi, A. Zimek, and J. Sander. 2015. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 10 (1): 5:1–5:51.
Campos, G. O., A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. 2016. “On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study.” Data Mining and Knowledge Discovery 30 (4): 891–927.
Hawkins, D. 1980. Identification of Outliers. Chapman and Hall.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer.
Tan, P.-N., M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley.