Recall from our first module that in unsupervised learning we don’t have a “teacher” \(Y\), i.e., in contrast to supervised learning, we don’t have a target variable that we want to model/predict. One of the most important unsupervised data mining tasks is clustering.
In clustering we are interested in finding groups (clusters) of observations that are more similar or related to each other than to observations in other groups. Cluster analysis has been a well-established field of statistics since the 1950s, when analysts were interested in finding, for instance, groups of customers with similar purchasing behaviour and/or interests. This is a classic problem in cluster analysis called market segmentation. With the advent of modern techniques that are both effective and computationally efficient at tackling large amounts of data, clustering has drawn increasing attention among data scientists. Nowadays, it finds applications in virtually all fields of knowledge (e.g. biology, astronomy, geology, archaeology, finance, and business, to mention just a few). In genetics, for example, analysts may want to find groups of genes whose expression signatures under certain conditions are related, as these may be jointly associated with a certain function (or malfunction) in a given organism.
There are different types of clustering algorithms, which can be categorized into major clustering paradigms. The most important paradigms are partitional and hierarchical clustering.
In partitional clustering, we essentially want to categorise the data observations into disjoint subsets. For instance, we may find that in a database with \(N=1000\) customers (observations), there are \(k = 3\) groups of customers with similar purchasing behaviour, such that every customer will then be assigned to one of these groups (clusters). Formally, let \(\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}\) be a data set containing \(N\) observations, \(\mathbf{x}_i\) (\(i=1, \cdots, N\)), each of which is described by \(n\) variables, i.e., \(\mathbf{x}_i = [x_{i1} x_{i2} \cdots x_{in}]\). Notice that \(\mathbf{x}_i\) is the \(i\)th row of an \(N \times n\) data matrix where rows correspond to observations and columns correspond to variables.
A partition \(\mathbf{P} = \{\mathbf{C}_1, \mathbf{C}_2, \cdots, \mathbf{C}_k\}\) of this data into \(k\) clusters is just a collection of subsets of observations, \(\mathbf{C}_j\) (\(j=1, \cdots, k\)), such that:
\(\mathbf{C}_1 \cup \mathbf{C}_2 \cup \cdots \cup \mathbf{C}_k = \mathbf{X}\)
\(\mathbf{C}_i \cap \mathbf{C}_j = \emptyset\) (\(\forall i \neq j\))
\(\mathbf{C}_i \neq \emptyset\) (\(\forall i\)).
In words: (a) the union of the subsets (clusters) is the data set, which means that every observation has been assigned to a cluster; (b) the intersection of any pair of clusters is empty, which means that no observation is assigned to more than one cluster at the same time; and (c) every cluster is non-empty, which means that each cluster has at least one observation assigned to it. For instance, given a data set with \(N=10\) observations, \(\mathbf{C}_1 = \{\mathbf{x}_3, \mathbf{x}_5\}\), \(\mathbf{C}_2 = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_7, \mathbf{x}_8, \mathbf{x}_{10}\}\), and \(\mathbf{C}_3 = \{\mathbf{x}_4, \mathbf{x}_6, \mathbf{x}_9\}\) is a valid partition into \(k=3\) groups.
Some clustering algorithms relax one or more of these constraints, for example allowing an observation to belong partially to more than one cluster, or allowing some outlying observations not to be assigned to any cluster. We will discuss examples of the latter category later in this module, but for now we start strictly following the definition of clustering partition above.
k-Means is arguably the most well-known of all clustering algorithms, and the ideas behind the algorithm date back to the 1950s (even though the most cited publication is from 1967). The algorithm is very simple and intuitive. In its original form, the algorithm assumes that all \(n\) variables that describe the data are real-valued, i.e., each observation \(\mathbf{x}_i = [x_{i1} x_{i2} \cdots x_{in}]\) is a numeric vector that can be seen as a point in an \(n\)-dimensional coordinate space.
k-Means works by adjusting \(k\) cluster “prototypes”, which are representatives of each cluster. Assuming that there are \(k\) clusters in the data, the idea is to adjust the prototypes so that each cluster is represented by one of the prototypes as well as possible. In k-Means, “as well as possible” means that the prototype of a cluster, as the best representative of that cluster, must be the point that is most similar to the observations of the cluster, where “most similar” is defined as the point that minimises the sum of squared (Euclidean) distances to the observations in the cluster. So, the overall goal is to select \(k\) points (prototypes) \(\mathbf{\bar{x}}_1, \cdots, \mathbf{\bar{x}}_k\) such that the sum of the squared distances from these points to the observations of their respective clusters (the so-called Sum of Squared Errors — SSE) is minimised:
Equation (1): \[ \large SSE = \displaystyle \sum_{i=1}^k \sum_{\mathbf{x}_j \in \mathbf{C}_i} d(\mathbf{x}_j,\mathbf{\bar{x}}_i)^2\]
It is straightforward to show that, given the subset of observations in a cluster, say \(\mathbf{C}_i\) (containing \(|\mathbf{C}_i|\) observations, where \(|\cdot|\) denotes set cardinality), the point \(\mathbf{\bar{x}}_i\) that minimises the sum of squared distances to the observations in that cluster (\(\mathbf{x}_j \in \mathbf{C}_i\)) is the multivariate mean of those observations (i.e., the mean taken independently along each of the \(n\) coordinates), the so-called cluster centroid:
Equation (2): \[ \large \mathbf{\bar{x}}_i = \frac{1}{|\mathbf{C}_i|} \cdot \displaystyle \sum_{\mathbf{x}_j \in \mathbf{C}_i} \mathbf{x}_j\]
For example, suppose that a cluster \(\mathbf{C}_i\) is composed of the following \(|\mathbf{C}_i| = 3\) observations described by \(n=2\) variables: \(\mathbf{x}_1 = [2 \;\; -3]\), \(\mathbf{x}_2 = [0 \;\; 1]\), and \(\mathbf{x}_3 = [1 \;\; -4]\), then the centroid of this cluster is:
\[\mathbf{\bar{x}}_i = \displaystyle \frac{\mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3}{3} = [1 \;\; -2]\]
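We can verify this small computation directly in R; the quick sketch below simply stacks the three observations as the rows of a matrix (the name Ci is ours):

Ci <- rbind(c(2, -3), c(0, 1), c(1, -4))  # the three observations of cluster C_i as rows
colMeans(Ci)                              # column-wise mean, i.e., the centroid (1, -2)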
The problem is that we don’t know the clusters in the first place, so how can we compute their prototypes!? Initially, these prototypes can be chosen randomly, either by choosing \(k\) random observations from the data set or by choosing \(k\) random points (which do not need to coincide with any observation) in the \(n\)-dimensional data space. Then, we assign each observation to its closest prototype, because we want to minimise the SSE in Eq. (1). Once this assignment is completed, we have \(k\) clusters, for which we can compute centroids using Eq. (2). But once we compute these centroids, we need to reassign each observation to its nearest centroid, and then the centroids may have to be adjusted again, and so on. It is possible to show that iterating the assignment of observations to cluster centroids followed by the re-computation of the centroids will eventually converge to a state where no observation changes clusters and therefore the centroids no longer change, hence the algorithm can stop. In summary, the steps are the following:
Step 1. Choose \(k\) initial prototypes (e.g. at random).
Step 2. Assign each observation to its closest prototype, forming \(k\) clusters.
Step 3. Recompute the centroid of each cluster using Eq. (2) and take these centroids as the new prototypes.
Step 4. Repeat Steps 2 and 3 until no observation changes cluster (equivalently, until the centroids no longer change).
In each step, the algorithm above will reduce (or, at least, not increase) the SSE, so it will eventually converge to a solution whose SSE cannot be further reduced by these steps. However, the algorithm does not necessarily converge to the best solution overall (the so-called globally optimal solution); it may instead converge to a local minimum, which is the best solution the algorithm can reach when starting from the prototypes initially chosen, but not necessarily the best solution possible: there is no guarantee that a better solution could not be reached from different initial prototypes.
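To make the iteration concrete, here is a minimal, illustrative sketch of the k-Means loop in base R; the function name kmeans_sketch is our own, the code is for intuition only, and in practice you would use the built-in kmeans() function introduced below:

# A minimal sketch of the k-Means iteration (for intuition only; not optimised)
kmeans_sketch <- function(X, k, max.iter = 100){
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # Step 1: k random observations as initial prototypes
  cluster <- rep(0, nrow(X))
  for (iter in 1:max.iter){
    # Step 2: assign each observation to its closest prototype (squared Euclidean distances)
    d2 <- sapply(1:k, function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    new.cluster <- max.col(-d2)
    if (all(new.cluster == cluster)) break          # Step 4: stop when no assignment changes
    cluster <- new.cluster
    # Step 3: recompute the centroid of each cluster (empty clusters are not handled in this sketch)
    centers <- t(sapply(1:k, function(j) colMeans(X[cluster == j, , drop = FALSE])))
  }
  list(cluster = cluster, centers = centers)
}

Calling, e.g., kmeans_sketch(X, k = 3) on a numeric data matrix X returns the cluster labels and the final centroids; this is the behaviour we obtain (far more efficiently and robustly) from kmeans() below.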
Let’s consider the following example, where we generate 4 clusters following multivariate normal distributions in two dimensions:
set.seed(0)
x11 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 1 (x1 coordinate)
x21 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 1 (x2 coordinate)
x12 <- rnorm(n = 100, mean = 20, sd = 1) # Cluster 2 (x1 coordinate)
x22 <- rnorm(n = 100, mean = 10, sd = 1) # Cluster 2 (x2 coordinate)
x13 <- rnorm(n = 100, mean = 15, sd = 3) # Cluster 3 (x1 coordinate)
x23 <- rnorm(n = 100, mean = 25, sd = 3) # Cluster 3 (x2 coordinate)
x14 <- rnorm(n = 50, mean = 25, sd = 1) # Cluster 4 (x1 coordinate)
x24 <- rnorm(n = 50, mean = 25, sd = 1) # Cluster 4 (x2 coordinate)
dat <- data.frame(x1 = c(x11,x12,x13,x14), x2 = c(x21,x22,x23,x24))
plot(dat$x1, dat$x2, xlim = c(5,30), ylim = c(5,35), xlab = "x1", ylab = "x2")
Fig. 1. Illustrative data set: 4 clusters following multi-normal distributions
We can run k-Means with \(k = 4\) in this data set using function kmeans() from the stats package available in base R:
set.seed(0)
km.out <- kmeans(x = dat, centers = 4)
km.out
## K-means clustering with 4 clusters of sizes 149, 100, 45, 56
##
## Cluster means:
## x1 x2
## 1 18.22719 24.864055
## 2 10.02267 9.954448
## 3 20.96405 10.425010
## 4 19.35992 9.778065
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 3 3 3
## [106] 3 4 3 4 4 3 4 4 4 3 3 4 3 3 4 3 4 4 4 3 4 3 4 4 3 4 3 4 3 4 3 3 4 4 4
## [141] 3 4 3 4 4 4 3 4 4 4 4 3 3 3 3 3 4 3 4 4 3 3 4 3 3 3 4 4 4 4 4 3 4 4 4
## [176] 3 3 3 4 4 3 4 3 4 3 3 4 4 4 4 3 3 4 4 4 4 4 3 3 4 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 5387.31217 169.41272 90.79615 73.42784
## (between_SS / total_SS = 81.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
In the summary above, we can see the resulting 4 cluster centroids, the labeling of the data observations into the corresponding clusters (labeled 1, 2, 3 and 4), and the SSE per cluster, as well as the list of other available components. For instance, tot.withinss gives the total SSE (which the algorithm tries to minimise), as defined in Eq. (1):
km.out$tot.withinss
## [1] 5720.949
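The other components can be inspected in the same way. For instance (just to illustrate):

km.out$centers   # the k = 4 cluster centroids
km.out$size      # the number of observations in each cluster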
The result can be visualised by mapping the cluster labels into colours:
plot(dat, col=(km.out$cluster+1), main="k=4: single initialisation with set.seed(0)", xlab="x1", ylab="x2")
Fig. 2. k-Means with \(k = 4\): single initialisation with set.seed(0)
Notice that the result does not correspond to the four clusters as they were generated. In fact, the two clusters at the top have been merged into a single cluster, while the bottom-right cluster has been split into two. This result does not match our perception of the natural clusters, which are visually clear in this example. The reason is that the algorithm got trapped in a local minimum, as described above. We can confirm this by running the algorithm again from a different random initialisation and noticing that the result will be very different:
set.seed(2)
km.out <- kmeans(x = dat, centers = 4)
km.out$tot.withinss
## [1] 5725.609
plot(dat, col=(km.out$cluster+1), main="k=4: single initialisation with set.seed(2)", xlab="x1", ylab="x2")
Fig. 3. k-Means with \(k = 4\): single initialisation with set.seed(2)
In fact, notice that now the bottom-left cluster has been split, rather than the right one. The question is then: how can we try to avoid these sub-optimal solutions? To answer this question, recall that the final solution depends on the initialisation of the algorithm (the initial prototypes), so the usual way to circumvent the problem is to run the algorithm from different initial prototypes. Once many candidate solutions have been derived, we can choose the best one according to the SSE in Eq. (1).
Function kmeans() provides a built-in implementation of this procedure, activated by setting the argument nstart to an integer greater than 1 (nstart = 1 is the default value, which corresponds to a single initialisation). Let’s try nstart = 10 in our example, which means running k-Means 10 times and then choosing the solution with the lowest SSE value:
set.seed(0)
km.out <- kmeans(x = dat, centers = 4, nstart = 10)
km.out$tot.withinss
## [1] 2175.64
plot(dat, col=(km.out$cluster+1), main="k=4: best out of 10 initialisations", xlab="x1", ylab="x2")
Fig. 4. k-Means with \(k = 4\): best out of 10 initialisations
Notice that the solution above is visually much better, and the corresponding SSE is less than half the ones we had obtained from the two previous individual initialisations.
Exercise: Try increasing values of nstart (1, 5, 10, …, 50) and see if solutions with even better SSE values can be reached (note: use set.seed(0) in all cases). Plot SSE versus nstart.
One of the most challenging aspects of unsupervised tasks such as clustering is the fact that we don’t have a reference or ground truth solution to assess the performance of the results. Unlike supervised tasks such as classification and regression, which allow us to assess performance by comparing the desired output \(Y\) with the predicted output \(\hat{Y}\), there is no target output in clustering. This makes it particularly difficult to perform model selection, i.e., choosing between different models or different parameter choices for the same model.
In the k-Means algorithm, model selection is required when we don’t know the number of clusters in the data, which is the case in most practical applications of clustering. This means that we don’t know how to set the parameter \(k\). One may wonder why we don’t just use the same approach that we used to select a good initialisation of the prototypes: run k-Means several times for varied \(k\) and choose the value of \(k\) that provides the best (minimum) SSE.
The problem with this approach is that the SSE tends to decrease as \(k\) increases, no matter how many clusters there actually are in the data. To understand this, think of the two extreme cases: (a) if we have only one prototype, then this prototype will be at the grand mean (centre) of the data set and the SSE will be large; and (b) if the number of prototypes is equal to the number of observations in the data set, \(k = N\), then each observation has a prototype of its own and the SSE will be zero. The other cases lie somewhere in between, with the SSE decreasing towards zero as \(k\) increases towards \(N\). We can see this once again using our illustrative example in Fig. 1.
set.seed(0)
SSE <- rep(0, 10)
for (k in 1:10){
  km.out <- kmeans(x = dat, centers = k, nstart = 10)
  SSE[k] <- km.out$tot.withinss
}
plot(1:10, SSE, xlab="k", ylab="SSE", type = "b")
Fig. 5. k-Means with \(k = 1, 2, ..., 10\): best out of 10 initialisations
Notice in Fig. 5 that, as expected, the SSE decreases with increasing \(k\). However, there is a “knee” or “elbow” at \(k = 4\), after which the decrease in SSE is much less pronounced than before. When there are clear, well-behaved natural clusters in the data, this knee is expected to be observed at around the right number of clusters, which can thereby be revealed. This is a very common procedure for choosing \(k\) in practice.
Exercise: Perform the same procedure described above, but now for the iris data set from base R (datasets package). We know that there are three classes in this data set, but two of the classes overlap each other, so we should expect to estimate the best \(k\) value around 2 or 3. Is that what you see? Notes: (a) Use only the 4 numerical variables as your data set, i.e., ignore the class labels (variable Species), which would not be available in a practical clustering application; (b) Use set.seed(0) in the beginning of your code.
Exercise: Take the best solution with \(k = 3\) and build a contingency table of the corresponding clustering labels (available in variable cluster of the object resulting from the call to function kmeans()) against the true class labels (variable iris$Species). Is there a strong correspondence between the clusters (detected in a completely unsupervised way) and the 3 classes?
Visually identifying a “knee” in plots like Fig. 5 requires human intervention and is not always easy. There are alternative criteria that do not exhibit the monotonically decreasing behaviour as a function of \(k\) that makes the SSE unsuitable for this purpose, so these alternative criteria can be used to directly compare clustering solutions with different numbers of clusters, in order to select the best one (according to the criterion used). One of the most widely used such criteria is the Silhouette Width Criterion.
The Silhouette Width Criterion (SWC) essentially assesses the quality of the assignment of an observation to its cluster, by measuring the difference between the average distance from this observation to the observations in the nearest other cluster (which should be large) and the average distance from this observation to the other observations in its own cluster (which should be small). If the observation is assigned to its natural cluster, this difference is expected to be high. The silhouette of an observation is just a normalisation of this difference to \([-1,+1]\), and the silhouette of an entire clustering solution is just the mean of the individual silhouettes of all observations, which therefore also lies within \([-1,+1]\). The closer to +1, the better the clustering solution.
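More precisely, using the standard definition, the silhouette of an observation \(\mathbf{x}_i\) is

\[ s(\mathbf{x}_i) = \frac{b(\mathbf{x}_i) - a(\mathbf{x}_i)}{\max\{a(\mathbf{x}_i),\, b(\mathbf{x}_i)\}} \]

where \(a(\mathbf{x}_i)\) is the average distance from \(\mathbf{x}_i\) to the other observations in its own cluster and \(b(\mathbf{x}_i)\) is the smallest average distance from \(\mathbf{x}_i\) to the observations of any other cluster (by convention, \(s = 0\) for observations in singleton clusters).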
The individual silhouettes of observations for a given clustering solution can be computed using function silhouette() from package cluster. The following handy function takes the clustering labels clusterLabels obtained for the data set dataPoints and returns the SWC value of that clustering solution.
SWC <- function(clusterLabels, dataPoints){
  require(cluster)  # provides silhouette()
  sil <- silhouette(x = clusterLabels, dist = dist(dataPoints))
  return(mean(sil[, 3]))  # the 3rd column holds the individual silhouette widths
}
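For example, the SWC of a \(k = 4\) k-Means solution on our illustrative data could be computed as follows (km4 is just an illustrative name):

km4 <- kmeans(x = dat, centers = 4, nstart = 10)
SWC(clusterLabels = km4$cluster, dataPoints = dat)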
By using this function, we can repeat the experiment in Fig. 5, but now plotting the SWC criterion for the k-Means solutions with varied \(k\), rather than the SSE criterion. Note that, in this case, we start \(k\) from 2, as the silhouette is not defined for \(k = 1\). The maximum value of SWC is reached at \(k = 4\), which is exactly the number of clusters in the data set in Fig. 1.
set.seed(0)
Silhouette <- rep(0, 10)
for (k in 2:10){
  km.out <- kmeans(x = dat, centers = k, nstart = 10)
  Silhouette[k] <- SWC(clusterLabels = km.out$cluster, dataPoints = dat)
}
plot(2:10, Silhouette[2:10], xlab="k", ylab="Silhouette Width Criterion (SWC)", type = "b")
Fig. 6. k-Means with \(k = 2, ..., 10\): best out of 10 initialisations
Exercise: Perform the same procedure described above (now using the SWC criterion instead of the SSE) for the iris data set from base R (datasets package). We know that there are three classes in this data set, but two of the classes overlap each other, so we should expect to estimate the best \(k\) value around 2 or 3. Is that what you see? Notes: (a) Use only the 4 numerical variables as your data set, i.e., ignore the class labels (variable Species), which would not be available in a practical clustering application; (b) Use set.seed(0) in the beginning of your code.
As opposed to partitional clustering, Hierarchical Clustering methods produce a complete collection of clustering solutions with varied numbers of clusters, rather than a single solution. In particular, Agglomerative Hierarchical Clustering (HAC) methods, which are the most well-known and widely used in practice, produce a complete collection of partitions with the number of clusters varying from \(k = N\) (every observation in the data set is a cluster on its own — a so-called singleton) to \(k=1\) (a single “cluster” containing the whole data set), and these partitions are hierarchically related so that the partition with \(k\) clusters is contained in the partition with \(k-1\) clusters. A partition \(\mathbf{P}_k\) is contained in another partition \(\mathbf{P}_{k-1}\) if every cluster in \(\mathbf{P}_{k}\) is a subset of some cluster in \(\mathbf{P}_{k-1}\). For instance, \(\mathbf{P}_3 =\{ (\mathbf{x}_3, \mathbf{x}_5), (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_7, \mathbf{x}_8, \mathbf{x}_{10}), (\mathbf{x}_4, \mathbf{x}_6, \mathbf{x}_9) \}\) is contained in \(\mathbf{P}_2 =\{ (\mathbf{x}_3, \mathbf{x}_5, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_7, \mathbf{x}_8, \mathbf{x}_{10}), (\mathbf{x}_4, \mathbf{x}_6, \mathbf{x}_9) \}\), and in this case we say that the clusters in these partitions are nested.
HACs build a hierarchical tree of nested partitions in a bottom-up fashion. The general HAC algorithm is very simple and intuitive:
Step 1. Start with every observation being a cluster on its own (i.e., a singleton), which is the so-called trivial clustering solution, where \(k = N\).
Step 2. Merge the pair of clusters that are the most similar (least dissimilar) according to a given dissimilarity measure between clusters.
Step 3. Repeat Step 2 until there is only one “cluster” containing the whole data set.
Notice in Step 2 that, since two clusters are merged, the number of clusters is reduced by one, i.e., a partition \(\mathbf{P}_k\) with \(k\) clusters is transformed into a partition \(\mathbf{P}_{k-1}\) with \(k-1\) clusters, and \(\mathbf{P}_k\) is contained in \(\mathbf{P}_{k-1}\). The algorithm thus produces \(N\) partitions, with \(k = N, N-1, \cdots, 2, 1\) clusters, respectively.
The only thing that needs to be specified in the above HAC algorithm is the dissimilarity measure between clusters. This is essentially what gives rise to different HAC methods. In the following we discuss classic HAC instances that are most commonly used in practice.
In the Single-Linkage (SL) algorithm, the distance between two clusters of observations, \(\mathbf{C}_1\) and \(\mathbf{C}_2\), is defined as the smallest pairwise distance between these clusters, i.e., the smallest distance of a pair of observations where one observation is in \(\mathbf{C}_1\) and the other observation is in \(\mathbf{C}_2\). The pairwise distance between observations can be any distance suitable for the application at hand, e.g. Euclidean, and this is all the algorithm needs to operate (unlike k-Means, which needs the observations themselves in order to compute centroids, HACs only need a distance matrix). For example, consider the illustrative data set in Fig. 7.
library(ggplot2)
x1 <- c(1,2,3,4,8,9,10)
x2 <- c(6,1,1,1.5,4.5,5,5)
Point_ID <- as.factor(c(1:7))
dat_7points <- data.frame(x1 = x1, x2 = x2, Point_ID = Point_ID)
( g <- ggplot(data=dat_7points) + geom_point(mapping=aes(x=x1, y=x2), size = 3) +
geom_text(mapping=aes(x=(x1-0.2), y=x2, label = Point_ID)) )
Fig. 7. Illustrative data set (2 clusters and 1 outlier)
Let’s suppose that in a certain iteration of the algorithm we have the partition \(\mathbf{P}_3 =\{ (\mathbf{x}_1), (\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4), (\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\}\), with 3 clusters, one of which is a singleton (the outlier observation). In the next iteration, two of these clusters have to be merged. According to the HAC algorithm, the closest pair of clusters must be merged. According to SL, the distance between \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\) is the smallest pairwise distance between these two clusters, which in the case of the Euclidean distance is \(d(\mathbf{x}_4,\mathbf{x}_5) = 5\). Similarly, the SL distance between \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_1)\) is \(d(\mathbf{x}_2,\mathbf{x}_1) = 5.099\), and the SL distance between \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\) and \((\mathbf{x}_1)\) is \(d(\mathbf{x}_5,\mathbf{x}_1) = 7.158\). Therefore, the closest clusters in this iteration are \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\). These clusters are merged, resulting in a new partition with 2 clusters, namely \(\mathbf{P}_2 =\{ (\mathbf{x}_1), (\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4, \mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\}\), and the algorithm proceeds.
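These pairwise distances are easy to check directly from the distance matrix (a quick verification, assuming dat_7points as defined above):

D <- as.matrix(dist(dat_7points[1:2]))   # Euclidean distance matrix of the 7 points
min(D[2:4, 5:7])   # SL distance between (x2,x3,x4) and (x5,x6,x7), attained by d(x4,x5)
min(D[2:4, 1])     # SL distance between (x2,x3,x4) and (x1), attained by d(x2,x1)
min(D[5:7, 1])     # SL distance between (x5,x6,x7) and (x1), attained by d(x5,x1)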
Function hclust() from the stats package available in base R performs HAC clustering. The SL method can be chosen by setting attribute method to “single”:
DistMatrix <- dist(dat_7points[1:2])
SL_7points <- hclust(DistMatrix, method = "single")
The result is a clustering hierarchy, represented as a tree. The partition with any given number of clusters can be extracted from this tree using function cutree(). The following command extracts and displays all the partitions produced by SL for the data set in Fig. 7, where each row corresponds to an observation, each column corresponds to a partition (one column for each value of \(k\)), and the values are the corresponding cluster labels (always numbered as consecutive integers starting at 1):
cutree(tree = SL_7points, k = 1:7)
## 1 2 3 4 5 6 7
## [1,] 1 1 1 1 1 1 1
## [2,] 1 2 2 2 2 2 2
## [3,] 1 2 2 2 2 2 3
## [4,] 1 2 2 2 3 3 4
## [5,] 1 2 3 3 4 4 5
## [6,] 1 2 3 4 5 5 6
## [7,] 1 2 3 4 5 6 7
This functionality is interesting when we are interested in a particular partition with \(k\) clusters. For instance, the 3rd column from the left, which corresponds to \(k = 3\), shows that the partition is composed of a singleton (the 1st observation, labeled “1”) and two other clusters with 3 observations each (the 2nd, 3rd, and 4th observations in one cluster, labeled “2”; the 5th, 6th, and 7th observations in the other cluster, labeled “3”). However, it is not easy to track the sequence of mergers across different partitions, especially if there are many observations. In other words, it is not trivial to visualise the clustering hierarchy (tree).
The clustering hierarchy can be conveniently visualised as a dendrogram, which is an “upside-down tree” where the observations are placed at the bottom (leaves), clusters are represented as vertical lines, and their mergers are represented by horizontal lines placed at the level (height) of the scale corresponding to the distance between the clusters being merged. The dendrogram for the SL algorithm in our example in Fig. 7 is displayed in Fig. 8:
plot(SL_7points, main = "Single Linkage", xlab = "", sub = "", hang = -1)
Fig. 8. SL dendrogram for the data set in Fig. 7
Notice that the dendrogram is very rich in information about the data. We can see that observations 2 and 3 merge at distance 1, and these two are then merged with observation 4, which is just a little further away. The same can be seen for observations 6 and 7, merged at distance 1 and then merged with observation 5, which is a little further away. Then these two clusters are merged at a large distance (5), which indicates that they are far away from each other relative to their internal distances. Finally, observation 1 is merged last, also at a large distance, which means that this observation is far away from everything else, i.e., an outlier. So, just by looking at the dendrogram we can see that there are two more prominent clusters and one outlier in the data. In cases where the data has more than two dimensions and, as such, cannot be visualised as in Fig. 7, this is very convenient. Indeed, the dendrogram is a powerful visualisation tool, as it allows data sets of any dimensionality to be visualised in two dimensions.
For example, the iris data set has 4 dimensions, if we ignore the class labels (5th variable, Species). Fig. 9 shows the SL dendrogram for this data set.
DMatrix <- dist(iris[1:4])
SL_iris <- hclust(DMatrix, method = "single")
plot(SL_iris, main = "Single Linkage", xlab = "", sub = "", hang = -1, labels = FALSE)
Fig. 9. SL dendrogram for the iris data set
The dendrogram in Fig. 9 clearly suggests that there are two more prominent clusters in the data, far away from each other relative to their internal distances. If we display their class labels (5th variable, Species, which has not been used for clustering!) at the leaves of the dendrogram, we can see that, as expected from our previous analyses of this data set, one of the clusters corresponds to class Setosa, which is well separated from the others, while the other cluster corresponds to a merger of classes Virginica and Versicolor, which overlap each other. This can be seen in Fig. 10.
plot(SL_iris, main = "Single Linkage", xlab = "", sub = "", hang = -1, labels = iris$Species, cex = 0.6)
Fig. 10. SL dendrogram for the iris data set (class labels displayed)
Exercise: Compute and plot the SL dendrogram for the data set in Fig. 1. Do you see 4 more noticeable clusters, two of which are closer to each other, as well as some outliers to those clusters, which is exactly what you see by looking directly at the data in Fig. 1?
In the Complete Linkage (CL) algorithm, rather than taking the smallest pairwise distance as a measure of dissimilarity between two clusters, we take the largest. Recall our example in Fig. 7, and suppose once again that in a certain iteration of the algorithm we have the partition \(\mathbf{P}_3 =\{ (\mathbf{x}_1), (\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4), (\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\}\). According to CL, the distance between \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\) is the largest pairwise distance between these two clusters, which in case of Euclidean distance is \(d(\mathbf{x}_2,\mathbf{x}_7) = 8.944\). Similarly, the CL distance between \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_1)\) is \(d(\mathbf{x}_4,\mathbf{x}_1) = 5.408\), and the CL distance between \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\) and \((\mathbf{x}_1)\) is \(d(\mathbf{x}_7,\mathbf{x}_1) = 9.055\). Therefore, the closest clusters in this iteration are \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_1)\). These clusters are merged, resulting in a new partition with 2 clusters, namely \(\mathbf{P}_2 =\{ (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4), (\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\}\), and the algorithm proceeds.
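As with SL, these CL distances can be verified from the distance matrix (assuming dat_7points from above):

D <- as.matrix(dist(dat_7points[1:2]))   # Euclidean distance matrix of the 7 points
max(D[2:4, 5:7])   # CL distance between (x2,x3,x4) and (x5,x6,x7), attained by d(x2,x7)
max(D[2:4, 1])     # CL distance between (x2,x3,x4) and (x1), attained by d(x4,x1)
max(D[5:7, 1])     # CL distance between (x5,x6,x7) and (x1), attained by d(x7,x1)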
The dendrogram for the CL algorithm in our example in Fig. 7 is displayed in Fig. 11:
CL_7points <- hclust(DistMatrix, method = "complete")
plot(CL_7points, main = "Complete Linkage", xlab = "", sub = "", hang = -1)
Fig. 11. CL dendrogram for the data set in Fig. 7
Exercise: Compute and plot the CL dendrogram for the data set in Fig. 1.
Exercise: Compute and plot the CL dendrogram for the iris data set. Note: Use only the numerical variables iris[1:4] for the clustering. Use the 5th variable Species only to display the class labels at the bottom of the dendrogram.
In the Average Linkage (AL) algorithm, rather than taking the smallest or largest pairwise distance as a measure of dissimilarity between two clusters, we take the average of the pairwise distances. Recall our example in Fig. 7, and suppose once again that in a certain iteration of the algorithm we have the partition \(\mathbf{P}_3 =\{ (\mathbf{x}_1), (\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4), (\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\}\). According to AL, the distance between \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\) is the average pairwise distance between these two clusters, which in case of Euclidean distance is:
\[\displaystyle \frac{d(\mathbf{x}_2,\mathbf{x}_5) + d(\mathbf{x}_2,\mathbf{x}_6) + d(\mathbf{x}_2,\mathbf{x}_7) + d(\mathbf{x}_3,\mathbf{x}_5) + d(\mathbf{x}_3,\mathbf{x}_6) + d(\mathbf{x}_3,\mathbf{x}_7) + d(\mathbf{x}_4,\mathbf{x}_5) + d(\mathbf{x}_4,\mathbf{x}_6) + d(\mathbf{x}_4,\mathbf{x}_7)}{9} = 7.042\]
Similarly, the AL distance between \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_1)\) is:
\[\displaystyle \frac{d(\mathbf{x}_2,\mathbf{x}_1) + d(\mathbf{x}_3,\mathbf{x}_1) + d(\mathbf{x}_4,\mathbf{x}_1)}{3} = 5.297\]
Finally, the AL distance between \((\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\) and \((\mathbf{x}_1)\) is:
\[\displaystyle \frac{d(\mathbf{x}_5,\mathbf{x}_1) + d(\mathbf{x}_6,\mathbf{x}_1) + d(\mathbf{x}_7,\mathbf{x}_1)}{3} = 8.092\]
D <- as.matrix(dist(dat_7points[1:2]))
( (D[2,5]+D[2,6]+D[2,7]+D[3,5]+D[3,6]+D[3,7]+D[4,5]+D[4,6]+D[4,7])/9 )
## [1] 7.042099
( (D[2,1]+D[3,1]+D[4,1])/3 )
## [1] 5.297504
( (D[5,1]+D[6,1]+D[7,1])/3 )
## [1] 8.092184
Therefore, the closest clusters in this iteration are \((\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)\) and \((\mathbf{x}_1)\). These clusters are merged, resulting in a new partition with 2 clusters, namely \(\mathbf{P}_2 =\{ (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4), (\mathbf{x}_5, \mathbf{x}_6, \mathbf{x}_7)\}\), and the algorithm proceeds.
The dendrogram for the AL algorithm in our example in Fig. 7 is displayed in Fig. 12:
AL_7points <- hclust(DistMatrix, method = "average")
plot(AL_7points, main = "Average Linkage", xlab = "", sub = "", hang = -1)
Fig. 12. AL dendrogram for the data set in Fig. 7
Exercise: Compute and plot the AL dendrogram for the data set in Fig. 1.
Exercise: Compute and plot the AL dendrogram for the iris data set. Note: Use only the numerical variables iris[1:4] for the clustering. Use the 5th variable Species only to display the class labels at the bottom of the dendrogram.
Ward’s method also follows the generic HAC algorithm, but the distance between clusters used by the algorithm is not as simple as those used by SL, CL and AL. Ward’s defines the distance between two clusters basically as the increment that would be observed in the total within-cluster variance — essentially, the SSE in Eq. (1) — if those clusters were merged. If two clusters are merged, the variance of the resulting cluster will not be smaller than the variance of the two individual clusters taken separately. Since in each iteration of the HAC algorithm we want to merge the closest pair of clusters, defining the distance between clusters as the increment in variance resulting from their merger means that Ward’s always merges the pair of clusters that minimises this increment. From this perspective, Ward’s and k-Means exhibit some resemblance.
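For reference, the increment in SSE caused by merging two clusters \(\mathbf{C}_i\) and \(\mathbf{C}_j\), i.e. the quantity Ward’s minimises at each step under the squared Euclidean distance of Eq. (1), can be written in closed form as

\[ \Delta SSE(\mathbf{C}_i, \mathbf{C}_j) = SSE(\mathbf{C}_i \cup \mathbf{C}_j) - SSE(\mathbf{C}_i) - SSE(\mathbf{C}_j) = \frac{|\mathbf{C}_i|\,|\mathbf{C}_j|}{|\mathbf{C}_i| + |\mathbf{C}_j|}\, \|\mathbf{\bar{x}}_i - \mathbf{\bar{x}}_j\|^2 \]

where \(\mathbf{\bar{x}}_i\) and \(\mathbf{\bar{x}}_j\) are the centroids of the two clusters; in other words, Ward’s merges the pair of clusters whose centroids are closest, after weighting by the cluster sizes (the "ward.D2" option of hclust() used below implements this classical criterion from plain Euclidean distances).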
The dendrogram for Ward’s algorithm in our example in Fig. 7 is displayed in Fig. 13:
Ward_7points <- hclust(DistMatrix, method = "ward.D2")
plot(Ward_7points, main = "Ward's", xlab = "", sub = "", hang = -1)
Fig. 13. Ward’s dendrogram for the data set in Fig. 7
Exercise: Compute and plot Ward’s dendrogram for the data set in Fig. 1.
Exercise: Compute and plot Ward’s dendrogram for the iris data set. Note: Use only the numerical variables iris[1:4] for the clustering. Use the 5th variable Species only to display the class labels at the bottom of the dendrogram.
When compared with a partitional algorithm such as k-Means, HACs provide much more information, which can be nicely visualised and interpreted as a dendrogram. Besides, HACs don’t require any critical parameter, such as the number of clusters \(k\) in k-Means, and only require pairwise distances between observations to operate. If a partition is desired, there is a variety of criteria that can be applied to automatically select the best \(k\) (so-called cut through the dendrogram), such as, for instance, the SWC criterion. These are the main pros of HACs.
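As a sketch of how such an automatic selection could look, we can compute the SWC of the partitions obtained by cutting a dendrogram at different values of \(k\) and keep the best one; the code below reuses the SWC() helper defined earlier and the SL hierarchy SL_iris computed above (the object name swc_k is ours):

swc_k <- sapply(2:10, function(k) SWC(clusterLabels = cutree(SL_iris, k = k), dataPoints = iris[1:4]))
plot(2:10, swc_k, xlab = "k", ylab = "SWC", type = "b")   # choose the k with the largest SWC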
On the other hand, there are two major cons. The first one is computational: HACs operate with pairwise distances, so at least \(O(N^2)\) operations and memory units are required (in contrast to \(O(N)\) in k-Means), where \(N\) is the number of observations (data set size). For large data sets, this best case quadratic (possibly cubic) computational burden may be prohibitive, and the analyst may be forced to work with a sub-sample of the data, or adopt more sophisticated, possibly approximate implementations of the algorithms.
The second major drawback is the choice of a particular HAC instance, such as SL, CL, AL or Ward’s. In this case, there are some guidelines. SL is the only one of these algorithms capable of identifying arbitrarily shaped clusters; the others favour volumetric, globular clusters such as the ones in Fig. 1. However, SL is very sensitive to noise, especially “background noise” that may create artificial “bridges” between clusters, causing the chaining-like linkage of SL to perform undesired mergers. CL prevents the bridging effect of SL, but it is sensitive to outliers (which may severely distort the largest distance between observations used by CL) and exhibits a tendency to break very large clusters. AL and Ward’s are usually deemed the most stable and reliable methods overall, but they cannot detect the arbitrarily shaped clusters that can ideally be detected by SL (if the data is noise-free).
From the discussion above, one may wonder whether it is possible to have a clustering algorithm that can detect arbitrarily shaped clusters while being robust to noise. Density-based clustering algorithms do just that. These algorithms detect clusters as regions of higher density of observations separated by regions of lower density (see for example the data set DS3 from package dbscan in Fig. 14-a, where six prominent density-based clusters are embedded in background noise).
The most well-known density-based clustering algorithm is DBSCAN. The conceptual idea behind the DBSCAN algorithm is very simple. Essentially, for each observation \(\mathbf{x}_i\), the algorithm determines how many observations fall within a distance of at most \(eps\) from \(\mathbf{x}_i\), i.e., how many observations there are in the \(eps\)-neighbourhood of \(\mathbf{x}_i\). In a two-dimensional space (a plane), when the Euclidean distance is used, the \(eps\)-neighbourhood of \(\mathbf{x}_i\) is just the region within a circle of radius \(eps\) centred at \(\mathbf{x}_i\). If there are at least \(minPts\) observations in the neighbourhood of \(\mathbf{x}_i\), then \(\mathbf{x}_i\) is considered a core point.
So, a core point is an observation that is dense in the sense that it has \(minPts\) or more observations in its vicinity and, therefore, satisfies the minimum density threshold. Any core point is part of a cluster. Core points that fall within each other’s \(eps\)-neighbourhood are said to be connected and, therefore, are part of the same cluster. Clusters are maximal sets of connected core points. Observations that are not core (i.e. not dense) but are within the \(eps\)-neighbourhood of a core point are also incorporated into the corresponding cluster. These are called border points, because they are usually located around the border of clusters. The remaining observations (which are neither core nor border) are deemed noise; these are treated as outliers and, therefore, are not included in any cluster.
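As a tiny illustration of these concepts, the sketch below counts the \(eps\)-neighbours of each observation in the 7-point data set of Fig. 7 and flags the core points; the values eps = 1.5 and minPts = 3 are chosen here purely for illustration:

eps <- 1.5; minPts <- 3                  # illustrative parameter values only
D <- as.matrix(dist(dat_7points[1:2]))   # pairwise Euclidean distances
n_nb <- rowSums(D <= eps)                # size of each eps-neighbourhood (the point itself is included)
core <- n_nb >= minPts                   # core points satisfy the minimum density threshold
data.frame(Point_ID = dat_7points$Point_ID, n_nb, core)

With these illustrative values, only the middle point of each group (points 3 and 6) is dense enough to be a core point; the remaining points of the two groups are border points, and the outlier (point 1) is noise.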
The following code runs DBSCAN on data set DS3 (Fig. 14-a) using function dbscan() from library dbscan. It sets the algorithm’s parameters to \(eps = 11\) and \(minPts = 25\).
set.seed(0)
library(dbscan)
data("DS3")
dbs_b <- dbscan(DS3, eps = 11, minPts = 25)
dbs_b
## DBSCAN clustering for 8000 objects.
## Parameters: eps = 11, minPts = 25
## The clustering contains 6 cluster(s) and 665 noise points.
##
## 0 1 2 3 4 5 6
## 665 1806 651 976 1677 652 1573
##
## Available fields: cluster, eps, minPts
dbs_b$cluster[1:30]
## [1] 1 2 1 2 3 0 1 1 3 4 4 1 1 4 1 5 5 6 4 6 4 1 2 6 5 0 5 3 1 3
Variable cluster stores the resulting cluster label for each observation (only the labels of the first 30 observations are displayed above). If two or more observations have the same non-zero label (1 to 6 in the above example), then those observations belong to the same cluster with that label. A label of zero (0) means that the corresponding observation was identified as an outlier (noise), hence it was not clustered.
Notice that 6 clusters have been identified, and 665 observations have been left unclustered as noise. The result is shown in Fig. 14-b, where clusters are displayed in different colours and noise observations are displayed in black. This is a very good result, which could not be obtained with any of the other algorithms previously discussed in this module. However, this result was obtained with a very particular combination of values for \(minPts\) and \(eps\), carefully selected with a trial-and-error procedure that was only possible because this data set (and the clustering results) can be visualised in two dimensions. In practical applications of clustering involving \(n\)-dimensional data, however, this is not possible. So, how to set the algorithm’s parameters?
Nowadays, there are theoretical grounds to argue that parameter \(minPts\) is not critical, in the sense that small changes in this parameter are not expected to cause significant changes in the results. In other words, it is expected that similar results are obtained for a range of values of \(minPts\).
Unfortunately, the same cannot be said about \(eps\). This is a real-valued parameter (as opposed to \(minPts\), which is an integer) that is very critical, in the sense that minor changes in its value can easily cause major changes in DBSCAN’s results. In other words, DBSCAN is very sensitive to \(eps\). To show this, we will keep the value of \(minPts\) fixed at 25 and slightly change \(eps\) from 11 to 10, and then from 10 to 9, in our previous example with the DS3 data. The results are shown in Fig. 14-c and 14-d, respectively.
dbs_c <- dbscan(DS3, eps = 10, minPts = 25)
dbs_c
## DBSCAN clustering for 8000 objects.
## Parameters: eps = 10, minPts = 25
## The clustering contains 8 cluster(s) and 817 noise points.
##
## 0 1 2 3 4 5 6 7 8
## 817 1781 633 958 299 1247 622 1554 89
##
## Available fields: cluster, eps, minPts
dbs_d <- dbscan(DS3, eps = 9, minPts = 25)
dbs_d
## DBSCAN clustering for 8000 objects.
## Parameters: eps = 9, minPts = 25
## The clustering contains 21 cluster(s) and 1548 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1548 649 252 713 1025 300 369 63 514 790 287 216 250 89 158
## 15 16 17 18 19 20 21
## 107 467 41 91 36 18 17
##
## Available fields: cluster, eps, minPts
Notice that when \(eps\) is set to 10, one of the clusters is broken apart into 3 sub-clusters, thus the solution changes from 6 to 8 clusters. When \(eps\) is further reduced to 9, several clusters are fragmented into a total of 21 sub-clusters, a very poor solution for this data set.
Parameter \(minPts\), in contrast, can be varied in a wide range around 25 without major changes being observed in the results. This can be seen in Fig. 14-e and 14-f, where \(eps\) has been fixed at 11 whereas \(minPts\) has been changed (from 25) to 21 and 29, respectively. Note that, apart from the number of outliers, the solutions are not significantly different, noticeably recovering the 6 main clusters in the data very well.
dbs_e <- dbscan(DS3, eps = 11, minPts = 21)
dbs_e
## DBSCAN clustering for 8000 objects.
## Parameters: eps = 11, minPts = 21
## The clustering contains 6 cluster(s) and 594 noise points.
##
## 0 1 2 3 4 5 6
## 594 1819 660 988 1691 660 1588
##
## Available fields: cluster, eps, minPts
dbs_f <- dbscan(DS3, eps = 11, minPts = 29)
dbs_f
## DBSCAN clustering for 8000 objects.
## Parameters: eps = 11, minPts = 29
## The clustering contains 6 cluster(s) and 755 noise points.
##
## 0 1 2 3 4 5 6
## 755 1795 640 963 1650 633 1564
##
## Available fields: cluster, eps, minPts
par(mfrow = c(3, 2))
color_1to8 <- function(x) ifelse(x==0,1,((x-1)%%7)+2)
plot(DS3, pch=19, cex=0.5, main="(a) DS3 Data from the 'dbscan' Library", xlab="x1", ylab="x2")
plot(DS3, pch=19, cex=0.5, col=color_1to8(dbs_b$cluster), main="(b) DBSCAN DS3 data [minpts = 25, eps = 11]: 6 clusters", xlab="x1", ylab="x2")
plot(DS3, pch=19, cex=0.5, col=color_1to8(dbs_c$cluster), main="(c) DBSCAN DS3 data [minpts = 25, eps = 10]: 8 clusters", xlab="x1", ylab="x2")
plot(DS3, pch=19, cex=0.5, col=color_1to8(dbs_d$cluster), main="(d) DBSCAN DS3 data [minpts = 25, eps = 9]: 21 clusters", xlab="x1", ylab="x2")
plot(DS3, pch=19, cex=0.5, col=color_1to8(dbs_e$cluster), main="(e) DBSCAN DS3 data [minpts = 21, eps = 11]: 6 clusters", xlab="x1", ylab="x2")
plot(DS3, pch=19, cex=0.5, col=color_1to8(dbs_f$cluster), main="(f) DBSCAN DS3 data [minpts = 29, eps = 11]: 6 clusters", xlab="x1", ylab="x2")
Fig. 14. DBSCAN solutions for the DS3 data set, with different values of minpts and eps
Exercise: Use trial-and-error to determine a suitable combination of values for \(minPts\) and \(eps\) that allows DBSCAN to reasonably recover the 4 clusters in the data set of Fig. 1. Plot your solution with the detected clusters in different colours and the noise points in black.
It is worth remarking that, even though there exist some heuristics that attempt to estimate a good value for \(eps\), these heuristics are very limited in the sense that they only work in ideal scenarios where a number of restrictive assumptions hold true, which is hardly the case in practice. Rather than discussing those, in what follows we briefly present HDBSCAN*, a modern successor to DBSCAN that does not require \(eps\). HDBSCAN* was co-invented by the designer of this subject and creator of its original content (Prof. Ricardo J. G. B. Campello), as well as one of the co-inventors of DBSCAN (Prof. Joerg Sander).
HDBSCAN* is a state-of-the-art density-based clustering algorithm (Campello, Moulavi, and Sander 2013), which has also been extended as a complete framework for unsupervised and semi-supervised data clustering, outlier detection, and advanced visualisations (Campello et al. 2015). Here, we focus on the core unsupervised clustering algorithm, as well as basic visualisations that are available in R.
HDBSCAN* is a hierarchical version of DBSCAN. For a given value of \(minPts\), it computes in an efficient way (equivalent to computing a minimum spanning tree of a graph) a hierarchy of DBSCAN-like partitions for all possible values of \(eps \in [0, \infty)\). Each hierarchical level corresponds precisely to a DBSCAN* solution (DBSCAN without the border points, which can be added afterwards if desired) associated with the range of \(eps\) values that produce that unique solution. It is fascinating that this myriad of DBSCAN-like solutions can be computed automatically without the need to specify \(eps\), possibly as fast as a single run of DBSCAN (not including the visualisations).
Even more interesting, the HDBSCAN* hierarchy can be post-processed to produce the so-called simplified hierarchy, which is a summarised cluster tree capturing only the most prominent density-based clusters in the data. The following code computes HDBSCAN* and plots the simplified hierarchy for the DS3 data set (note that only \(minPts\) has been specified):
hdbs <- hdbscan(DS3, minPts = 25)
plot(hdbs)
Fig. 15. HDBSCAN* simplified cluster tree for the DS3 data set (\(minPts\) = 25)
Clearly, there are two major clusters, which break into two sub-clusters each, one of which is further split into three sub-clusters (totalling six clusters at this stage), which are then fragmented and eventually vanish away as noise. As opposed to a traditional dendrogram, which would have thousands of branches that are hard to visualise and interpret (this data set has 8000 observations), the HDBSCAN* simplified cluster tree focuses on revealing only the most prominent clustering structures. Notice also that, as opposed to traditional dendrograms, the horizontal lines representing clusters have different thicknesses, which are proportional to the number of dense (core) objects in the respective clusters at the corresponding level of the scale (height \(eps\)). Since objects may become noise below a certain value of \(eps\), clusters may “shrink” as we move downwards in the tree (i.e., as \(eps\) decreases). That’s why the thickness of the horizontal lines may narrow as we move from top to bottom, possibly vanishing when all objects of a cluster become noise (the leaves of the tree).
If desired, solutions corresponding to specific hierarchical levels can be extracted from the HDBSCAN* hierarchy in a way that is analogous to the horizontal cuts through dendrograms previously discussed in this module. For details on how this can be achieved using function cutree() in R, we refer the reader to the dbscan::hdbscan() vignette, available on CRAN.
The reason why we won’t discuss horizontal cuts in the context of HDBSCAN* is that the algorithm comes with an optional built-in post-processing optimisation routine — called FOSC (Campello et al. 2013) — that automatically extracts a partition from the clustering hierarchy by applying non-horizontal cuts based on statistical quality (stability) scores assigned to clusters. Such local, non-horizontal cuts allow the algorithm to find solutions corresponding to multiple \(eps\) values in different regions of the data. This makes HDBSCAN* able to detect clusters of highly varied densities, which is not possible with DBSCAN, as DBSCAN relies on a single (global) density threshold \(eps\).
The optimal partition is readily available in the result provided by the hdbscan() function. It is visualised in Fig. 16 for the DS3 data:
hdbs
## HDBSCAN clustering for 8000 objects.
## Parameters: minPts = 25
## The clustering contains 6 cluster(s) and 815 noise points.
##
## 0 1 2 3 4 5 6
## 815 1585 617 614 924 1656 1789
##
## Available fields: cluster, minPts, cluster_scores,
## membership_prob, outlier_scores, hc
plot(DS3, pch=19, cex=0.5, col=color_1to8(hdbs$cluster), main="HDBSCAN* DS3 data (minpts = 25): 6 clusters", xlab="x1", ylab="x2")
Fig. 16. HDBSCAN* (optional) partition for the DS3 data set (\(minPts\) = 25)
Notice that a traditional dendrogram can also be drawn from the HDBSCAN* hierarchy if desired, by plotting variable hc (as shown below for the iris data), but this type of visualisation is only doable for relatively small data sets:
hdbs <- hdbscan(iris[1:4], minPts = 5)
hdbs
## HDBSCAN clustering for 150 objects.
## Parameters: minPts = 5
## The clustering contains 2 cluster(s) and 0 noise points.
##
## 1 2
## 100 50
##
## Available fields: cluster, minPts, cluster_scores,
## membership_prob, outlier_scores, hc
plot(hdbs$hc, main="HDBSCAN* Hierarchy", xlab = "", sub = "", hang = -1, labels = iris$Species, cex = 0.6)
Fig. 17. HDBSCAN* traditional dendrogram for the iris data set (\(minPts\) = 5)
Exercise: Plot the HDBSCAN* simplified cluster tree for the iris data set.
Exercise: Compute HDBSCAN* for the data set in Fig. 1, with \(minPts = 10\), and: (a) plot the simplified cluster tree; (b) plot the extracted partition (similarly to Fig. 16, with clusters in colours and noise in black).
hdbs <- hdbscan(dat, minPts = 10)
plot(hdbs)
hdbs
## HDBSCAN clustering for 350 objects.
## Parameters: minPts = 10
## The clustering contains 4 cluster(s) and 28 noise points.
##
## 0 1 2 3 4
## 28 51 71 100 100
##
## Available fields: cluster, minPts, cluster_scores,
## membership_prob, outlier_scores, hc
plot(dat, pch=19, cex=0.5, col=color_1to8(hdbs$cluster), main="HDBSCAN* Data Set Fig. 1 (minpts = 10): 4 clusters", xlab="x1", ylab="x2")
Utilising R/RStudio, reproduce the experimental procedures and results described in Sections 10.5 and 10.6.2 of James et al. (2013), “Lab 2: Clustering” and “Lab 3: Clustering the Observations of the NCI60 Data”. The latter involves the cancer cell line microarray data, NCI60, from the ISLR package.
Campello, R.J.G.B., D. Moulavi, and J. Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 160–72.
Campello, R.J.G.B., D. Moulavi, A. Zimek, and J. Sander. 2013. “A Framework for Semi-Supervised and Unsupervised Optimal Extraction of Clusters from Hierarchies.” Data Mining and Knowledge Discovery 27 (3): 344–71.
———. 2015. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 10 (1): 5:1–5:51.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer.
Tan, P.-N., M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley.