Recall that in probabilistic models for classification we usually aim at predicting \(P(Y = y | X = x)\), which is the posterior probability that the dependent variable \(Y\) belongs to class \(y\) given known values \(x\) for the predictor(s) \(X\). We learned that Naive Bayes, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) use an indirect approach to estimate \(P(Y = y | X = x)\), based on the Bayes theorem. In the following, we discuss two classic models for classification that estimate \(P(Y = y | X = x)\) in a direct rather than indirect way: Logistic Regression and K-Nearest Neighbours.
Although there are extensions of logistic regression to more general classification problems, the most common model assumes a classification setting where the dependent variable is binary (i.e. there are two classes only) and the predictors are all numeric. In the following discussion, we consider this setting. In particular, we assume that there is a positive class (+), which is a class of particular interest for the analysis, and a negative class (\(-\)). In health applications, for instance, the positive class is commonly associated with samples testing positive in a certain exam, usually indicating a positive diagnosis for a certain disease, whereas the negative class is typically the “normal” or “healthy” class.
In our notation, \(X = x\) means that the \(n\) predictors, \(X = \{ X_{1}, \cdots, X_{n} \}\), are taking particular numerical values, \(x = \{x_1, \cdots, x_n\}\), i.e., \(X_1 = x_1, X_2 = x_2, \cdots, X_n = x_n\). The dependent variable, in turn, can assume two different class values only, i.e. \(Y \in \{+,-\}\). For mathematical convenience, we can re-code the class values numerically as \(+1\) for the positive class and \(-1\) for the negative class. By doing this, any observation is supposed to have a class label \(Y = y\), where \(y \in \{+1,-1\}\).
Since the classes are now discriminated by the sign of the dependent variable (positive or negative unit value), a first, simple idea to build a classifier is to model \(Y\) as a linear (affine) function of the predictors, i.e., \(\hat{Y} = \hat{f}(X) = \beta_0 + \beta_1 \cdot X_1 + \cdots + \beta_n \cdot X_n\). This way, given a training data set \((x(i), y(i))_{i=1}^{N}\) with \(N\) observations, we could fit the model coefficients (\(\beta_0, \beta_1, \cdots, \beta_n\)) such that \(\hat{Y}\) approximates the true class labels \(y(i) \in \{+1,-1\}\) as accurately as possible. Once this model has been trained, given a new observation \(i\) for which only the predictor values \(x(i)\) are available, we can predict its unknown class label \(y(i)\) as:
Equation (1): \[ Y = y(i) = \left \{ \begin{array}{l} +1 {\rm ~~(positive~~class)~~ if~~} \hat{Y} = \hat{f}(X = x(i)) = \beta_0 + \beta_1 \cdot x_1(i) + \cdots + \beta_n \cdot x_n(i) > 0 \\ -1 {\rm ~~(negative~~class)~~ if~~} \hat{Y} = \hat{f}(X = x(i)) = \beta_0 + \beta_1 \cdot x_1(i) + \cdots + \beta_n \cdot x_n(i) < 0 \end{array} \right. \]
As an illustrative example, let us build an artificial data set with 20 observations, 10 from each class (positive, +1, or negative, -1), described by a single predictor, x.
set.seed(0)
x0_10 <- runif(10, min=0, max=60); x11_20 <- runif(10, min=40, max=100)
y0_10 <- rep(-1,10); y11_20 <- rep(+1,10)
( dat <- data.frame(x=c(x0_10,x11_20), y=c(y0_10,y11_20)) )
## x y
## 1 53.80183 -1
## 2 15.93052 -1
## 3 22.32743 -1
## 4 34.37120 -1
## 5 54.49247 -1
## 6 12.10092 -1
## 7 53.90338 -1
## 8 56.68052 -1
## 9 39.64787 -1
## 10 37.74684 -1
## 11 43.70718 1
## 12 52.35847 1
## 13 50.59341 1
## 14 81.22137 1
## 15 63.04622 1
## 16 86.19049 1
## 17 69.86195 1
## 18 83.05711 1
## 19 99.51437 1
## 20 62.80211 1
Notice that observations from the negative (positive) class tend to have smaller (larger) values of x, as displayed in Fig. 1.
library(ggplot2)
class <- as.factor(dat$y)
( g <- ggplot() + geom_point(data=dat, mapping=aes(x=x, y=y, shape=class)) )
Fig. 1. Illustrative binary classification example: data
By fitting a linear model to the data and performing the classification as described above, we get the result in Fig. 2. The straight line is the output of the regression model, \(\hat{Y}\). The colours represent the predicted labels according to Eq. (1): light blue observations are classified as positive (\(\hat{Y} > 0\)), while black observations are classified as negative (\(\hat{Y} < 0\)). Therefore, light blue circles would be false positives, while black triangles would be false negatives according to the linear classifier.
linear_model <- lm(y ~., data=dat)
new_dat <- data.frame(x=seq(from=0, to=100, by=0.1))
Y_hat_new_dat <- predict.lm(linear_model, newdata=new_dat)
g2 <- g + geom_point(data=data.frame(x=new_dat, y=Y_hat_new_dat), mapping=aes(x=x, y=y))
Y_hat_dat <- predict.lm(linear_model, newdata=NULL)
Y_dat <- sign(Y_hat_dat)
( g2 <- g2 + geom_point(data=dat, mapping=aes(x=x, y=y, shape=class, colour=Y_dat), show.legend=FALSE) )
Fig. 2. Illustrative binary classification example: linear regression classifier — Eq. (1)
Apart from a number of potential practical issues that are beyond the scope of the current discussion, the linear regression classifier described above is, in principle, a valid approach if one simply wants to assign unlabelled observations to one of the two candidate classes.
In logistic regression, however, we don’t just want a classifier that assigns a new observation to one of the classes. In fact, for the sake of interpretability, we do want to estimate \(P(Y = y | X = x)\). For instance, given the results (\(x\)) of a number of medical exams (\(X\)), the analyst may want to estimate the probability \(P(Y = y| X = x)\) that a patient with those results will survive (\(y = +1\)).
In its original form, the linear model \(\hat{Y} = \hat{f}(X)\) described above is not a good model for \(P(Y = y| X = x)\) because probabilities by definition must fall within the interval \([0,1]\), yet \(\hat{Y} = \hat{f}(X)\) is not bounded (\(\hat{Y} \in~ ]-\infty,\infty[\)). Logistic regression addresses this problem by “squeezing” \(\hat{Y} = \hat{f}(X)\) within \([0,1]\), using the following mathematical transformation of \(\hat{Y}\) as a model for \(P(Y = +1| X = x)\):
Equation (2): \[ \large P(Y = +1|X = x) = \frac{e^{\hat{f}(X=x)}}{1 + e^{\hat{f}(X=x)}} = \frac{1}{1 + e^{-\hat{f}(x)}}\]
Notice that, since we are considering a binary classification problem, we only need to explicitly model \(P(Y = +1| X = x)\); \(P(Y = -1| X = x) = 1 - P(Y = +1| X = x)\) then follows in a straightforward way.
Fig. 3 shows the effect of applying Eq. (2) to the linear model in Fig. 2. Notice that the colours have now been assigned using the 0.5 probability threshold, i.e., the observations have been classified as positive (light blue) if \(P(Y = +1| X = x) > 0.5\) and as negative (black) otherwise.
LR_fun <- function(v){1/(1 + exp(-v))}
P_positive_new_dat <- LR_fun(Y_hat_new_dat)
g3 <- g + geom_point(data=data.frame(x=new_dat, y=P_positive_new_dat), mapping=aes(x=x, y=y))
P_positive_dat <- LR_fun(Y_hat_dat)
Y_dat <- sign(P_positive_dat - 0.5)
( g3 <- g3 + geom_point(data=dat, mapping=aes(x=x, y=y, shape=class, colour=Y_dat), show.legend=FALSE) )
Fig. 3. Illustrative binary classification example: non-linear transformation of linear regression classifier — Eq. (2)
The classification result for the training data in Fig. 3 is exactly the same as the one in Fig. 2; the difference is that we now have an explicit model for \(P(Y = +1| X = x)\). However, looking at the model in Fig. 3 we can see that, although \(P(Y = +1| X = x)\) falls within the probability interval \([0,1]\), its shape does not align with the intuition that \(P(Y = +1| X = x)\) should approach zero (one) more quickly as we move away from around \(x = 40\) (\(x = 60\)) to the left (right), since we no longer find any positive (negative) observations lying on that extreme of the input space. The reason why the model does not fit the data so well lies in the fact that we have not taken Eq. (2) into account while fitting the parameters (\(\beta_0, \beta_1, \cdots, \beta_n\)) of \(\hat{Y} = \hat{f}(X)\). Rather, we fitted the linear model first, and only then applied Eq. (2) to the fitted model (a posteriori).
Logistic regression fits the model in Eq. (2) directly to data. Fig. 4 displays a logistic regression model trained using the data set in our example. The model is derived as a particular case (family = binomial) of the generalized linear model function glm() available from the stats package. The probabilities are obtained using function predict.glm() with the argument type="response". If newdata is not provided, then predict.glm() makes predictions for the training data.
dat <- cbind(dat, class)
LR_model <- glm(formula=class~x, data=dat, family=binomial)
P_positive_new_dat <- predict.glm(LR_model, newdata=new_dat, type="response")
g4 <- g + geom_point(data=data.frame(x=new_dat, y=P_positive_new_dat), mapping=aes(x=x, y=y))
P_positive_dat <- predict(LR_model, type="response")
Y_dat <- sign(P_positive_dat - 0.5)
( g4 <- g4 + geom_point(data=dat, mapping=aes(x=x, y=y, shape=class, colour=Y_dat), show.legend=FALSE) )
Fig. 4. Illustrative binary classification example: logistic regression classifier
Notice in Fig. 4 that the classification result for the training data (applying the 0.5 probability threshold) is the same as before; however, the “S-shaped” logistic regression model now clearly fits the (training) data better.
Different strategies can be used to estimate the parameters of the logistic regression model from data. A common approach, which is of particular interest here as it sheds light on certain aspects of the model interpretability, is called likelihood maximisation. Basically, the idea is to maximise the total probability (or likelihood) that, under the model assumptions, the observations are assigned their correct class labels, given their known values for the predictor(s). In other words, given a set of \(N\) training observations \((x(i), y(i))_{i=1}^{N}\), we want to maximise the total probability, according to the model in Eq. (2), that the class labels are in fact the observed labels \(y(i)\) for \(i = 1, \cdots, N\). Under the usual assumption that the training sample is i.i.d. (independent and identically distributed), this total probability is given by the product of the individual probabilities, \(P(Y = y(i)|X = x(i))\), that the class label for the \(i\)th observation is the correct label \(y(i)\), for every observation \(i = 1, \cdots, N\). This means that the parameters of the logistic regression model (\(\beta_0, \beta_1, \cdots, \beta_n\)) can be optimised by maximising the following likelihood function:
Equation (3): \[\begin{array}{lcl} \large L(\beta_0, \beta_1, \ldots, \beta_n) & = & \displaystyle \prod_{i:~y(i) = +1} P(Y = +1|X = x(i)) \cdot \prod_{i:~y(i) = -1} P(Y = -1|X = x(i)) \\ & = & \displaystyle \prod_{i:~y(i) = +1} P(Y = +1|X = x(i)) \cdot \prod_{i:~y(i) = -1} \Big ( 1 - P(Y = +1|X = x(i)) \Big ) \end{array}\]
Since functions involving products of terms are mathematically more complex to optimise than functions involving sums of terms, the following alternative function, the so-called log-likelihood, can be used in lieu of \(L\):
Equation (4): \[\large LOG_L(\beta_0, \beta_1, \ldots, \beta_n) := ln \big ( L(\beta_0, \beta_1, \ldots, \beta_n) \big ) = \displaystyle - \sum_{i=1}^{N} ln \Big ( 1 + e^{-y(i) \cdot \hat{f}(x(i))} \Big ) \]
In Eq. (4), “ln()” denotes the natural logarithm. Notice that the logarithm is a strictly increasing (monotone) transformation, which means that it does not affect the relative order of the values of \(L(\beta_0, \beta_1, \ldots, \beta_n)\). In other words, the same coefficients \(\beta_0, \beta_1, \cdots, \beta_n\) that maximise \(L(\beta_0, \beta_1, \ldots, \beta_n)\) in Eq. (3) also maximise \(LOG_L(\beta_0, \beta_1, \ldots, \beta_n)\) in Eq. (4) and, therefore, we can aim at maximising Eq. (4) instead of Eq. (3) for mathematical convenience.
Note: The first “equality” in Eq. (4) is just a definition (“:=”). The second equality follows straightforwardly from Eqs. (2), (3) as well as basic algebraic manipulations involving logarithm properties. The details are omitted here as they are not indispensable for practical purposes.
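To make the likelihood-maximisation idea more concrete, the short sketch below (an illustration only, not how glm() fits the model internally) maximises the log-likelihood of Eq. (4) numerically with the general-purpose optimiser optim() from the stats package, using the toy data set dat from above, whose labels y are coded as \(-1\)/\(+1\); the resulting coefficients should be close to those obtained with glm() above (coef(LR_model)). The helper neg_log_likelihood is defined here just for this illustration.
neg_log_likelihood <- function(beta, x, y){
  f_hat <- beta[1] + beta[2] * x       # linear model f_hat(x) = beta_0 + beta_1 * x
  sum(log(1 + exp(-y * f_hat)))        # negative of the log-likelihood in Eq. (4)
}
fit <- optim(par = c(0, 0), fn = neg_log_likelihood, x = dat$x, y = dat$y)
fit$par                                # estimates of (beta_0, beta_1); compare with coef(LR_model)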
A closer look into Eq. (4) provides a better understanding of the logistic regression model. First, notice that Eq. (4) is given by a sum of \(N\) terms, each of which depends solely on the \(i\)th observation, \((x(i),y(i))\), besides the model parameters. Each such term is called a loss, and is displayed in Fig. 5 as a function of \(y(i) \cdot \hat{f}(x(i))\). Notice that, due to the negative sign in Eq. (4), when this term is positive for a given observation, it is actually deducted as a penalty from \(LOG_L\), which we want to maximise.
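A curve like the one in Fig. 5 can be reproduced with a short sketch such as the following (assuming ggplot2 is already loaded, as above); here m stands for the product \(y(i) \cdot \hat{f}(x(i))\):
m <- seq(from = -5, to = 5, by = 0.1)    # the "margin" y(i) * f_hat(x(i))
loss <- log(1 + exp(-m))                 # logistic loss, ln(1 + e^(-m))
ggplot(data = data.frame(m = m, loss = loss), mapping = aes(x = m, y = loss)) + geom_line()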
Fig. 5. Loss function of a logistic regression classifier
Notice in Fig. 5 that the loss will certainly be positive (i.e., a penalty) when a given training observation is assigned the wrong label according to the classification rule in Eq. (1), i.e., when \(y(i) \cdot \hat{f}(x(i)) < 0\) (either \(y(i) = +1\) and \(\hat{Y} = \hat{f}(x(i)) < 0\), or \(y(i) = -1\) and \(\hat{Y} = \hat{f}(x(i)) > 0\)). Moreover, since the magnitude of \(y(i)\) is always 1, the size of the loss is governed by the magnitude of \(\hat{f}(x(i))\), which can be interpreted as “how far the classification of observation \(x(i)\) is from the decision boundary, \(\hat{f}(x(i)) = 0\)”. The farther an observation is “on the wrong side”, the higher the penalty, thus discouraging misclassifications, especially very confident ones. Interestingly, we can see in Fig. 5 that there is always a penalty no matter the “side”, but the penalty for observations on “the right side” is much smaller and tends asymptotically to zero as observations are classified correctly and more confidently.
One piece is missing in the above interpretation though. Notice that our analysis above is based on the classification rule defined in Eq. (1), but Eq. (1) is the classification rule used by the linear regression classifier. The classification rule for the binary logistic regression classifier should be instead:
Equation (5): \[ Y = y(i) = \left \{ \begin{array}{l} +1 {\rm ~~(positive~~class)~~ if~~} P(Y = +1| X = x) > 0.5 \\ -1 {\rm ~~(negative~~class)~~ if~~} P(Y = +1| X = x) < 0.5 \end{array} \right. \]
It turns out that, under the logistic regression assumptions, in particular Eq. (2), the decision rules in Eq. (1) and Eq. (5) are equivalent to each other. To show this, let’s introduce the concept of odds, which plays an important role in the interpretation of logistic regression classifiers:
Equation (6): \[\large \frac{P(Y = +1|X)}{1 - P(Y = +1|X)} = \frac{\frac{e^{\hat{f}(X)}}{1 + e^{\hat{f}(X)}}}{1 - \frac{e^{\hat{f}(X)}}{1 + e^{\hat{f}(X)}}} = e^{\hat{f}(X)}\]
Notice that the odds for a given observation \(x(i)\) is defined as the ratio between the probability that \(x(i)\) belongs to the positive class and the probability that it does not. Obviously, if the odds is greater (smaller) than 1, then the observation is more likely (less likely) to belong to the positive class. By taking the logarithm of the odds in Eq. (6), we get the log-odds, the so-called logit:
Equation (7): \[\large ln \left ( \frac{P(Y = +1|X)}{1 - P(Y = +1|X)} \right ) = ln \left ( e^{\hat{f}(X)} \right ) = \hat{f}(X) = \beta_0 + \beta_1 \cdot X_1 + \cdots + \beta_n \cdot X_n\]
Recalling that \(ln(v) > 0\) if \(v > 1\), \(ln(v) = 0\) if \(v = 1\), and \(ln(v) < 0\) if \(v < 1\), it follows from the logit function in Eq. (7) that:
Equation (8): \[ \left \{ \begin{array}{l} P(Y = +1| X) > 0.5 {\rm ~~if~~} \hat{Y} = \hat{f}(X) = \beta_0 + \beta_1 \cdot X_1 + \cdots + \beta_n \cdot X_n > 0 \\ P(Y = +1| X) = 0.5 {\rm ~~if~~} \hat{Y} = \hat{f}(X) = \beta_0 + \beta_1 \cdot X_1 + \cdots + \beta_n \cdot X_n = 0 \\ P(Y = +1| X) < 0.5 {\rm ~~if~~} \hat{Y} = \hat{f}(X) = \beta_0 + \beta_1 \cdot X_1 + \cdots + \beta_n \cdot X_n < 0 \\ \end{array} \right. \]
This result establishes an important property of logistic regression classifiers: even though the logistic regression model in Eq. (2) is non-linear, the decision boundary of the resulting classifier is linear (defined by the hyperplane \(\hat{f}(X) = 0\) in the \(n\)-dimensional space of the predictor variables).
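On our toy example, this equivalence can be verified numerically with the fitted LR_model (a quick sanity check, assuming the objects created above are still in the workspace): the sign of the linear predictor \(\hat{f}(x)\) matches the 0.5 probability threshold for every training observation.
logit_dat <- predict(LR_model, type = "link")      # f_hat(x), i.e. the log-odds of Eq. (7)
prob_dat  <- predict(LR_model, type = "response")  # P(Y = +1 | X = x), Eq. (2)
all(sign(logit_dat) == sign(prob_dat - 0.5))       # TRUE (barring ties at exactly 0.5)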
A question that may then arise is why we need the logistic regression model if we have, for instance, Linear Discriminant Analysis (LDA), which also requires numerical predictors and also results in a linear classification boundary. One possible answer is that, unlike LDA, logistic regression does not assume that the observations within each class follow a Normal distribution. If this assumption is not satisfied, LDA may perform poorly. On the other hand, logistic regression may also exhibit shortcomings in certain scenarios, most notably when the classes are well separated, which may cause numerical instability during the optimisation of the logistic regression model (with well-separated classes, many different parameter values yield nearly identical likelihoods, and the coefficient estimates may even grow without bound).
It is important to stress that there is no single classifier that is always the best for all applications. LDA and logistic regression are two different tools in a data scientist's toolbox, each of which demands specific numerical methods for training and allows different types of interpretation of the results, which may be more or less appropriate depending on the particular problem at hand.
Important Remark: it is worth remarking that there are many extensions of the basic logistic regression model discussed above, such as non-binary logistic regression (for problems with more than two classes), logistic regression with categorical predictors, kernel-based non-linear logistic regression, regularised model optimisation (with inclusion of a regularisation term to control the bias-variance trade-off), just to mention a few. More details on logistic regression can be found e.g. in (James et al. 2013) (see Recommended Reading) or in the Wikipedia Entry for this topic.
As a practical example of logistic regression, let’s consider the CRX data set, which is publicly available at the UCI data repository. This data set contains data regarding corporate MasterCard (credit card) applications received by the Commonwealth Bank during 1984. This was a time when credit approval was done manually, and research into automating it was active, aiming to improve the equity and accuracy of the credit approval process.
Data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data",
header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "",
na.strings = "?")
In the public repository above, the variable names have been removed for confidentiality reasons; however, we provide the variable names below, based on the original publication (Carter and Catlett 1986):
names(Data) <- c("Gender", "Age", "MonthlyExpenses", "MaritalStatus", "HomeStatus", "Occupation",
"BankingInstitution", "YearsEmployed", "NoPriorDefault", "Employed", "CreditScore",
"DriversLicense", "AccountType", "MonthlyIncome", "AccountBalance", "Approved")
We will remove records (rows) with missing values and then randomly split the remaining data set into a training set (with 80% of the observations) and a test set (with the remaining 20% of the observations). In addition, for the sake of simplicity, in the current example/exercise we will use only the numerical variables, namely Age, MonthlyExpenses, YearsEmployed, CreditScore, MonthlyIncome, and AccountBalance, as predictors, with Approved (the outcome of a credit card application) as the dependent variable:
Data <- na.omit(Data) # Remove rows with NA values
numeric_predictors <- c("Age","MonthlyExpenses","YearsEmployed","CreditScore","MonthlyIncome","AccountBalance")
set.seed(0)
no_obs <- dim(Data)[1]
test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE) # 20% data records for test
test_predictors <- Data[test_index, numeric_predictors]
test_class_labels <- Data[test_index, "Approved"]
training_index <- -test_index # 80% data records for training
training_dat <- Data[training_index, c(numeric_predictors, "Approved")]
Exercise: Fit a logistic regression model to the training data (training_dat) using function glm() from the stats package, and name the resulting model as LR_CRX. Print a summary of your model with summary(LR_CRX). Then, perform classification of the test data (test_predictors) using function predict.glm(), applying the 0.5 probability threshold to get a binary classification result. Compute classification accuracy.
If the exercise is done properly, your results should look like:
accuracy
## [1] 0.8153846
summary(LR_CRX)
##
## Call:
## glm(formula = Approved ~ ., family = binomial, data = training_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0076 -0.7813 -0.6597 0.8136 1.8662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2829835 0.3472026 -3.695 0.000220 ***
## Age -0.0020246 0.0097876 -0.207 0.836128
## MonthlyExpenses 0.0190405 0.0227419 0.837 0.402455
## YearsEmployed 0.2107331 0.0457894 4.602 4.18e-06 ***
## CreditScore 0.2824140 0.0441814 6.392 1.64e-10 ***
## MonthlyIncome -0.0006172 0.0006577 -0.938 0.347989
## AccountBalance 0.0004331 0.0001145 3.784 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 720.80 on 522 degrees of freedom
## Residual deviance: 533.83 on 516 degrees of freedom
## AIC: 547.83
##
## Number of Fisher Scoring iterations: 7
Note that column Pr(>|z|) of the Coefficients table returns p-values that indicate the significance of each variable for the model. Smaller p-values, indicated by stars (e.g. ***), suggest more relevant variables.
Exercise: Repeat the previous exercise using precisely the same training and test data sets, but now using only variables YearsEmployed, CreditScore, and AccountBalance as predictors in the logistic regression model. Compare the results.
K-Nearest Neighbours (KNN) adopts a classification strategy that is quite peculiar in that, unlike LDA, QDA, logistic regression, and other classifiers based on some sort of explicit parametric model, it uses a non-parametric approach to directly estimate the probabilities \(P(Y = y | X = x)\) from data. No explicit model is derived. Rather, the training database needs to be queried whenever a new observation needs to be classified.
The KNN strategy is very simple and intuitive. It makes the loose assumption that similar observations tend to belong to the same class. Under this assumption, given an observation \(x\) to be classified, we can look at observations “nearby” \(x\) in the feature space (i.e. the space of predictor variables) and find the predominant class. From a probabilistic perspective, if out of \(k\) observations nearby \(x\) there are \(N_y\) observations from a given class \(y\), then we estimate \(P(Y = y | X = x)\) as \(N_y/k\). Therefore, the predominant class “nearby” \(x\) is the class for which \(P(Y = y | X = x) = N_y/k\) is maximum. This is the class \(x\) is most likely to belong to based on this simple, frequentist probability estimate.
In order to define “nearby”, KNN needs a (dis)similarity measure, which is used to compute the proximity of the observation \(x\) in question to all the observations in the training database. Common (dis)similarity measures in practice are, for instance, Euclidean distance, Pearson correlation, and Cosine similarity, just to mention a few. The most appropriate measure strongly depends on the application scenario though.
The (dis)similarity measure allows KNN to select the \(k\) observations from the database that are the closest (most similar) to \(x\). This is a common operation in database systems, called a KNN query. Once the neighbours have been retrieved, we can count the frequency \(N_y\) for each class \(y\), and then compute \(P(Y = y | X = x) = N_y/k\). The target observation, \(x\), is usually assigned to the most likely class, which corresponds to the majority vote. For instance, in a binary classification setting with \(k = 5\), if four out of the five nearest neighbours of a given observation \(x\) belong to the positive class, whereas one belongs to the negative class, then \(P(Y = +| X = x) = 4/5\), \(P(Y = -| X = x) = 1/5\), and \(x\) is then assigned to the positive class.
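For concreteness, the following sketch performs this computation “by hand” on the toy data set dat from the logistic regression example above (the query value x0 and the choice k = 5 are arbitrary and used only for illustration):
x0 <- 50                            # hypothetical query value for the single predictor x
k  <- 5
d  <- abs(dat$x - x0)               # Euclidean distance in one dimension
nn <- order(d)[1:k]                 # indices of the k nearest neighbours of x0
table(dat$y[nn]) / k                # estimated class probabilities, N_y / k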
Notice that \(k\) is a user-defined parameter, which ultimately allows the analyst to control the bias-variance trade-off. When \(k\) is small, the classification decision will be based on very few observations, which tend to lie in the vicinity of the query observation \(x\), so the classifier will be flexible enough to discriminate between different classes even around the class boundaries when these boundaries are highly non-linear. Flexibility, however, comes at a price: the classifier is more prone to overfitting. Increasing \(k\) reduces the risk of overfitting, as our probability estimates are based on a larger sample. However, by increasing \(k\) the classifier will rely on observations farther from \(x\), becoming less “local” and, therefore, less flexible. In other words, we tend to reduce variance at the risk of increasing bias. In the extreme, if we set \(k = N\) (the size of the data set), any new observation would be assigned the label of the majority class in the data.
As an illustrative example, let’s consider once again data set iris from base R (datasets package). We have limited the data set to two dimensions for the sake of visualisation. We focus on a particular observation, \(x = (2.5, 1.5)\), highlighted with an “x” mark at the center of a circle in Fig. 6. We vary the radius of the circle to indicate the neighbourhood of \(x\) for \(k\) equal to 1, 3, 5, and 7 (with Euclidean distance). Notice that, for \(k = 1\) (figure at the top), the only neighbour is blue (class virginica), so \(P(Y = blue | X = x) = 1/1 = 1\), whereas for the other two classes it follows that \(P(Y = green | X = x) = P(Y = red | X = x) = 0/1 = 0\), so the observation would be classified as blue if its class label was unknown. Increasing \(k\) to 3 (second figure) doesn’t change this decision, even though classification is now performed with less confidence, namely, as \(P(Y = blue | X = x) = 2/3\), \(P(Y = green | X = x) = 1/3\), and \(P(Y = red | X = x) = 0/3 = 0\).
Increasing \(k\) even further, to \(k=5\) or \(k=7\), makes the green class (versicolor, which is the right label in this particular case) prevail within \(x\)’s neighbourhood. For \(k = 5\) (third figure from top to bottom), it follows that \(P(Y = green | X = x) = 3/5 = 0.6\), \(P(Y = blue | X = x) = 2/5 = 0.4\), and \(P(Y = red | X = x) = 0/5 = 0\). As for \(k = 7\), there is a catch. Two observations (indicated by the black arrow at the bottom figure) are superposed, as they have exactly the same values for variables Sepal.Width and Petal.Width, so they look like only one green point in the figure. These two observations are tied with a third observation (all lying on the golden circle) as the 6th/7th/8th nearest neighbour of \(x\).
Fig. 6. KNN Example (k = 1, 3, 5, 7): iris dataset
When there are ties in the distance from \(x\) to its \(k\)th and (\(k+1\))th, (\(k+2\))th, … nearest neighbours, as in the above example for \(k=7\) and \(k+1=8\), the usual approach is to include all the tied observations in the voting, which in practice is equivalent to temporarily expanding \(k\) in order to accommodate any further observation(s) at the same distance from \(x\). In our example above, this means that the probabilities would be computed as \(P(Y = green | X = x) = 6/8 = 0.75\), \(P(Y = blue | X = x) = 2/8 = 0.25\), and \(P(Y = red | X = x) = 0/8 = 0\).
In the following, we show how to perform the above experiment in R. We will use function knn() from package class as an off-the-shelf implementation of the KNN classifier with Euclidean distance. Before we proceed, however, we need to prepare our data in a suitable way, by: (a) keeping only the predictor variables Sepal.Width and Petal.Width shown in Fig. 6; and (b) putting the target observation aside from the training data as a test case.
We start by removing the undesired predictors, separating the remaining ones from the dependent (i.e. class) variable:
new_iris_predictors <- iris[, c("Sepal.Width", "Petal.Width")]
new_iris_class <- iris[, "Species"]
In order to remove the target observation \(x = (2.5,1.5)\) from the training data, we first need to find it:
( target <- which((iris$Sepal.Width==2.5)&(iris$Petal.Width==1.5)) )
## [1] 73
Once we have found it (observation 73), we can remove it from the training data set:
new_iris_test <- new_iris_predictors[target,]
new_iris_predictors <- new_iris_predictors[-target,]
new_iris_class <- new_iris_class[-target]
We can then perform KNN classification of the test observation, varying the parameter \(k\) as desired. Notice that option prob = TRUE makes knn() additionally return the proportion of votes for the winning class, whereas prob = FALSE would return only the corresponding class label:
library(class)
(knn(train = new_iris_predictors, test = new_iris_test, cl = new_iris_class, k = 1, prob = TRUE))
## [1] virginica
## attr(,"prob")
## [1] 1
## Levels: setosa versicolor virginica
(knn(train = new_iris_predictors, test = new_iris_test, cl = new_iris_class, k = 3, prob = TRUE))
## [1] virginica
## attr(,"prob")
## [1] 0.6666667
## Levels: setosa versicolor virginica
(knn(train = new_iris_predictors, test = new_iris_test, cl = new_iris_class, k = 5, prob = TRUE))
## [1] versicolor
## attr(,"prob")
## [1] 0.6
## Levels: setosa versicolor virginica
(knn(train = new_iris_predictors, test = new_iris_test, cl = new_iris_class, k = 7, prob = TRUE))
## [1] versicolor
## attr(,"prob")
## [1] 0.75
## Levels: setosa versicolor virginica
Note: If needed, you can use command attr( , "prob") to extract from the composed result only the vote proportion component, e.g.:
pred_value <- knn(train = new_iris_predictors, test = new_iris_test, cl = new_iris_class, k = 7, prob = TRUE)
attr(pred_value, "prob")
## [1] 0.75
Exercise: Compute the Euclidean distance from \(x = (2.5,1.5)\) to all other observations in the two-dimensional space of Fig. 6, then sort them in ascending order, showing that the distance from \(x\) to its 6th, 7th, and 8th nearest neighbours are the same. Hint: you can use dist() from the stats package and order() from base R to make things easier.
Let’s consider once again the data set iris from base R (datasets package), but now in its original form, with all its four predictors. We will randomly split the data set into a training set (with 80% of the observations) and a test set (with the remaining 20% of the observations):
set.seed(0)
no_obs <- dim(iris)[1] # No. of observations (150)
test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE) # 20% data records for test
test_predictors <- iris[test_index, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
test_class <- iris[test_index, "Species"]
training_index <- -test_index # 80% data records for training
training_predictors <- iris[training_index, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
training_class <- iris[training_index, "Species"]
Then, we will use the training set and a KNN classifier with \(k = 1\) to classify the test set:
Pred_class <- knn(train=training_predictors, test=test_predictors, cl=training_class, k=1)
(cont_tab <- table(Pred_class, test_class))
## test_class
## Pred_class setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 9 1
## virginica 0 0 8
(accuracy <- sum(diag(cont_tab))/sum(cont_tab))
## [1] 0.9666667
We can see that only one out of 30 observations was misclassified, which means an accuracy of about 96%. However, notice that this may change if we select a different random split of the data into 80% for training and 20% for test.
Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets.
Exercise: In practice, \(k\) is usually selected experimentally. Repeat the previous exercise varying \(k\) from 1 to 20 and plot a curve of mean accuracy (in the y axis) as a function of \(k\) (x axis). Use set.seed(0) in the beginning of your code for the sake of reproducibility of the results. What is the best choice of \(k\) according to this experiment?
Important Remarks: The KNN classifier is simple and intuitive, and it usually provides very competitive results when a suitable (dis)similarity measure is chosen and \(k\) is properly set, especially in data sets of low to moderate dimensionality. In high-dimensional spaces, the notion of a local neighbourhood becomes less meaningful and the performance of KNN tends to degrade. In addition, KNN tends to be particularly sensitive to the presence of irrelevant (noise) predictors, as they can significantly affect (dis)similarity and, therefore, neighbourhood computations. Unlike LDA, QDA and logistic regression, KNN does not readily provide information about which predictors are more or less relevant. Notice also that numerical predictors ranging within wider intervals will tend to dominate neighbourhood computations based on Euclidean and other related distance measures, so variable rescaling (e.g. within \([0,1]\)) is required.
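As a brief illustration of the rescaling point above (a sketch; rescale01 is a helper defined here only for illustration), predictors can be standardised with scale() from base R or mapped to \([0,1]\) column by column before any distances are computed:
# z-score standardisation (mean 0, standard deviation 1)
scaled_predictors <- scale(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
# alternatively, min-max rescaling to [0, 1]
rescale01 <- function(v){(v - min(v)) / (max(v) - min(v))}
predictors01 <- apply(iris[, 1:4], 2, rescale01)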
KNN can also be applied for regression, i.e., to model a numerical (rather than categorical) dependent variable \(Y\). The change required in the previously described KNN classifier is minimal. Specifically, once the \(k\) nearest neighbours of the target observation are found, instead of class label proportions we compute the average of the dependent variable values across the neighbours. For instance, the predicted value for an observation \(x\) whose \(k=3\) nearest neighbours have values \(Y = -1\), \(Y = 1\), and \(Y = 2\) is \(\hat{Y} = (-1 + 1 + 2)/3 = 2/3\). Further details and illustrative examples can be found, e.g., in Section 3.5 of (James et al. 2013) (see Recommended Reading below).
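The sketch below illustrates this idea with a tiny helper function (knn_reg is written here only for illustration; it is not a function from package class), using a hypothetical one-dimensional training set whose three observations closest to the query reproduce the arithmetic of the example above:
knn_reg <- function(train_x, train_y, x0, k){
  nn <- order(abs(train_x - x0))[1:k]   # indices of the k nearest neighbours of x0
  mean(train_y[nn])                     # prediction: average of the neighbours' Y values
}
train_x <- c(1.0, 2.2, 2.8, 7.0, 9.0)      # hypothetical predictor values
train_y <- c(-1, 1, 2, 10, 12)             # hypothetical dependent-variable values
knn_reg(train_x, train_y, x0 = 2, k = 3)   # (-1 + 1 + 2)/3 = 2/3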
In order to assess the performance of classifiers on data that has not been seen during training, the procedure adopted so far has been to randomly split the data set into a training and a test subset. This procedure is called holdout, and its main disadvantage is that the assessment outcome may strongly depend on the particular random choice of observations that go either to the training or to the test subset. To mitigate this problem, we have performed holdout multiple times, taking the average of the individual results (mean accuracy). This procedure is called random subsampling. By averaging individual results, random subsampling is expected to reduce variance and, therefore, provide a more stable and reliable estimate of the classifier's performance on unseen data. However, it still has a noticeable shortcoming: there is no control over how many times a given observation is used for training or testing. For instance, some observations may by chance end up excluded from all test subsets, while some others may be used for testing multiple times.
An alternative to random subsampling is cross-validation. There are many variants of this technique, but the most common is by far K-fold cross-validation. The idea is very simple. We divide the data set into \(K\) disjoint, equal-sized subsets (called folds), each of which contains \(N/K\) observations (or as close to this as possible), where \(N\) is the data set size. Typically, the folds are randomly populated using a stratified sampling procedure that ensures that the original class proportions in the data set are approximately preserved within each fold, when possible. Sampling is performed without replacement, so the resulting folds form a partition of the data set (their union yields the data set, and their pairwise intersections are all empty). As an example, let’s consider again the toy data set used above to illustrate logistic regression:
set.seed(0)
x0_10 <- runif(10, min=0, max=60); x11_20 <- runif(10, min=40, max=100)
y0_10 <- rep(-1,10); y11_20 <- rep(+1,10)
( dat <- data.frame(x=c(x0_10,x11_20), y=c(y0_10,y11_20)) )
## x y
## 1 53.80183 -1
## 2 15.93052 -1
## 3 22.32743 -1
## 4 34.37120 -1
## 5 54.49247 -1
## 6 12.10092 -1
## 7 53.90338 -1
## 8 56.68052 -1
## 9 39.64787 -1
## 10 37.74684 -1
## 11 43.70718 1
## 12 52.35847 1
## 13 50.59341 1
## 14 81.22137 1
## 15 63.04622 1
## 16 86.19049 1
## 17 69.86195 1
## 18 83.05711 1
## 19 99.51437 1
## 20 62.80211 1
A valid \(K\)-fold stratified subdivision of this data set with \(K=10\) could be \([1, 11]\), \([2, 12]\), \([3, 13]\), \([4, 14]\), \([5, 15]\), \([6, 16]\), \([7, 17]\), \([8, 18]\), \([9, 19]\), and \([10, 20]\), as there are \(K=10\) subsets, each of which has \(N/K = 20/10 = 2\) observations, half of which are positive (and, accordingly, the other half are obviously negative). In addition, every observation in the data set appears exactly once, in one of these folds.
Once the \(K\) folds have been built, the \(K\)-fold cross-validation procedure is straightforward: for each fold \(k = 1, \cdots, K\), the classifier is trained using the observations in the other \(K-1\) folds and then evaluated on the observations in fold \(k\) (which play the role of the test set); the \(K\) individual performance results (e.g. accuracies) are finally averaged to produce the cross-validated estimate. This way, every observation is used for testing exactly once and for training exactly \(K-1\) times.
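The sketch below illustrates the whole procedure on the toy data set dat, reproducing the stratified folds described above and using a logistic regression classifier with the 0.5 probability threshold (a minimal illustration; dedicated packages offer ready-made cross-validation routines):
K <- 10
folds <- lapply(1:K, function(k) c(k, k + 10))   # stratified folds [1,11], [2,12], ..., [10,20]
accuracies <- numeric(K)
for (k in 1:K){
  test_idx  <- folds[[k]]
  train_dat <- dat[-test_idx, ]                  # train on the other K-1 folds
  test_dat  <- dat[test_idx, ]                   # test on fold k
  model <- glm(factor(y) ~ x, data = train_dat, family = binomial)
  p_hat <- predict(model, newdata = test_dat, type = "response")   # P(Y = +1 | X = x)
  y_hat <- ifelse(p_hat > 0.5, 1, -1)
  accuracies[k] <- mean(y_hat == test_dat$y)
}
mean(accuracies)                                 # K-fold cross-validated accuracy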
Two of the most common choices for \(K\) are \(K=10\) and \(K=N\). The former results in the well-known \(10\)-fold cross-validation procedure, which is widely adopted for classification assessment. The latter results in the leave-one-out cross-validation (LOOCV), which is also broadly used in practice.
In R, there are packages with various cross-validation functions available. Certain packages provide built-in implementations of cross-validation as part of specific classifiers. For instance, function knn.cv() from package class performs a built-in (LOOCV) cross-validated KNN classification procedure, as shown below for the iris data set:
predictors <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
class_labels <- iris[, "Species"]
Pred_class <- knn.cv(train=predictors, cl=class_labels, k=1)
(cont_tab <- table(Pred_class, class_labels))
## class_labels
## Pred_class setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
(accuracy <- sum(diag(cont_tab))/sum(cont_tab))
## [1] 0.96
We can see above that, in the LOOCV setup, where each observation is set aside for test individually, one at a time, there are 6 misclassifications in total (out of 150) with the 1NN classifier.
Exercise: Repeat the previous exercise varying parameter \(k\) of the KNN classifier from 1 to 20, and plot a curve of LOOCV accuracy (in the y axis) as a function of \(k\) (x axis). What is the best choice(s) of \(k\) according to this experiment?
It is worth noticing that accuracy is not the only measure that can be evaluated in a cross-validation setup. In fact, cross-validation is a rather generic procedure that can be coupled with different measures of classification performance. A very popular classification performance measure is the Area Under the Receiver Operating Characteristic Curve, or AUC-ROC. In order to explain the AUC-ROC measure, we first need to understand what the Receiver Operating Characteristic (ROC) curve is.
In principle, ROC analysis applies to binary classification problems and is asymmetric in that it focuses on one of the classes, called the positive class (+), as the class of interest. In health applications, for instance, the positive class is commonly associated with samples testing positive in a certain exam, usually indicating a positive diagnosis for a certain disease, whereas the negative class (\(-\)) is typically the “normal” or “healthy” class.
The classic ROC curve is a two-dimensional graph with the rate of false positives (RFP) in the x-axis and the rate of true positives (RTP) in the y-axis. In order to explain these quantities, let \(N_{+}\) and \(N_{-}\) be the number of observations from class positive and negative, respectively (\(N_{+} + N_{-} = N\)). Let \(N_{TP}\) be the number of true positives, which is the number of positive observations correctly classified as positive. Finally, let \(N_{FP}\) be the number of false positives, which is the number of negative observations misclassified as positive. The rate of false positives is the fraction of negative observations that are misclassified, i.e., \(RFP = N_{FP}/N_{-}\). The rate of true positives (so-called recall) is the fraction of positive observations that are correctly classified, i.e., \(RTP = N_{TP}/N_{+}\).
It should be clear that a perfect classifier, which makes no mistakes, will be such that \(N_{FP} = 0\) and \(N_{TP} = N_{+}\), which results in \(RFP = N_{FP}/N_{-} = 0\) and \(RTP = N_{TP}/N_{+} = 1\). Therefore, the ideal classifier is represented as the point \((0,1)\) at the top left corner of the unit square in the \(RFP \times RTP\) plot. Any other classifier will be a point somewhere else in the unit square of \(RFP \times RTP\), and the better the classifier, the closer it should be to the top left corner (low RFP and high RTP).
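To make these definitions concrete, the short sketch below computes RTP and RFP from a handful of hypothetical probabilities and true labels (truth and p_hat are made-up vectors used only for illustration), applying the usual 0.5 threshold:
truth <- c("+", "+", "+", "-", "-", "-")         # hypothetical true labels
p_hat <- c(0.9, 0.8, 0.4, 0.7, 0.2, 0.1)         # hypothetical P(Y = + | X = x)
pred  <- ifelse(p_hat > 0.5, "+", "-")           # binary classification at threshold 0.5
( RTP <- sum(pred == "+" & truth == "+") / sum(truth == "+") )   # N_TP / N_+  (here 2/3)
( RFP <- sum(pred == "+" & truth == "-") / sum(truth == "-") )   # N_FP / N_-  (here 1/3)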
Exercise: What are the values of RFP and RTP for a “classifier” that assigns every observation to the positive class, and where is this classifier positioned in the unit square of \(RFP \times RTP\)?
Exercise: Repeat the previous exercise but now considering a “classifier” that assigns every observation to the negative class. Justify your answers.
But if a classifier is a point in the \(RFP \times RTP\) plot, what is the ROC curve about? To answer this question, we should first recall that all classifiers we have studied so far, namely Naive Bayes, LDA, QDA, Logistic Regression, and KNN, provide a probability \(P(Y = +| X = x)\) associated with every observation \(x\) that we want to classify, rather than just a binary assignment of \(x\) to either the positive or the negative class. So far, we have obtained a binary assignment by applying a 0.5 threshold to this probability, such that \(x\) is classified as positive if \(P(Y = +| X = x) > 0.5\) and as negative otherwise. But this threshold is arbitrary and can be changed as desired. For instance, in a medical diagnosis application, a false negative can have disastrous consequences for a patient, so we can be conservative and use a threshold of, e.g., 0.2 instead of 0.5, such that \(x\) is flagged as positive (or risky, to be further investigated) if \(P(Y = +| X = x) > 0.2\), and as negative otherwise.
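As a small sketch of this idea (assuming the probabilities P_positive_test and labels test_class_labels from the earlier CRX exercise are still in the workspace), lowering the threshold to 0.2 is a one-line change:
pred_02 <- ifelse(P_positive_test > 0.2, "+", "-")   # more conservative 0.2 threshold
table(pred_02, test_class_labels)                    # typically more true and false positives than with 0.5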
Notice that by varying the probability threshold we produce different binary classifications of the same data set from the probability estimates produced by a given classifier. For every threshold value we get a binary classification, i.e., a point in the \(RFP \times RTP\) plot. By varying the threshold from 0 to 1 we move from the extreme case in which all observations are classified as positive to the opposite extreme in which all observations are classified as negative. These extreme points and the collection of all intermediate points form a curve, called the ROC curve.
Before we proceed with further discussions on the interpretation of ROC curves, let’s see an example. We consider once again our logistic regression model, LR_CRX, of the CRX data set. Recall that we computed the probabilities \(P(Y = +| X = x)\) for a test subset of the data and stored them in P_positive_test. The class labels for the same test subset were stored in test_class_labels. That is all we need to compute a ROC curve for our logistic regression classifier.
In R, we can use e.g. package ROCR to compute ROC as well as other related curves and classification performance measures. To compute the conventional ROC curve described above, we can use the following auxiliary function, adapted from (James et al. 2013):
library(ROCR)
rocplot <- function(pred, truth){
predobj <- prediction(pred, truth)
ROC <- performance(predobj, "tpr", "fpr")
plot(ROC) # Plot the ROC Curve
auc <- performance(predobj, measure = "auc")
auc <- auc@y.values[[1]]
return(auc) # Return the Area Under the Curve ROC
}
All we need to do is to call this function passing as arguments the predicted probabilities and the ground truth class labels:
AUC <- rocplot(pred=P_positive_test, truth=test_class_labels)
Fig. 7. ROC Curve example: logistic regression classifier for the CRX data set
Function rocplot() plots the ROC curve (Fig. 7) and returns the Area Under the Curve, AUC-ROC, which is a measure of classification performance.
AUC
## [1] 0.9164272
We can see that the AUC-ROC of our logistic regression classifier for a test subset of the CRX data set is around 0.91. It is straightforward to show (do some research on your own and discuss with your peers) that the ROC curve for the ideal classifier, which correctly classifies all observations, goes straight up vertically from \((0,0)\) to \((0,1)\), then horizontally from \((0,1)\) to \((1,1)\), and therefore its AUC is 1, which is the maximum possible value. It is also straightforward to show that a purely random classifier, which just “flips a coin” to perform classification, is expected to have a ROC curve that follows a diagonal straight line from \((0,0)\) to \((1,1)\), hence the expected AUC of such a classifier is 0.5. Values of AUC around 0.5 indicate performance similar to random, whereas values below 0.5 indicate performance even worse than random. The closer to AUC = 1, the better the classifier.
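As a quick empirical check of the last claim (a sketch; the seed is arbitrary and the CRX test labels are reused only for convenience), feeding rocplot() purely random scores should yield an AUC close to 0.5:
set.seed(1)
random_scores <- runif(length(test_class_labels))          # a "classifier" that ignores the predictors
rocplot(pred = random_scores, truth = test_class_labels)   # expected to be close to 0.5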
Exercise: In one of our previous exercises, we asked you to assess a KNN classifier using LOOCV, varying parameter \(k\) of the classifier from 1 to 20, and then plot a curve of LOOCV accuracy (in the y axis) as a function of \(k\) (x axis). Repeat that same experiment, but now: 1. replacing accuracy with AUC; and 2. replacing iris with the CRX data set (use only variables YearsEmployed, CreditScore, and AccountBalance as predictors). Important Notes: (a) you will need to retrieve from knn.cv() the KNN probabilities (majority vote proportions, enabled by prob=TRUE), rather than binary classifications only, but you also need the binary classifications returned from knn.cv() to check whether the majority vote refers to class positive (+) or negative (\(-\)). When it refers to class negative, you need to replace it with its complement (recall that \(P(Y = + | X) = 1 - P(Y = - | X)\)); (b) the predictors are in completely different scales, so they need to be standardised before KNN can be applied. Use function scale() from base R to standardise the predictors so they all have mean = 0 and standard deviation = 1; (c) you can optionally comment line plot(ROC) of function rocplot() to prevent a ROC curve from being displayed for each one of the 20 KNN classifiers.
Final Note: Further details and examples on cross-validation and related classification assessment procedures can be found, e.g., in Section 5.1 of (James et al. 2013) (see Recommended Reading below) or in the Wikipedia Entry for this topic.
Utilising R/RStudio, reproduce the experimental procedures and results described in Section 4.6 of (James et al. 2013), “Lab: Logistic Regression, LDA, QDA, and KNN”, involving the Stock Market Data, Smarket, and the Caravan Insurance Data, Caravan, both from the ISLR package.
Complete the week 3 online quiz in Blackboard Learn Ultra.
Carter, C., and J. Catlett. 1986. “Using Machine Learning to Assess Credit Card Applications — Technical Report 288.” Dept. of Computer Science, The University of Sydney, Australia.
James, G., D. Witten, Hastie T., and R. Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer.