Recall that in supervised learning one wants to learn a model that describes a certain numerical or categorical variable as a function of other variable(s). Typically, the variable we want to describe and possibly predict is called the dependent or output variable, \(Y\), whereas the variable(s) used for the prediction are called independent or input variable(s), \(X = \{X_{1}, \cdots, X_{n}\}\), so-called predictor(s). It is presumed that the dependent variable can be described as a function \(f\) of the predictor(s), i.e., \(Y = f(X) + \epsilon\), except for a component \(\epsilon\) that depends on unobserved variables. The goal is to learn the mapping \(f(X)\) from data, using past observations \((X,Y)\) as a “teacher”. Since the model can only depend on the observed variable(s), \(X\), our data-driven learning problem reduces to finding an \(\hat{f}\) that is as close to \(f\) as possible.
Notice that, even if an ideal model is learnt, i.e., \(\hat{f} = f\), any prediction made using this model will still have an error \(e = Y - \hat{Y} = f(X) + \epsilon - \hat{f}(X) = f(X) + \epsilon - f(X) = \epsilon\), which depends on unobserved variables and, therefore, cannot be eliminated. So, the goal is to minimize the reducible component of the error, \(f(X) - \hat{f}(X)\). However, since \(f(X)\) and \(\epsilon\) are unknown, we cannot separate them from each other prior to learning the model from observations \((X,Y)\). As a consequence, our model may try to reduce the irreducible component of the error during the learning process, or in other words, the model may be prone to “learning noise”.
Although the component \(\epsilon\) cannot be truly learnt from \(X\) as it does not depend on \(X\), if the model in hand is flexible enough it may be possible to learn a mapping \(\hat{f}(X)\) that perfectly maps each observation \(X\) to its corresponding \(Y\) in a finite set of observations, i.e., an exact model with null error for a training data set. As tempting as this idea may seem at first glance, such a model has a fundamental problem: by definition, \(e = Y - \hat{Y} = f(X) + \epsilon - \hat{f}(X)\) and, by assumption, \(e = 0\) for the training data, which implies \(\hat{f}(X) = f(X) + \epsilon\) and, therefore, \(\hat{f}(X) \neq f(X)\) whenever \(\epsilon \neq 0\); i.e., the model cannot be an accurate representation of \(f(X)\) unless the observation error \(\epsilon\) is negligible. Putting it differently, the model may fit a particular sample of data accurately, but not necessarily the true system that one wants to describe. In this case, we say that such a model is overfitted. Overfitted models tend not to describe accurately data other than the sample used for training.
It is worth noticing that, when we are learning a model from a finite data set (always the case in practical data-driven learning), overfitting can occur even if \(\epsilon = 0\), because there may exist an \(\hat{f} \neq f\) for which \(\hat{Y} = Y\) for a particular set of values of \(X\). As an example, assume a regression problem in which the true system we want to learn is given by \(Y = f(X) = X\), and we have a data set with two observations \((X,Y)\), namely, \((0,0)\) and \((1,1)\), which should suffice to uniquely determine the straight line \(Y = X\). However, in practice we may not know that the ideal system is linear, so we could come up with a quadratic model \(\hat{Y} = X^2\) instead, which would perfectly fit the two observations available for training, but is very different from the real system in general. Notice that by replacing observation \((1,1)\) with another observation drawn from the real (linear) system, namely \((2,2)\), we can still fit an exact quadratic model to the new training set ((0,0) and (2,2)) as \(\hat{Y} = \frac{1}{2}X^2\). However, the new model differs from the previous one. This simple example illustrates a well-known aspect of overfitting: the more overfitted the model, the more it tends to change when trained from different samples of data, which is undesired as the system or phenomenon we want to learn is the same irrespective of any particular data sample. This variability is related to the variance aspect of the so-called bias-variance trade-off.
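To make this concrete, here is a minimal R sketch of the two-point example above (the two model forms are the hand-picked quadratics discussed in the text, not the result of any fitting routine):

f_hat_1 <- function(x) x^2        # exact fit to the training set {(0,0), (1,1)}
f_hat_2 <- function(x) 0.5 * x^2  # exact fit to the training set {(0,0), (2,2)}
x_new <- 3                        # a point not seen in either training set
c(true = x_new, model_1 = f_hat_1(x_new), model_2 = f_hat_2(x_new))  # 3 vs 9 vs 4.5

The two exact quadratic models agree perfectly with their respective training points, yet produce very different predictions at \(x = 3\), where the true (linear) system gives 3.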
In general, the more flexible the model and the smaller the training data set, the higher the risk of overfitting and the higher the variance tends to be. Since in most practical applications we cannot indefinitely increase the size of the training sample for a multitude of different reasons, the remedy to variance is to reduce the flexibility of the model, so it cannot adjust itself too much to the specifics of any particular sample. In our pedagogical problem above, if we force our model to be linear, we will always get the very same model when fitting it to any two points drawn from the real linear system. In contrast, notice that there are countless, much more flexible non-linear models that could perfectly fit the same observations while being very different from one another. Two such models would produce highly variable predictions for observations not contained in the training data set.
In our simple example above, we don’t pay any price for reducing variance by forcing the model to be linear, because the system to be modeled is linear. In practice, however, the system may contain non-linearities, which would not be captured by a linear model. When a model \(\hat{f}\) is less flexible than required to represent the phenomenon of interest, \(f\), the reducible component of the modeling error, \(f(X) - \hat{f}(X)\), cannot be eliminated and incurs a modeling bias. In this case, we say that the model is underfitted.
In order to mitigate underfitting and reduce the modeling bias, more flexible models are required. However, as previously discussed, more flexible models are more prone to overfitting and tend to increase variance. This is the well-known bias-variance trade-off.
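For reference, this trade-off is often summarized by the standard decomposition of the expected test error at a given point \(x_0\) (see Chapter 2 of (James et al. 2013)):

\[E\left[(Y - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\epsilon)\]

Flexible models tend to reduce the squared bias term at the cost of a larger variance term, and vice-versa, while the last term is the irreducible error discussed above.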
The above discussion of the bias-variance trade-off is informal and far from rigorous, provided with the sole intent of explaining the basic intuition behind this concept, which plays a very important role in statistical learning and data mining. A more formal yet simple and elegant discussion of the bias-variance trade-off, containing practical examples and illustrations, is provided in Chapter 2 of (James et al. 2013). Students are required to read it (see recommended reading below).
The bias-variance trade-off applies to supervised learning in general, most noticeably regression and classification. Recall that in classification we want to model a categorical dependent variable \(Y\) as a function of the predictor(s) \(X\). The dependent variable takes on a finite number of nominal values, so-called class labels, each of which represents one of the classes to which data observations can belong.
The so-called Bayes Rule of Classification is a classification strategy that seeks to assign each observation to the most likely class, given its predictor(s) values. In other words, given particular values \(x\) for the predictor(s), i.e. \(X = x\), we want to find the particular class label \(y\) for which the probability that the observation in question belongs to that class is the highest. Mathematically, we want to find \(Y = y\) such that \(P(Y = y | X = x)\) is maximized. \(P(Y = y | X = x)\) is the conditional probability that \(Y\) belongs to class \(y\) given that the values for the predictor(s) \(X\) are known to be \(x\). The classification problem is how to estimate these (unknown) probabilities.
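In symbols, the Bayes rule described above assigns an observation with predictor values \(x\) to the class

\[\hat{y} = \underset{y}{\arg\max}\; P(Y = y | X = x).\]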
There are two major strategies to estimate \(P(Y = y | X = x)\). The first strategy is to estimate these probabilities from data directly. Two classic classifiers based on the direct approach are the k-nearest neighbours (kNN) and the logistic regression classifiers, which will be discussed in subsequent modules of this subject. The second approach is based on the use of the Bayes Theorem to estimate \(P(Y = y | X = x)\) in an indirect rather than direct way. Three classic classifiers based on this approach are the Naive Bayes classifier, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA), which will be discussed in the following. Before we proceed, however, we will first briefly review the Bayes Theorem.
In its simplest form, the Bayes Theorem essentially states that:
Equation (1): \(\large P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}\)
Eq. (1) says that the conditional probability of a certain event \(Y\) occurring given that another event \(X\) has occurred, \(P(Y|X)\), can be written as a function of the conditional probability of \(X\) given \(Y\), \(P(X|Y)\), as well as the unconditional probabilities \(P(X)\) and \(P(Y)\) (so-called priors).
As an example, let’s consider a box with \(N\) = 100 marbles, each of which has two painted faces (possibly, but not necessarily, in the same colour). We know that there are, say, 40 marbles with at least one face painted in red, 20 marbles with at least one face painted in blue, and 5 marbles with one face in red and the other one in blue. So, the probability that we randomly pick a marble from the box that has a face painted in red is \(P(X) = 40/100 = 0.4\), and the probability that such a randomly picked marble has a blue face is \(P(Y) = 20/100 = 0.2\). Now let’s suppose that we blindly draw a marble from the box and look at one of its painted faces only, which turns out to be red. What is the probability that the other face is blue?
To answer this question, we can use the Bayes theorem above. In fact, notice that we can easily determine the probability that one of the faces is red given that the other face is blue as \(P(X|Y) = 5/20 = 0.25\), since we know that 5 out of the 20 marbles with a blue face have the other face red. Using the Bayes theorem, the probability we want to compute in order to answer our question above is given by \(P(Y|X) = (0.25*0.2)/0.4 = 0.125\). Notice that, before looking at one of the faces (which turned out to be red), the prior probability that the picked marble would be red and blue was only \(P(X \& Y) = 5/100 = 0.05\). Given the evidence that one of the faces was certainly red, the probability of observing a red and blue marble increased from \(0.05\) to \(0.125\).
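As a quick sanity check, this arithmetic can be reproduced in R (the variable names below are arbitrary):

P_X <- 40/100         # P(at least one red face)
P_Y <- 20/100         # P(at least one blue face)
P_X_given_Y <- 5/20   # P(red face | blue face)
(P_Y_given_X <- P_X_given_Y * P_Y / P_X)  # Bayes theorem: 0.25*0.2/0.4 = 0.125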
In the above example, the evidence consists of a single random variable, but the Bayes theorem also applies in case of a multivariate \(X\), i.e., \(X = \{X_{1}, \cdots, X_{n} \}\). In this case, the probability of \(Y\) given that multiple events \(X_1\) and \(X_2\) and … \(X_n\) have all occurred is given by:
Equation (2): \(\large P(Y|X_1 \& \cdots \& X_n) = \frac{P(X_1 \& \cdots \& X_n|Y)P(Y)}{P(X_1 \& \cdots \& X_n)}\)
The basic rationale behind the Naive Bayes classifier is to estimate the probabilities \(P(X_1 \& \cdots \& X_n|Y)\), \(P(Y)\) and \(P(X_1 \& \cdots \& X_n)\) from data, and then apply the Bayes theorem in Eq. (2) to estimate \(P(Y|X_1 \& \cdots \& X_n)\). In principle, Naive Bayes assumes that not only the dependent variable, \(Y\), but also the predictor(s), \(X\), are categorical (numerical predictors will be discussed later). The classifier then uses a frequentist approach to estimate the required probabilities from data. In this approach, probabilities are estimated by the frequency with which the event in question occurs in a set of experiments. For example, if we didn’t know that the probability of any particular outcome of rolling a die is 1/6, we could roll the die several times, count how many times we have observed each particular outcome, and then divide these counts by the total number of times that the die has been rolled. If the die is fair and we roll it many times, we should get a probability of approximately 1/6 for each possible outcome.
Likewise, in a classification setting, if we are given a large number of observations \((X,Y)\), then we can estimate from these observations, in a frequentist way, the probability \(P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = y)\) that the predictors take on particular values, \(X_1 = x_1, \cdots, X_n = x_n\), given that an observation belongs to a particular class, \(Y = y\). We can also estimate the prior probability that an observation belongs to class \(y\), \(P(Y = y)\), and the prior probability that the predictors take on those particular values, \(P(X_1 = x_1 \& \cdots \& X_n = x_n)\). Once those probabilities have been computed, we can then use the Bayes theorem to compute the probabilities \(P(Y = y|X_1 = x_1 \& \cdots \& X_n = x_n)\) that a certain observation belongs to any particular class, \(Y = y\), provided that its predictors take on values \(X_1 = x_1, \cdots, X_n = x_n\). Recall from the Bayes Rule of Classification that this is all we need to classify a new observation, whose predictors have been measured but the class label is unknown.
Actually, in order to apply the Bayes rule of classification we don’t really need the exact probability values, we only need to know which probability is the largest. Since the denominator in Eq. (2) — i.e. the prior \(P(X_1 = x_1 \& \cdots \& X_n = x_n)\) — is the same irrespective of the class label, we actually only need to compute the numerator, i.e., we only need to estimate \(P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = y)\) and \(P(Y = y)\) for each class label \(Y = y\). The problem is that the number of observations required to get a rough frequentist estimate of the former term, in principle, grows exponentially with the number of predictors, \(n\), thus making the frequentist approach unfeasible in most practical applications. The Naive Bayes classifier circumvents this problem by making the assumption that the predictors are statistically independent within each class, which means that no particular value taken by a certain predictor affects the probabilities associated with the other predictors within a given class. Mathematically, if the class conditional independence assumption holds true (for any given class label, \(Y = y\)), then:
\[P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = y) = P(X_1 = x_1|Y = y)*P(X_2 = x_2|Y = y) * \cdots * P(X_n = x_n|Y = y)\]
This property makes it much easier to estimate \(P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = y)\) by independently estimating and multiplying the probabilities associated with each individual predictor. Under the independence assumption, the Bayes theorem in Eq. (2) can be rewritten as:
Equation (3): \(\large P(Y = y|X_1 = x_1 \& \cdots \& X_n = x_n) = \frac{P(X_1 = x_1|Y = y) P(X_2 = x_2|Y = y) \cdots P(X_n = x_n|Y = y)\, P(Y = y)}{P(X_1 = x_1 \& \cdots \& X_n = x_n)} = \frac{P(Y = y) \prod_{i=1}^{n} P(X_i = x_i|Y = y)}{P(X_1 = x_1 \& \cdots \& X_n = x_n)}\)
Why is the independence assumption so important in practice? To answer this question, let’s recall our rolling dice experiment, but now assuming that we want to estimate the probability of a certain outcome, say 3 and 5, when rolling two dice (or the same die twice), rather than a single die. Since we know that the outcome of the first die does not affect the probabilities for the second die and vice-versa (i.e., the two events are independent), it is clear that the probability of any outcome, such as (3,5), is (1/6)*(1/6) = 1/36. So, instead of rolling the two dice until we observe (3,5) a number of times that allows us to estimate the probability as roughly 1/36, which would require a very large number of experiments/observations, we instead can roll a single die a much smaller number of times to estimate the probabilities of 3 and 5 independently as roughly 1/6, and then multiply those to get a rough estimate of 1/36.
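This can also be verified empirically with a short simulation in R (an illustrative sketch; the estimated frequencies will vary slightly from run to run):

set.seed(0)
n_rolls <- 100000
die1 <- sample(1:6, n_rolls, replace = TRUE)
die2 <- sample(1:6, n_rolls, replace = TRUE)
# Relative frequency of the joint outcome (3,5) vs the product of the marginal frequencies:
c(joint = mean(die1 == 3 & die2 == 5),
  product = mean(die1 == 3) * mean(die2 == 5),
  theoretical = 1/36)

Both estimates should be close to the theoretical value \(1/36 \approx 0.028\); the advantage of the product form is that each marginal frequency can be estimated reliably from far fewer rolls than the joint frequency.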
Why is this approach, based on the class conditional independence assumption, called “naive” in the context of classification? It is “naive” because databases in practical applications usually contain predictors that are correlated to some degree. For instance, wages tend to be higher among senior professionals, so wage may correlate with age, which in turn may also be correlated with body weight within certain population groups, and so on. Interestingly, however, the Naive Bayes classifier usually exhibits good performance even when the independence assumption is violated. The reason is that the classifier does not require the exact probability values \(P(Y|X_1 \& \cdots \& X_n)\), whose estimates using Eq. (3) can be inaccurate due to the violation of the independence assumption; such inaccuracies cause no harm as long as the largest probability can still be correctly determined.
As an example, let’s consider the Weather data set from (I. Witten et al. 2016), displayed in Fig. 5.
Fig. 5. Weather data set (I. Witten et al. 2016)
In this illustrative data set, there are four categorical predictors, namely, “Outlook” (\(X_1\)), “Temp[erature]” (\(X_2\)), “Humidity” (\(X_3\)), and “Windy” (\(X_4\)), which describe the weather conditions of 14 different days (observations, i.e., rows of the table). The goal is to model whether (“Play” = Yes) or not (“Play” = No) a particular day is suitable to play. Variable “Play” is our dependent variable (\(Y\)), and since it takes two values only, this is a binary classification problem.
In order to classify any new observation into classes Yes or No, we need a table of probabilities containing the priors, \(P(Y = \rm{Yes})\) and \(P(Y = \rm{No})\), as well as \(P(X_i|Y)\) for all possible combinations of values of \(X_i\) (\(i = 1, \cdots, 4\)) and \(Y\). Since 9 out of the 14 observations belong to class Yes, it is trivial to compute the priors as \(P(Y = \rm{Yes}) = 9/14\) and \(P(Y = \rm{No}) = 5/14\). The conditional probabilities \(P(X_i|Y)\) are also trivial to compute by inspecting the data set. For example, \(P(X_1 = \rm{Sunny}|Y = \rm{No}) = 3/5\), because 3 out of the 5 observations for which \(Y = \rm{No}\) are such that \(X_1 = \rm{Sunny}\). As another example, \(P(X_4 = \rm{False}|Y = \rm{Yes}) = 6/9 = 2/3\), because 6 out of the 9 observations for which \(Y = \rm{Yes}\) are such that \(X_4 = \rm{False}\).
We can easily compute the complete table of probabilities in R. There are multiple R implementations of the Naive Bayes classifier, for instance in packages naivebayes and e1071. In the following we show how to compute the Naive Bayes probability table for the Weather data set (available in a CSV file Weather_Data_Witten_Frank_Book.csv) using function naive_bayes from package naivebayes:
library(naivebayes)
Weather_Data <- read.table("Weather_Data_Witten_Frank_Book.csv", header=TRUE, sep = ",", dec = ".",
col.names = c("X1_Outlook","X2_Temp","X3_Humidity","X4_Windy","Y_Play"))
NaiveBayesWeather <- naive_bayes(Y_Play ~., data = Weather_Data)
In the call to function naive_bayes, Y_Play ~. specifies that variable Y_Play (as named in the call to read.table) is the dependent variable and all the others are the predictors. The Naive Bayes probability table is contained in the model returned by the naive_bayes function, and can be visualised by just typing the model name:
NaiveBayesWeather
## ===================== Naive Bayes =====================
## Call:
## naive_bayes.formula(formula = Y_Play ~ ., data = Weather_Data)
##
## A priori probabilities:
##
## No Yes
## 0.3571429 0.6428571
##
## Tables:
##
## X1_Outlook No Yes
## Overcast 0.0000000 0.4444444
## Rainy 0.4000000 0.3333333
## Sunny 0.6000000 0.2222222
##
##
## X2_Temp No Yes
## Cool 0.2000000 0.3333333
## Hot 0.4000000 0.2222222
## Mild 0.4000000 0.4444444
##
##
## X3_Humidity No Yes
## High 0.8000000 0.3333333
## Normal 0.2000000 0.6666667
##
##
## X4_Windy No Yes
## False 0.4000000 0.6666667
## True 0.6000000 0.3333333
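As a sanity check, any of the conditional probability tables above can be reproduced directly by a frequentist count on the data; for example, for the Outlook predictor:

# P(X1_Outlook | Y_Play), estimated as column-wise proportions of the raw counts:
prop.table(table(Weather_Data$X1_Outlook, Weather_Data$Y_Play), margin = 2)

The resulting column proportions match the X1_Outlook table reported by naive_bayes above.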
Now that we have trained the classifier, how do we use it to classify a new observation? For example, following (I. Witten et al. 2016), let’s assume that there is a new day for which Outlook is Sunny, Temperature is Cool, Humidity is High, and Windy is True. Based on the Naive Bayes model of our data, is this a good day to play?
To answer this question, we need \(P(Play = \rm{?}|Outlook = \rm{Sunny}, Temp = \rm{Cool}, Humidity = \rm{High}, Windy = \rm{True})\) for both “Play” = Yes and “Play” = No, so we can assign the new observation to the class with the largest probability. As discussed before, we only need to compute the numerators in Eq. (3), because the denominator is the same for both classes. Hence, for class “Play” = Yes the numerator in Eq. (3) is given by:
\[P(\rm{Sunny}|\rm{Yes})*P(\rm{Cool}|\rm{Yes})*P(\rm{High}|\rm{Yes})*P(\rm{True}|\rm{Yes})*P(Play = \rm{Yes})=2/9*3/9*3/9*3/9*9/14 = 0.0053\]
Similarly, for class “Play” = No we have:
\[P(\rm{Sunny}|\rm{No})*P(\rm{Cool}|\rm{No})*P(\rm{High}|\rm{No})*P(\rm{True}|\rm{No})*P(Play = \rm{No})=3/5*1/5*4/5*3/5*5/14 = 0.0206\]
Since the latter is larger than the former, we classify the new observation right away as “Play” = No. If we want probability values as a measure of confidence in our classification, we just normalise the above values so they add up to 1:
\[P(Play = \rm{Yes}|Outlook = \rm{Sunny}, Temp = \rm{Cool}, Humidity = \rm{High}, Windy = \rm{True}) = 0.0053/(0.0053 + 0.0206) = 0.2046\]
\[P(Play = \rm{No}|Outlook = \rm{Sunny}, Temp = \rm{Cool}, Humidity = \rm{High}, Windy = \rm{True}) = 0.0206/(0.0053 + 0.0206) = 0.7954\]
As another example, for a new day in which Outlook is Rainy, Temperature is Hot, Humidity is High, and Windy is False, we would get (do it yourself as an exercise):
\[P(Play = \rm{Yes}|Outlook = \rm{Rainy}, Temp = \rm{Hot}, Humidity = \rm{High}, Windy = \rm{False}) = 0.01058/(0.01058 + 0.01828) = 0.3666\]
\[P(Play = \rm{No}|Outlook = \rm{Rainy}, Temp = \rm{Hot}, Humidity = \rm{High}, Windy = \rm{False}) = 0.01828/(0.01058 + 0.01828) = 0.6334\]
Therefore, this observation would also be classified as “Play” = No, but now with less confidence than the previous one.
In R, we can build a separate data set with the new observations we want to classify, making sure that the new data set has exactly the same properties as the original data set, and then use function predict with argument type = "class" to perform the classification:
# Append two new observations to the original data table (new rows 15 and 16):
New_Weather_Data <- rbind(Weather_Data, c("Sunny","Cool","High","True",NA), c("Rainy","Hot","High","False",NA))
# Perform classification of the new observations:
Pred_class <- predict(NaiveBayesWeather, newdata = New_Weather_Data[15:16,], type = "class")
# Confirm that the two observations have indeed been classified as "No":
Pred_class
## [1] No No
## Levels: No Yes
If we use type = "prob" in lieu of type = "class", instead of the predicted class labels we get the corresponding posterior class probabilities (which we had already computed manually — see above):
Pred_prob <- predict(NaiveBayesWeather, newdata = New_Weather_Data[15:16,], type = "prob")
Pred_prob
## No Yes
## [1,] 0.7954173 0.2045827
## [2,] 0.6334311 0.3665689
If we want to classify all the observations, including the ones used for training, we can just replace newdata = New_Weather_Data[15:16,] with newdata = New_Weather_Data in the above code. If we want to classify only the observations used for training (sanity check), we can use newdata = NULL:
Pred_class <- predict(NaiveBayesWeather, newdata = NULL, type = "class")
Pred_class
## [1] No No Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes No
## Levels: No Yes
We can easily compare the predicted classes against the real classes (5th column of the data set) by building a contingency table:
cont_tab <- table(Pred_class, Weather_Data$Y_Play)
cont_tab
##
## Pred_class No Yes
## No 4 0
## Yes 1 9
The table shows that 4 observations from class “Play” = No have been correctly classified as such, 1 observation (the 6th) has been misclassified as “Play” = Yes, whereas all 9 observations from class “Play” = Yes have been correctly classified as such. This gives us a Classification Error for the training data set (so-called Reconstruction Error) of 1/14, or equivalently, a Classification Accuracy of 13/14:
# Accuracy = (cont_tab[1,1] + cont_tab[2,2]) / (cont_tab[1,1] + cont_tab[1,2] + cont_tab[2,1] + cont_tab[2,2])
(reconstruction_accuracy <- sum(diag(cont_tab))/sum(cont_tab))
## [1] 0.9285714
# Error = (cont_tab[1,2] + cont_tab[2,1]) / (cont_tab[1,1] + cont_tab[1,2] + cont_tab[2,1] + cont_tab[2,2])
(reconstruction_error <- 1 - reconstruction_accuracy)
## [1] 0.07142857
In other words, approximately 92% of the data has been correctly classified. Although this may sound quite good at first glance, there are two potential pitfalls when interpreting this figure. The first one is that good prediction performance on the same data set used to train the model provides no guarantee at all of prediction power in general, due to the risk of overfitting. For this reason, the reconstruction error/accuracy should never be used as a measure of classification performance. Assessing the prediction power of classifiers is a topic that will be discussed throughout this subject; for now, it is important to keep in mind that classifiers should be assessed using a separate (so-called test or validation) data set, i.e., a representative subset of data concealed from the training and reserved only for assessment of the resulting model.
The second pitfall concerns the error and accuracy measures themselves, which can be misleading even when a validation data set is in place. Notice that, if the data set is imbalanced, a dummy “classifier” that assigns every observation to the majority class (“Play” = Yes in the Weather data set) may reach high accuracy values despite being useless. For example, in an extremely imbalanced case in which the majority class corresponds, say, to 99% of the observations, such a dummy strategy will reach 99% accuracy! For this reason, other alternative measures of classification performance should also be used alongside accuracy.
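A hypothetical illustration of this pitfall on the Weather data set (here the majority class covers only 9 out of 14 observations, but the same code applied to a 99%-majority data set would report an accuracy of 0.99):

# A dummy "classifier" that always predicts the majority class ("Yes"):
dummy_pred <- rep("Yes", nrow(Weather_Data))
mean(dummy_pred == Weather_Data$Y_Play)  # 9/14 (approx. 0.64), with no learning at all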
Three very common measures for classification assessment are Precision, Recall, and F1-Measure. Do some research on your own and discuss these measures with your peers. For instance, the Wikipedia Entry on F1 Score is a good starting point.
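As a starting point for that discussion, here is a minimal sketch of how these measures could be computed from the contingency table obtained above for the training data, treating “Play” = Yes as the positive class (the formulas below are the standard definitions; check them against your own reading):

TP <- cont_tab["Yes", "Yes"]  # predicted Yes, actually Yes
FP <- cont_tab["Yes", "No"]   # predicted Yes, actually No
FN <- cont_tab["No", "Yes"]   # predicted No, actually Yes
precision <- TP / (TP + FP)   # 9/10
recall    <- TP / (TP + FN)   # 9/9
F1 <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = F1)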
When the data set contains one or more numerical, real-valued predictors, \(X_i\), we can no longer estimate the probabilities \(P(X_i|Y)\) in a frequentist way, because such predictor(s) can take an infinite number of values. The workaround used by the Naive Bayes classifier is to model these probabilities using some form of probability density function. The most common approach uses the Normal (Gaussian) distribution, in which case the Naive Bayes classifier is closely related to Linear and Quadratic Discriminant Analysis (LDA and QDA). This approach is the default setting in function naive_bayes from package naivebayes.
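Under this Gaussian assumption, each class conditional probability is replaced by a density value of the standard univariate Normal form, where \(\mu_{iy}\) and \(\sigma_{iy}\) denote the mean and standard deviation of predictor \(X_i\) estimated from the training observations of class \(y\):

\[P(X_i = x_i|Y = y) \approx \frac{1}{\sqrt{2\pi}\,\sigma_{iy}} \exp\left(-\frac{(x_i - \mu_{iy})^2}{2\sigma_{iy}^2}\right)\]

This is why, for numerical predictors, the model summary reports a mean and a standard deviation (sd) per predictor and per class instead of a table of frequencies.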
Let’s consider once again the data set iris from base R (datasets package). As we know, this data set contains 4 numerical predictors, besides the dependent class variable:
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We will randomly split the data set into a training set (with 80% of the observations) and a test set (with the remaining 20% of the observations):
set.seed(0)
no_obs <- dim(iris)[1] # No. of observations (150)
test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE) # 20% data records for test
training_index <- -test_index # 80% data records for training
Then, we will use the training set to train a Naive Bayes classifier:
NaiveBayesIris <- naive_bayes(Species ~., data = iris[training_index,])
NaiveBayesIris
## ===================== Naive Bayes =====================
## Call:
## naive_bayes.formula(formula = Species ~ ., data = iris[training_index,
## ])
##
## A priori probabilities:
##
## setosa versicolor virginica
## 0.3166667 0.3416667 0.3416667
##
## Tables:
##
## Sepal.Length setosa versicolor virginica
## mean 5.0026316 5.9878049 6.5365854
## sd 0.3522030 0.5095072 0.6191753
##
##
## Sepal.Width setosa versicolor virginica
## mean 3.4131579 2.7536585 2.9926829
## sd 0.3588051 0.3264181 0.3445216
##
##
## Petal.Length setosa versicolor virginica
## mean 1.4500000 4.2658537 5.4975610
## sd 0.1812121 0.4783355 0.5659010
##
##
## Petal.Width setosa versicolor virginica
## mean 0.2552632 1.3292683 2.0439024
## sd 0.1155419 0.2052363 0.2665040
Finally, we assess the prediction power of the obtained classifier in the subset of data reserved for test:
Pred_class <- predict(NaiveBayesIris, newdata = iris[test_index,], type = "class")
(cont_tab <- table(Pred_class, iris$Species[test_index]))
##
## Pred_class setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 9 1
## virginica 0 0 8
(accuracy <- sum(diag(cont_tab))/sum(cont_tab))
## [1] 0.9666667
We can see that only one out of 30 observations was misclassified, which means an accuracy of about 96%. However, notice that this may change if we select a different random split of the data into 80% for training and 20% for test.
Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets.
Like the Naive Bayes classifier, Linear Discriminant Analysis (LDA) also uses the Bayes theorem to indirectly estimate the probability \(P(Y|X)\), but it does not make the assumption of (class conditional) independence of predictors, so Eq. (3) cannot be used. Instead, LDA uses a version of Eq. (2) where the denominator is rewritten based on the fact that the class labels impose a partition of the sampling space and, therefore, if there are \(k\) classes, \(Y = 1, \cdots, k\), then \(P(X_1 \& \cdots \& X_n) = \sum_{i=1}^{k} P(Y = i)*P(X_1 \& \cdots \& X_n|Y = i)\). This property reads as “The probability of observing \(X_1\) and \(X_2\) and … and \(X_n\) equals the probability of class \(Y = 1\) times the probability of observing these predictor values given that the corresponding observation belongs to class \(Y = 1\), plus the probability of class \(Y = 2\) times the probability of observing these predictor values given class \(Y = 2\), plus … plus the probability of class \(Y = k\) times the probability of observing these predictor values given class \(Y = k\)”. This holds because the class labels are mutually exclusive and collectively exhaustive. Hence, using this property, Eq. (2) can be rewritten as:
Equation (4): \(\large P(Y = y|X_1 = x_1 \& \cdots \& X_n = x_n) = \frac{P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = y)P(Y = y)}{\sum_{i=1}^{k} P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = i)*P(Y = i)}\)
Unlike the Naive Bayes classifier, which can operate with both categorical and numerical predictors, LDA requires that all predictors be numerical. It then models the class conditional probabilities \(P(X_1 = x_1 \& \cdots \& X_n = x_n|Y = i)\), which appear both in the numerator and in the denominator of Eq. (4), by means of a multivariate Normal distribution (a parametric probability density function). The distribution parameters and the a priori class probabilities \(P(Y = i)\) are numerically estimated using the data set available for training. The mathematical details are omitted here; we refer the interested reader to Section 4.4 of (James et al. 2013) (see recommended reading below).
One detail that is worth mentioning is that LDA not only assumes that the observations of each class follow a particular multivariate Normal distribution, it goes further and assumes that the classes share a common covariance matrix. In other words, it assumes that the classes follow their own multivariate Normal distributions, which can differ only in terms of their multivariate means, not in terms of their multivariate spreads. The main implications of this assumption are twofold: on the one hand, the problem becomes mathematically and numerically simpler, and its solution is a decision rule that depends on the features only through a linear combination of them; on the other hand, this means that the decision boundary between any two classes is linear (a hyperplane in the feature space — this is the reason for the word “linear” in LDA). As a consequence, LDA tends not to be suitable for non-linear classification problems, or even for linear problems where the classes follow distributions that do not (at least approximately) satisfy the assumptions.
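For reference, under these assumptions the LDA rule assigns an observation \(x = (x_1, \cdots, x_n)^T\) to the class \(k\) with the largest discriminant score (this is the standard form derived in Section 4.4 of (James et al. 2013), where \(\mu_k\) is the mean vector of class \(k\), \(\Sigma\) the common covariance matrix, and \(\pi_k = P(Y = k)\) the prior):

\[\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k\]

Since \(\delta_k(x)\) is linear in \(x\), the boundary between any two classes, defined by \(\delta_k(x) = \delta_l(x)\), is a hyperplane.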
Training and testing an LDA classifier in R is similar to what we did for the Naive Bayes classifier. Here we use function lda from package MASS:
library(MASS)
set.seed(0)
no_obs <- dim(iris)[1] # No. of observations (150)
test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE) # 20% data records for test
training_index <- -test_index # 80% data records for training
lda.fit <- lda(Species ~., data=iris, subset = training_index)
lda.pred <- predict(lda.fit, iris[test_index,])
lda.Pred_class <- lda.pred$class
cont_tab <- table(lda.Pred_class, iris$Species[test_index])
cont_tab
##
## lda.Pred_class setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 9 0
## virginica 0 0 9
Notice that the trained model provided a perfect classification of the above 20% test subset for the iris data set. Once again, however, it should be noticed that this result may be different for a different random split of the data into 80% for training and 20% for test.
Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets. Compare with the result from the Naive Bayes classifier.
Quadratic Discriminant Analysis follows the same model as Linear Discriminant Analysis (LDA), except that it drops the assumption that the class conditional distributions must share a common covariance matrix. In other words, QDA models the distribution of each class by means of an independent multivariate Normal probability density function. The main implications of allowing class conditional distributions with different covariances are twofold. On the one hand, the model is more flexible in the sense that its assumptions are less restrictive in practice (i.e. they are more likely to be met, at least approximately) as well as in the sense that the resulting decision boundary is no longer necessarily linear (it can be non-linear). On the other hand, there is the need to estimate multiple covariance matrices, which means that the number of parameters to be estimated is larger, thus making the model and its training mathematically and computationally more complex as well as more prone to overfitting.
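For comparison with the LDA discriminant shown earlier, the corresponding QDA score (again in the standard form of (James et al. 2013), now with a class-specific covariance matrix \(\Sigma_k\)) is:

\[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^{T}\Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log \pi_k\]

which is quadratic in \(x\), and hence yields decision boundaries that are quadratic surfaces rather than hyperplanes.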
Training and testing a QDA classifier in R is similar to what we did for the Naive Bayes and LDA classifiers. Here we use function qda from package MASS:
library(MASS)
set.seed(0)
no_obs <- dim(iris)[1] # No. of observations (150)
test_index <- sample(no_obs, size = as.integer(no_obs*0.2), replace = FALSE) # 20% data records for test
training_index <- -test_index # 80% data records for training
qda.fit <- qda(Species ~., data=iris, subset = training_index)
qda.pred <- predict(qda.fit, iris[test_index,])
qda.Pred_class <- qda.pred$class
cont_tab <- table(qda.Pred_class, iris$Species[test_index])
cont_tab
##
## qda.Pred_class setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 9 0
## virginica 0 0 9
Notice that the trained model provided a perfect classification of the above 20% test subset for the iris data set. Once again, however, it should be noticed that this result may be different for a different random split of the data into 80% for training and 20% for test.
Exercise: Instead of randomly splitting the data into a training and a test set only once, repeat this procedure 10 times (use a for loop like for(i in 1:10){# your code}) and compute the mean and the standard deviation of the classification accuracy over the 10 test sets. Compare with the results from the Naive Bayes and LDA classifiers.
When its model assumptions are satisfied and the class conditional distributions are known, QDA corresponds to the so-called Optimal Bayes Classifier, which is optimal in the sense that it minimizes the probability (or risk) of misclassification. LDA is a particular case of QDA with more restrictive assumptions. Both LDA and QDA require numerical predictors. Naive Bayes, in turn, can also operate with categorical and mixed (i.e. numerical and categorical) predictors; when all predictors are numerical and modeled with Gaussian densities, Naive Bayes corresponds to a particular case of QDA in which the class covariance matrices are diagonal (it further reduces to a particular case of LDA only if those diagonal covariances are also shared across classes). The classification decisions of these classifiers are not affected by variable rescaling.
Exercise: Practice your skills by training, testing and comparing the results of the Naive Bayes, LDA, and QDA classifiers in other classification data sets with numerical predictors (other than iris).
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer.
Witten, I., E. Frank, M. Hall, and C. Pal. 2016. Data Mining: Practical Machine Learning Tools and Techniques. 4th ed. Morgan Kaufmann.