What is Data Mining?

Recall that data science involves the whole lifecycle of data, as displayed in Fig. 1.

Fig. 1. Data Science and the Cycle of Knowledge Discovery from Databases (KDD)

In this process, the cycle of Knowledge Discovery from Databases (KDD) plays a fundamental role. Data Mining lies at the core of this cycle. It refers to a collection of mathematical and computational methods and techniques to extract models and patterns from data.

In the scientific literature, the term data mining is defined in slightly different ways, e.g.:

“The process of discovering interesting patterns and knowledge from large amounts of data” (Han, Kamber, and Pei 2012)

“The process of automatically discovering useful information in large data repositories” (Tan, Steinbach, and Kumar 2006)

“The process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data” (Zaki and Meira Jr. 2014)

Core Data Mining Tasks

There are numerous problems that data scientists may aim to solve in the realm of data mining, but the vast majority can be categorised as a variant, specialisation, or particular case of a small number of core data mining tasks: predictive tasks (such as regression, classification, and supervised anomaly detection), descriptive tasks (such as clustering, unsupervised outlier detection, and frequent pattern mining), and more specialised tasks (such as recommendation).

These tasks are described in more detail in the sequel.

Predictive Tasks

In predictive data mining tasks, we want to predict the unknown value of a certain variable using the known values of other variables. Typically, the variable we want to predict is called the dependent or output variable, \(Y\), whereas the variable(s) used for the prediction are called independent or input variable(s), or simply predictor(s), \(X = \{X_{1}, \cdots, X_{n} \}\). Past observations of both \((X,Y)\) serve as a “teacher” for a model to be trained from data, and the training of such a model is referred to as supervised learning. The most common predictive tasks in data mining are regression and classification.

Regression

In regression, the dependent variable \(Y\) is numeric, and it is presumed that it can be described as a function \(f\) of the predictor(s) \(X\), i.e., \(Y = f(X) + \epsilon\), except for a component \(\epsilon\) that depends on unobserved variables (observation error). In a nutshell, the main goal of the regression task is to learn a model for the unknown mapping \(f\) from a data set containing a representative collection of observations for which the input-output values \((X,Y)\) are known. If such a model is successfully learnt, it can be used later to predict the unknown value of \(Y\) given known values of \(X\).

For instance, let us consider dataset diamonds from the ggplot2 package in R/RStudio. It seems reasonable to assume that we can approximately model the price of a diamond as a function of its weight. In this case, the dependent variable \(Y\) is price and the single predictor \(X_{1}\) is carat (weight). After some suitable pre-processing of the raw data (in this case a log-transformation of both price and carat), if we assume that \(Y\) can be approximated as a linear function of \(X_{1}\), i.e., \(Y \approx \beta_{0} + \beta_{1}X_{1}\), where parameters \(\beta_{0}\) and \(\beta_{1}\) are the intercept and the slope, respectively, then we can determine the values of these parameters so that the resulting model best fits the data:

library(ggplot2)
# Scatter plot of log2(carat) vs. log2(price), with a fitted linear model overlaid (blue line)
ggplot(diamonds, aes(x = log2(carat), y = log2(price))) +
  geom_point() + geom_smooth(method = "lm")
Fig. 2. Regression Example: `diamonds` dataset

In Fig. 2, the blue line (produced by geom_smooth(method="lm")) is the resulting linear model. Once this model has been obtained, given any value of \(X_{1}\) (carat) we can predict the value of \(Y\) (price) as the corresponding point on the blue line.
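If we want to fit this linear model explicitly, rather than only overlay it with geom_smooth(), a minimal sketch using base R's lm() could look as follows; it follows the same log-transformation as above, and the 1.5-carat query value is just an arbitrary example.

# Fit the linear model log2(price) = beta0 + beta1 * log2(carat) explicitly
fit <- lm(log2(price) ~ log2(carat), data = diamonds)
coef(fit)  # estimated intercept (beta0) and slope (beta1)
# Predict the log2-price of a hypothetical 1.5-carat diamond, then back-transform to the original price scale
2^predict(fit, newdata = data.frame(carat = 1.5))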

Prediction is not the only goal of regression though. Sometimes, understanding how certain changes in \(X\) affect \(Y\) (in desirable or undesirable ways) can be even more important. For instance, if scientists learn that some dosage of a certain substance \((X)\) tends to lower the level of blood cholesterol \((Y)\), this substance could be a candidate compound for new medicine to treat high blood cholesterol.

Classification

In classification, we also want to model a dependent variable \(Y\) as a function of the predictor(s) \(X\); the difference is that \(Y\) is a categorical (rather than numerical) variable, i.e., it takes on a finite number of nominal values, so-called class labels, each of which represents one of the classes to which data observations can belong.

For instance, let us consider dataset iris from base R (datasets package). This is a famous dataset introduced in 1936 by Ronald Fisher. It contains 150 observations of flowers, 50 observations from each of three species of Irises, namely, setosa, virginica, and versicolor. These are the values (factor levels) for the 5th variable (5th column, Species). The other four variables (1st to 4th columns) are the length and width of the sepal and petal, respectively (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, in centimetres). From the point of view of classification, one may want to be able to predict the species of a new observation (Iris flower) given its sepal and petal measurements. For the sake of illustration, let us put the lengths aside and focus only on two predictors (widths) so we can visualise the data:

# Scatter plot of the two width variables, coloured by the class label (Species)
ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Width, y = Petal.Width, color = Species))
Fig. 3. Classification Example: `iris` dataset

In Fig. 3, the class labels are represented by different colours. The figure suggests that a simple, approximate rule-based classification for this dataset could be Species = setosa if Petal.Width <= 0.75, Species = versicolor if 0.75 < Petal.Width < 1.75, and Species = virginica if Petal.Width >= 1.75.
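As a quick illustration, this hand-crafted rule can be written directly in R and compared against the true labels. Note that it is only a rule suggested by eyeballing Fig. 3, not a classifier trained from data.

# Hand-crafted rule from Fig. 3: classify each flower by its petal width alone
predicted <- with(iris, ifelse(Petal.Width <= 0.75, "setosa",
                        ifelse(Petal.Width < 1.75, "versicolor", "virginica")))
table(Predicted = predicted, Actual = iris$Species)  # confusion table: rule vs. true species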

Like in regression, the main goals in classification are twofold. On the one hand, an analyst may want to predict the unknown class label for a new observation whose predictors are known, such as whether or not a bank client is likely to default on a prospective loan based on that client’s personal information and past banking records. On the other hand, the analyst may also (or instead) want to understand what the main factors are that contribute to a certain classification, e.g., which combinations of predictors and their values tend to be associated with clients who default on a loan, or which combinations of genetic predictors and their values tend to be associated with tissue samples classified as cancerous, as opposed to healthy.

Supervised Anomaly Detection

Supervised anomaly detection is a particular case of classification in which one wants to discriminate between “normal” and “anomalous” observations. The classes “normal” and “anomaly” depend on the particular application at hand. For instance, a credit card company may want to automatically flag a transaction as legitimate (to be approved) or fraudulent (to be denied). To train such a classifier, a database of previous transactions that are known to be either legitimate or fraudulent is used. Similarly, an IT department may want to automatically flag a computer session as legitimate or unauthorised, based on past reports of network activity and known intrusions.

In this (supervised) setting, anomaly detection is therefore just a binary classification problem. However, since anomalies are rare (in relative terms when compared to normal observations and, in some domains, also in absolute terms), the problem becomes highly imbalanced, which poses a number of challenges to conventional classification methods. For this reason, supervised anomaly detection usually requires the use of special classification techniques.
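One simple example of such a technique (among many) is to weight observations inversely to the frequency of their class, so that the rare anomalous class is not overwhelmed during training. The sketch below uses purely synthetic labels for illustration.

# Synthetic, highly imbalanced labels: 990 legitimate vs. 10 fraudulent transactions
y <- factor(c(rep("legitimate", 990), rep("fraudulent", 10)))
w <- 1 / as.numeric(table(y)[as.character(y)])  # weight = 1 / (size of the observation's class)
tapply(w, y, sum)  # total weight per class is now balanced (1 and 1)
# Many R classifiers accept such weights, e.g. glm(..., weights = w, family = binomial)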

Descriptive Tasks

We have seen that in predictive tasks we have a target variable, \(Y\), and the goal is to learn a model that maps some suitable input variable(s) \(X\) into \(Y\), in a supervised way. In many scenarios, however, we don’t have a dependent variable \(Y\), but only the input variable(s), \(X\). Yet, there may be valuable patterns that analysts may want to learn from \(X\) alone. Since these patterns are learnt in the absence of a “teacher” \(Y\), the process is usually referred to as unsupervised learning.

In data mining, unsupervised learning is usually associated with descriptive tasks, which are mainly meant to detect and explain interesting patterns observed from the current dataset, rather than explicitly build a model for future predictions. The most common descriptive tasks in data mining are clustering, outlier detection, and frequent pattern mining.

Clustering

In clustering we are interested in finding groups (clusters) of observations that are more similar or related to each other than to observations in other groups. Cluster analysis has been a well-established field of statistics since the 1950s, when analysts were interested in finding, for instance, groups of customers with similar purchasing behaviour and/or interests. This is a classic problem in cluster analysis called market segmentation. With the advent of modern techniques that are both effective and computationally efficient in tackling large amounts of data, clustering has drawn increasing attention among data scientists. Nowadays, it finds applications in virtually all fields of knowledge (e.g. biology, astronomy, geology, archaeology, finance, and business, just to mention a few). In genetics, for example, analysts may want to find groups of genes whose expression signatures under certain conditions are related, as these may be jointly associated with a certain function (or dysfunction) in a given organism.

Cluster analysis plays an important role not only as a data mining tool on its own, but also as a useful tool for data pre-processing, visualisation and exploratory data analysis, typically in early stages of the KDD cycle, before other techniques can be applied. Clustering is sometimes referred to as unsupervised classification, but should not be confused with supervised classification, which is a different task. To illustrate the difference between clustering and supervised classification, let us consider the iris dataset once again. Here, however, we ignore the class labels (variable Species, 5th column), because labels would not be available in a clustering application.

# Same scatter plot as Fig. 3, but without the class labels (colours)
ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Width, y = Petal.Width))
Fig. 4. Clustering Example: `iris` dataset

Notice in Fig. 4 that, rather than the three coloured classes of Fig. 3, we now see two natural groups of objects. These groups are called clusters, and they are what the analyst is typically looking for in a clustering application. The fact that the upper cluster actually consists of observations from two different Irises is, in principle, irrelevant, as the goals of clustering and classification are different. Still, certain clustering algorithms can detect that this cluster is composed of subclusters (subcategories) that roughly correspond to the two Irises, especially when additional variables are used. Notice that, when the data to be clustered are described by more than two variables, we can no longer visually detect clusters by just looking at a scatter plot like Fig. 4. That is when clustering algorithms, which detect clusters automatically, play a major role.
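As a minimal sketch of how such an algorithm could be applied here, the example below runs k-means (a classic clustering method available in base R) on the two width variables, ignoring the Species labels entirely; the choice of two clusters simply mirrors what Fig. 4 suggests.

# k-means clustering of the two width variables; the Species labels are not used
set.seed(1)  # k-means uses random initial centres, so fix the seed for reproducibility
km <- kmeans(iris[, c("Sepal.Width", "Petal.Width")], centers = 2)
table(Cluster = km$cluster, Species = iris$Species)  # compare found clusters with the hidden labels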

Unsupervised Outlier Detection

We have learnt that in clustering the analyst is looking for patterns that emerge as clusters of objects that are somehow more similar or related to each other than to observations in other clusters. Since clusters are formed by collections of similar observations, they can be interpreted as the usual patterns in the data. However, not all observations fit these patterns well. Observations that largely deviate from other observations and do not fit any cluster in the data well are called outliers. Unlike clusters, outliers are interpreted as unusual patterns. Detecting these unusual patterns is important in many applications, because outliers are candidates to be anomalies. However, unlike supervised anomaly detection, in which the anomaly to be detected is previously known, outliers are potentially unknown anomalies yet to be discovered. For instance, they can be a new type of fraud or network intrusion not identified before, for which there are no labelled observations to train a classifier.

In Fig. 5 we indicate two observations that could be detected as outliers. The one indicated by the blue arrow is referred to as a global outlier, as it deviates from all the other observations and clusters in the dataset. The one indicated by the green arrow is a so-called local outlier, as it deviates relative to the other observations in a particular cluster (the lower cluster). The process of identifying such outliers is called unsupervised outlier detection.

# Same scatter plot as Fig. 4, with arrows added to highlight the two outliers
ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Width, y = Petal.Width)) +
  geom_segment(aes(x = 2, y = 0.1, xend = 2.26, yend = 0.27), colour = "blue", arrow = arrow()) +
  geom_segment(aes(x = 4.25, y = 1, xend = 4.38, yend = 0.45), colour = "green", arrow = arrow())
Fig. 5. Outliers Example: `iris` dataset
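As a rough illustration of how outliers like those in Fig. 5 could be scored automatically, one simple unsupervised approach (among many) is to score each observation by its distance to its k-th nearest neighbour; the value k = 5 below is an arbitrary choice, not a recommendation.

# Distance to the k-th nearest neighbour as a simple, unsupervised outlier score
X <- iris[, c("Sepal.Width", "Petal.Width")]
D <- as.matrix(dist(X))  # pairwise Euclidean distances between all observations
k <- 5                   # arbitrary neighbourhood size for this sketch
score <- apply(D, 1, function(d) sort(d)[k + 1])  # +1 skips the zero distance to itself
head(order(score, decreasing = TRUE))  # row indices of the most outlying observations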

Frequent Pattern Mining

Frequent pattern mining broadly refers to the process of finding strong associations between variables or sequences of values, which manifest themselves as recurring patterns in observed data. One of the frequent pattern mining techniques most widely used by data scientists in practice is association rule mining, which aims to detect interesting rules of association between the values of a collection of variables using some measure of interestingness. Even though association rules are nowadays applied in a variety of different fields, the best-known application is basket or transaction analysis. In this type of application the target database contains transactions, such as the collections of items purchased in a supermarket (hence the name “basket”). The goal is to find interesting rules of the type \(\{A_i, A_j, \ldots, A_k\} \rightarrow A_m\), which would indicate that when items \(\{A_i, A_j, \ldots, A_k\}\) are purchased together it is likely that item \(A_m\) will be purchased as well. This can be very useful for marketing strategies and decision making in a sales department.
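A possible sketch of association rule mining in R is given below. It assumes the arules package (not used elsewhere in these notes) and its bundled Groceries dataset of supermarket transactions; the support and confidence thresholds are arbitrary illustrative choices.

# Association rule mining with the arules package; install.packages("arules") if needed
library(arules)
data("Groceries")  # transactions data: one row of items per supermarket basket
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 3))  # show a few of the strongest rules by lift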

Specialised Tasks

Recommendation

Recommendation is a predictive task in which one wants to predict whether or not (or to which extent) an individual will like a certain item, so that the item may possibly be recommended to that individual. Targeted individuals are typically online users (online shoppers, social media followers, streaming video/music subscribers, etc.) and recommended items are typically products (e.g. books, movies, songs, electronics) or other users (e.g. a match in a dating site or a new friend/professional contact in a social networking platform).

A recommender system or a recommendation system (sometimes replacing “system” with a synonym such as platform or engine) is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to an item. [Wikipedia Entry on Recommender Systems]

Recommendation can sometimes be seen as a very specialised type of classification or regression. For instance, if one wants to predict the rating, say, within the interval \([0,5]\), which a user would assign to a certain item to reflect his/her preference after having interacted with that item, then the predicted rating is a numerical variable and the problem can be seen as related to regression. On the other hand, if one wants to predict whether a user would like or dislike a certain item based on a model of the user’s past preferences and behaviours, then the problem can be seen as a particular type of binary classification. However, unlike general-purpose models for classification and regression, the approaches used in recommendation are highly specialised to the particular type of recommendation task and strategy in question. For example, a product can be recommended to a user for being similar to other products that the user has previously purchased, or for being highly rated by other users that are similar to the targeted user, or both, each of which requires specific techniques that make use of domain-specific information.
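As a toy illustration of the first strategy (recommending items similar to what the user already likes), the sketch below scores items by the cosine similarity of their rating columns. The rating matrix, user/item names, and the helper function cosine() are entirely made up for this example.

# Toy user-by-item rating matrix; 0 means "not yet rated"
R <- matrix(c(5, 3, 0, 1,
              4, 0, 0, 1,
              1, 1, 0, 5,
              0, 1, 5, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("user", 1:4), paste0("item", 1:4)))
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# How similar is every item to item1, based on their columns of ratings?
sims <- apply(R, 2, cosine, b = R[, "item1"])
sort(sims, decreasing = TRUE)  # items most similar to item1 are candidates to recommend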

Other Tasks

There are many other data mining tasks which oftentimes are special cases, variants, or application-oriented specialisations of one of those discussed above. Here is a list with a few examples:

  • One-class classification: it is a variant of classification in which only labels for one class are available. For example, in certain applications only labels for a “normal” class are available, as the abnormal class is too rare (e.g. a very rare disease) or yet to be observed (e.g. an unknown/unprecedented cyber attack). This is a special case of classification and, at the same time, it is also a special case of anomaly detection, so-called semi-supervised outlier detection.

  • Sentiment Analysis: “… (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information” [Wikipedia Entry on Sentiment Analysis]

  • Community Detection in Social Networks: “In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally” [Wikipedia Entry on Community structure]

  • Link Analysis/Prediction: “In network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes… Link analysis has been used for investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer security analysis, search engine optimization, market research, medical research, and art” [Wikipedia Entry on Link Analysis]

The above list is by no means meant to be exhaustive. For instance, later on in this subject we will discuss a classic technique called Principal Component Analysis (PCA), which is widely used for dimensionality reduction and data visualisation. Since the material in our lecture notes cannot be exhaustive, we have set up a Discussion Board to further explore the current topic on Other Tasks in data mining.

Activities

Recommended Reading

See discussion board below.

Discussion Board

  • Do some research on your own about important tasks in data mining and their Main Challenges in the Era of Big Data.
  • Try to identify tasks that have not been explicitly discussed in the lecture notes above for week 1, and the main application domain(s) of each of them. These can (but do not necessarily need to) include the tasks briefly listed in the subsection Other Tasks above.
  • Discuss with your colleagues which of these tasks you judge the most important ones and why (in general or for you personally, in your workplace).
  • Discuss with your colleagues whether or not the tasks you have identified can be categorised as being variants, specialisations, or particular cases of one of the tasks that have been discussed in the lecture notes.
  • Does Big Data pose any particular challenges to these tasks?

Week 1 Quiz

Complete the week 1 online quiz in Blackboard Learn Ultra.

References

Han, J., M. Kamber, and J. Pei. 2012. Data Mining: Concepts and Techniques. 3rd ed. Morgan Kaufmann.

Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.

Tan, P.-N., M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley.

Zaki, M. J., and W. Meira Jr. 2014. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press.