Frequent pattern mining broadly refers to the process of finding strong associations between variables or sequences of values, which manifest themselves as recurring patterns in observed data. One of the frequent pattern mining techniques most widely used by data scientists in practice is association rule mining, which aims to detect interesting rules of association between the values of a collection of variables using some measure of interestingness. Even though association rules are nowadays applied to a variety of different fields, the most well known application is basket or transaction analysis.
In basket/transaction analysis the target database contains transactions, such as the collection of items purchased in a supermarket (hence the name “basket”). The goal is to find interesting rules of the type \(\{A_i, A_j, \ldots, A_k\} \implies A_m\), which would indicate that when items \(\{A_i, A_j, \ldots, A_k\}\) are purchased together it is likely that item \(A_m\) will be purchased as well. This can be very useful, e.g., for marketing strategies and decision making in a sales department, but association rules also find applications in a broad range of other domains, such as meteorology, biostatistics, and finance, just to mention a few.
Let’s consider the following illustrative example of a basket database containing 12 transactions:
| Transaction ID | ITEMS |
|---|---|
| 1 | COOKIES, JUICE |
| 2 | BREAD, BUTTER, DIAPERS, MILK |
| 3 | DIAPERS, JUICE |
| 4 | BREAD, BUTTER, DIAPERS, MILK |
| 5 | BREAD, BUTTER, DIAPERS, MILK |
| 6 | COOKIES, DIAPERS, JUICE |
| 7 | BREAD, BUTTER, COOKIES |
| 8 | BREAD, COOKIES, DIAPERS, MILK |
| 9 | BREAD, BUTTER, MILK |
| 10 | BREAD, DIAPERS, JUICE, MILK |
| 11 | BUTTER, DIAPERS |
| 12 | BREAD, BUTTER, DIAPERS |
For practical purposes, this type of database can be represented in an alternative way, where each item is considered a binary variable for which a value of 1 or 0 represents the presence or the absence of that item in a transaction, respectively. Because usually only the presence of items in transactions is relevant from a basket analysis perspective, these variables are treated as binary asymmetric. Transaction databases with binary asymmetric variables are very common in practical applications, so we will consider this case in the following discussions. Transaction databases with non-binary variables, such as those that account for the amount of a certain item (e.g. how many bottles of juice) in each transaction, are discussed e.g. in Chapter 7 of (Tan, Steinbach, and Kumar 2006) (see Recommended Reading).
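As a simple illustration of this representation (a minimal base-R sketch; the object names are just for this example), the first three transactions of Table 1 can be encoded as a binary matrix as follows:

items <- c("BREAD", "BUTTER", "COOKIES", "DIAPERS", "JUICE", "MILK")
tr <- list(c("COOKIES", "JUICE"),
           c("BREAD", "BUTTER", "DIAPERS", "MILK"),
           c("DIAPERS", "JUICE"))
# 1 if the item is present in the transaction, 0 otherwise (binary asymmetric representation)
bin_matrix <- t(sapply(tr, function(tx) as.integer(items %in% tx)))
dimnames(bin_matrix) <- list(paste0("Tr_", 1:3), items)
bin_matrix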
Before we present the most well known algorithm for association rule mining, the Apriori Algorithm, we need first to introduce the following fundamental definitions:
Definition 1 (Itemset): Let \(T\) be the collection of items in a transaction database. Any subset of items \(I \subseteq T\) is referred to as an itemset.
Examples of itemsets from the collection of items in Table 1, T = {Bread,Butter,Cookies,Diapers,Juice,Milk}, are I = {Milk}, I = {Bread,Juice}, I = {Bread,Butter,Cookies}, etc. Itemsets of interest in association rule mining can be an entire transaction or part of a transaction.
Definition 2 (Association Rule): An association rule \(R\) is a rule of the form \(I \implies J\), where \(I \subset T\) and \(J \subset T\) are disjoint itemsets, i.e., \(I \cap J = \emptyset\).
For example, given an itemset {A,B,C} with three items, the association rules \(R: I \implies J\) that can possibly follow from this itemset (assuming \(I, J \neq \emptyset\) and \(I \cup J = \{A,B,C\}\)) are:
- {A} \(\implies\) {B,C}
- {B} \(\implies\) {A,C}
- {C} \(\implies\) {A,B}
- {A,B} \(\implies\) {C}
- {A,C} \(\implies\) {B}
- {B,C} \(\implies\) {A}
Notice that an association rule \(R: I \implies J\) should not be interpreted as cause and effect. Rather, it should be interpreted as “when an itemset \(I\) is present in a transaction, itemset \(J\) is often also present”. How “often” can be quantified using the following definitions of support and confidence.
Definition 3 (Support of an Itemset): The support of an itemset \(I\), \(Sup(I)\), is the number of transactions that contain \(I\) in a transaction database.
For example, the support of itemset I = {Bread,Butter,Diapers} in Table 1 is Sup(I) = 4, while Sup({Bread,Butter}) = 6.
Definition 4 (Support of a Rule): The relative support of an association rule, \(R: I \implies J\), is the fraction of transactions that contain all the items involved in the rule (\(I \cup J\)), i.e., \(Sup(R) = Sup(I \cup J) / N\), where \(N\) is the total number of transactions.
Notice that the total number of transactions in the database is fixed, hence the support of a rule is determined by the support of itemset \(I \cup J\). For example, in Table 1 the total number of transactions is \(N = 12\), and the support of rule {Bread} \(\implies\) {Butter} is 6/12 = 0.5, because Sup({Bread,Butter}) = 6. This means that {Bread} and {Butter} are simultaneously present in 50% of the transactions. Rules with support above a certain user-defined threshold are sometimes called strong rules. The support tells only part of the story though. A supplementary measure of interestingness of an association rule is confidence.
Definition 5 (Confidence of a Rule): The confidence of an association rule, \(R: I \implies J\), is the fraction of transactions that contain all the items involved in the rule (\(I \cup J\)), among those that contain the items in the rule premise (\(I\)), i.e., \(Conf(R) = Sup(I \cup J) / Sup(I)\).
For example, the confidence of rule {Bread} \(\implies\) {Butter} is 6/8 = 0.75, because Sup({Bread,Butter}) = 6 and Sup({Bread}) = 8. This means that 75% of the transactions that contain {Bread} also contain {Butter}. Clearly, given two swapped rules \(R_1: I \implies J\) and \(R_2: J \implies I\), it follows that \(Sup(R_1) = Sup(R_2)\), but this property does not necessarily hold true for confidence. For instance, the confidence of rule {Butter} \(\implies\) {Bread} is 6/7 \(\approx\) 0.86, because Sup({Bread,Butter}) = 6 and Sup({Butter}) = 7, so approximately 86% of the transactions that contain {Butter} also contain {Bread}.
Notice that support and confidence address different aspects of interestingness of a rule. High values of support mean that the items in the rule occur simultaneously very often in transactions, while a high confidence value means that when an itemset \(I\) is present in a transaction, itemset \(J\) is often also present. Notice that some rules may have low support yet high confidence, for example rules involving highly specialised and/or expensive items which are almost always purchased together but account only for a very small fraction of the transactions in a department store.
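The support and confidence computations in the examples above can be reproduced with a few lines of base R. The following is a minimal sketch (the transaction list trans and the helpers sup_count() and conf() are illustrative names, not part of any package):

# Transactions from Table 1 (IDs 1 to 12)
trans <- list(c("COOKIES","JUICE"),                 c("BREAD","BUTTER","DIAPERS","MILK"),
              c("DIAPERS","JUICE"),                 c("BREAD","BUTTER","DIAPERS","MILK"),
              c("BREAD","BUTTER","DIAPERS","MILK"), c("COOKIES","DIAPERS","JUICE"),
              c("BREAD","BUTTER","COOKIES"),        c("BREAD","COOKIES","DIAPERS","MILK"),
              c("BREAD","BUTTER","MILK"),           c("BREAD","DIAPERS","JUICE","MILK"),
              c("BUTTER","DIAPERS"),                c("BREAD","BUTTER","DIAPERS"))
# Sup(I): number of transactions containing all items of itemset I (Definition 3)
sup_count <- function(I, db) sum(sapply(db, function(tx) all(I %in% tx)))
# Conf(I => J) = Sup(I union J) / Sup(I) (Definition 5)
conf <- function(I, J, db) sup_count(c(I, J), db) / sup_count(I, db)
sup_count(c("BREAD","BUTTER"), trans)   # 6
conf("BREAD", "BUTTER", trans)          # 6/8  = 0.75
conf("BUTTER", "BREAD", trans)          # 6/7 ~= 0.86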
Association rules can be mined using two user-defined thresholds, \(Sup_{min}\) and \(Conf_{min}\), for support and confidence, respectively. The basic problem is then to find the rules \(R\) for which \(Sup(R) \geq Sup_{min}\) and \(Conf(R) \geq Conf_{min}\). The most usual strategy to address this problem is to first find all the itemsets that meet the minimum support threshold. These itemsets are called frequent itemsets. Once all frequent itemsets have been retrieved, the low confidence rules deriving from these itemsets are filtered out, and only those rules that meet the minimum confidence threshold are returned to the analyst.
A basic, naive algorithm for association rule mining following the above strategy could be as follows:
Naive Algorithm:
1. Initialise the set of frequent itemsets, \(F = \emptyset\).
2. For each itemset size \(k = 1, 2, \ldots, |T|\):
   i. Generate all possible candidate itemsets of size \(k\);
   ii. Scan the database to compute the support of each candidate;
   iii. Add to \(F\) those candidates whose support is at least \(Sup_{min}\).
3. Generate all possible rules \(R: I \implies J\) from the frequent itemsets in \(F\), and return only those with \(Conf(R) \geq Conf_{min}\).
Although this algorithm will work just fine for small databases such as the one in Table 1, there is a fundamental problem that makes it prohibitive for practical applications involving databases with a large number of transactions and, especially, with a large number of items. The problem is that there are \(2^{|T|} - 1\) possible itemsets of \(|T|\) items, and all of those itemsets are considered as candidates in the naive algorithm above.
For instance, if there are 3 items, A, B and C, there are \(2^3 - 1 = 7\) candidate itemsets, namely, {A}, {B}, {C}, {A,B}, {A,C}, {B,C}, {A,B,C}. But if there are, say, 20 items, the number of itemsets is already larger than a million, and can quickly become astronomical by slightly increasing \(|T|\) further. Apart from the time to traverse the database and compute their support, just generating these itemsets takes \(O(2^{|T|})\) time, i.e., it has exponential computational complexity in the number of items. This is not doable in most practical applications unless clever strategies to reduce the computational burden of the algorithm are in place.
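To make the combinatorial blow-up concrete, the following base-R sketch enumerates every candidate itemset with combn() and counts its support by a full database scan, in the spirit of the naive algorithm (it reuses the trans list from the sketch above). Even for the 6 items of Table 1 this already produces \(2^6 - 1 = 63\) candidates, each requiring a pass over the database:

# Naive (brute-force) frequent itemset enumeration -- for illustration only
sup_count <- function(I, db) sum(sapply(db, function(tx) all(I %in% tx)))  # as in the previous sketch
items <- sort(unique(unlist(trans)))
candidates <- unlist(lapply(seq_along(items),
                            function(k) combn(items, k, simplify = FALSE)),
                     recursive = FALSE)
length(candidates)                      # 2^6 - 1 = 63 candidate itemsets
supports <- sapply(candidates, sup_count, db = trans)
frequent <- candidates[supports >= 5]   # keep candidates with support count >= Sup_min (e.g. a count of 5)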
There is a broad literature on strategies to speed up different aspects of the association rule mining process, and a detailed discussion of this topic is beyond the scope of this subject. Here, we limit the discussion to the most fundamental principle, the apriori principle, which serves as a basis for a variety of these strategies. In particular, we focus on the main idea behind a very popular association rule mining algorithm, Apriori, which follows this principle.
The Apriori Algorithm is possibly the most well known of many algorithms based on the apriori principle:
Apriori Principle: if an itemset is frequent in the sense that its support is above the minimum threshold, then any of its subsets is also necessarily frequent. Conversely, if an itemset is infrequent in the sense that its support is below the minimum threshold, then any of its supersets is also necessarily infrequent.
This principle results from the following property of support:
Anti-Monotonicity: for any itemsets \(X\) and \(Y\) such that \(X \subseteq Y\), it follows that \(Sup(X) \geq Sup(Y)\).
The anti-monotonicity property clearly follows from the fact that all transactions containing an itemset \(Y\) will also contain any subset \(X\) of \(Y\), and by removing items from \(Y\) it is possible that the resulting subset \(X\) will also be contained in other transactions that do not contain \(Y\). For instance, let us consider again the example in Table 1. The support of itemset \(Y\) = {Bread, Butter, Diapers} is \(Sup(Y)\) = 4 because this itemset is contained in 4 transactions (IDs 2, 4, 5, and 12). Any itemset \(X \subseteq Y\), e.g. {Bread, Butter}, will also be contained in those 4 transactions, so \(Sup(X)\) must be at least 4. But since {Bread,Butter} also appears in 2 other transactions (IDs 7 and 9), its support is actually \(Sup(X) = 6\).
The apriori principle results directly from the anti-monotonicity property because, if \(Y\) is a frequent itemset, then \(Sup(Y) \geq Sup_{min}\), hence for any \(X \subseteq Y\) it follows that \(Sup(X) \geq Sup(Y) \geq Sup_{min}\), i.e., \(X\) is also a frequent itemset. Conversely, if \(X\) is not frequent, i.e., \(Sup(X) < Sup_{min}\), the anti-monotonicity property ensures that \(Y\) cannot be frequent, i.e., \(Sup(Y) \leq Sup(X) < Sup_{min}\). This gives rise to the following pruning rule that allows the Apriori Algorithm to neglect all candidate itemsets for which any of their subsets has already been found to be infrequent:
Pruning Rule: Including new items into an itemset \(X\) that has already been found to be infrequent is useless as it can only produce infrequent itemsets \(Y \supset X\), so, once an infrequent itemset \(X\) is discovered, no candidate itemset that is a superset of \(X\) needs to be checked; these can be discarded beforehand.
This pruning rule can be used in different ways to speed up the Naive Algorithm previously discussed. The simplest way is the following: in Step 2.i of the algorithm, instead of generating all possible itemsets of size \(k\), we generate only those obtained by taking the union of a frequent itemset of size \(k-1\) with a frequent itemset of size 1 (in particular, whenever the two are disjoint, the resulting candidate has exactly \(k\) items).
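The following base-R sketch illustrates this level-wise generation strategy (it is not the full Apriori Algorithm: it reuses trans and sup_count() from the earlier sketches, and it omits the additional pruning of candidates that contain an infrequent subset, which is discussed in the worked example below):

sup_min  <- 5                       # minimum support count used in the worked example below
items    <- sort(unique(unlist(trans)))
f1_items <- items[sapply(items, sup_count, db = trans) >= sup_min]  # frequent single items
Fk       <- as.list(f1_items)       # frequent itemsets of the current size (k = 1 to start)
frequent <- Fk                      # all frequent itemsets found so far
while (length(Fk) > 0) {
  # candidates of size k+1: each frequent k-itemset extended with a frequent 1-item not in it
  cand <- list()
  for (I in Fk) for (i in setdiff(f1_items, I)) cand <- c(cand, list(sort(c(I, i))))
  cand <- unique(cand)
  if (length(cand) == 0) break
  Fk <- cand[sapply(cand, sup_count, db = trans) >= sup_min]
  frequent <- c(frequent, Fk)
}
length(frequent)                    # 10 frequent itemsets, matching Tables 2, 3 and 4 below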
As an example, let’s consider again the database in Table 1. Assume that we want to find the association rules \(R: I \implies J\) for which \(Sup(R) \geq 5/12\) and \(Conf(R) = 1\). Notice that the database contains \(N = 12\) transactions and, according to Definition 4, \(Sup(R) = Sup(I \cup J)/N\), hence the requirement \(Sup(R) \geq 5/12\) means that the support of the itemset (Definition 3) \(I \cup J\) containing the items in any rule \(R\) that meets the requirement must be such that \(Sup(I \cup J) \geq 5\). In other words, valid candidate itemsets must appear in at least 5 transactions.
The algorithm starts by considering itemsets of size \(k = 1\). From Table 1 it is easy to compute the support of each candidate as follows:
| Itemset | Support |
|---|---|
| {Bread} | 8 |
| {Butter} | 7 |
| {Cookies} | 4 |
| {Diapers} | 9 |
| {Juice} | 4 |
| {Milk} | 6 |
Clearly, {Cookies} and {Juice} are infrequent and, according to the apriori principle, no itemset containing these items can be frequent, thus they will not be considered any further, saving the algorithm a significant amount of runtime.
Once the infrequent itemsets of size \(k = 1\) have been filtered out, we can combine the remaining (i.e. frequent) candidates to produce candidate itemsets of size \(k = 2\). By doing so and scanning Table 1 to compute their support we get the following:
| Itemset | Support |
|---|---|
| {Bread,Butter} | 6 |
| {Bread,Diapers} | 6 |
| {Bread,Milk} | 6 |
| {Butter,Diapers} | 5 |
| {Butter,Milk} | 4 |
| {Diapers,Milk} | 5 |
Clearly, {Butter,Milk} is infrequent and, according to the apriori principle, no itemset containing these two items together can be frequent, thus such itemsets will not be considered any further. Once this infrequent itemset of size \(k = 2\) has been filtered out, we can combine the remaining (i.e. frequent) candidates of size \(k = 2\) with the corresponding disjoint frequent candidates of size \(k = 1\) to produce candidate itemsets of size \(k = 3\). By doing so and scanning Table 1 to compute their support we get the following:
| Itemset | Support |
|---|---|
| {Bread,Butter,Diapers} | 4 |
| {Bread,Butter,Milk} | — |
| {Bread,Diapers,Milk} | 5 |
| {Butter,Diapers,Milk} | — |
Notice that the supports of itemsets {Bread,Butter,Milk} and {Butter,Diapers,Milk} have not been displayed in Table 4 to call attention to the fact that the algorithm can save time by not computing them. The reason why these two values do not need to be computed is that the algorithm already knows beforehand that the corresponding itemsets cannot be frequent, as they contain an infrequent subset, namely, {Butter,Milk} (see Table 3).
Finally, from the only frequent itemset of size \(k = 3\) in Table 4 ({Bread,Diapers,Milk}) and the disjoint frequent itemsets of size \(k = 1\) in Table 2 we can produce a single candidate itemset of size \(k = 4\), {Bread,Butter,Diapers,Milk}, which, again, we already know cannot be frequent as it contains infrequent subsets (such as {Butter,Milk}).
At this point, the frequent itemset generation procedure stops as no more frequent itemsets can be discovered. The frequent itemsets discovered by the algorithm are those meeting the minimum support count of 5 in Tables 2, 3, and 4 (four of size 1, five of size 2, and one of size 3). Association rules can then be generated from the six frequent itemsets of sizes 2 and 3, and filtered according to the confidence threshold (any rule with confidence less than 100% should be discarded). The rule generation and filtering procedure can also be implemented in an efficient way by making use of the apriori principle, but in a slightly different way — see e.g. (Tan, Steinbach, and Kumar 2006) for details. For the sake of simplicity, here we just list all possible rules; computing their support and confidence values is left as an exercise.
Candidate rules from the itemsets of size \(k = 2\) (compute Sup and Conf of the rules manually yourself as an EXERCISE!):
- R1: {Bread} \(\implies\) {Butter}
- R2: {Butter} \(\implies\) {Bread}
- R3: {Bread} \(\implies\) {Diapers}
- R4: {Diapers} \(\implies\) {Bread}
- R5: {Bread} \(\implies\) {Milk}
- R6: {Milk} \(\implies\) {Bread}
- R7: {Butter} \(\implies\) {Diapers}
- R8: {Diapers} \(\implies\) {Butter}
- R9: {Diapers} \(\implies\) {Milk}
- R10: {Milk} \(\implies\) {Diapers}
We can see that only one rule (R6: {Milk} \(\implies\) {Bread}) satisfies the minimum required support of 5/12 and confidence of 1.
Exercise: List the candidate rules from the itemset of size \(k = 3\), compute their support and confidence, and indicate those rules (if any) that meet the requirements.
Frequent itemsets and association rules can be easily mined in R, using, e.g., package arules. This package can handle different representations of transaction data, and contains functions that allow conversions (coercing) between these representations, such as data frames, binary matrices, and lists. Since transaction data is usually very sparse, because each transaction tends to contain only a very small fraction of the items in the database, one of the most efficient data representations for this type of data is a list of lists. In particular, we can represent a transaction database as a list of transactions, where each element of this list (i.e., a transaction) is a list on its own (i.e., a list of items). The following R code represents the data in Table 1 using this strategy (with transactions represented as a vector of strings):
Tr_list <- list(
c("COOKIES", "JUICE"),
c("BREAD", "BUTTER", "DIAPERS", "MILK"),
c("DIAPERS", "JUICE"),
c("BREAD", "BUTTER", "DIAPERS", "MILK"),
c("BREAD", "BUTTER", "DIAPERS", "MILK"),
c("COOKIES", "DIAPERS", "JUICE"),
c("BREAD", "BUTTER", "COOKIES"),
c("BREAD", "COOKIES", "DIAPERS", "MILK"),
c("BREAD", "BUTTER", "MILK"),
c("BREAD", "DIAPERS", "JUICE", "MILK"),
c("BUTTER", "DIAPERS"),
c("BREAD", "BUTTER", "DIAPERS")
)
names(Tr_list) <- paste("Tr_ID_",c(1:12),sep="") # Just give names "Tr_ID_1", ...,"Tr_ID_12" to transactions
head(Tr_list, 5) # Shows first 5 transactions
## $Tr_ID_1
## [1] "COOKIES" "JUICE"
##
## $Tr_ID_2
## [1] "BREAD" "BUTTER" "DIAPERS" "MILK"
##
## $Tr_ID_3
## [1] "DIAPERS" "JUICE"
##
## $Tr_ID_4
## [1] "BREAD" "BUTTER" "DIAPERS" "MILK"
##
## $Tr_ID_5
## [1] "BREAD" "BUTTER" "DIAPERS" "MILK"
Data set Tr_list can then be converted to a standard form used by the functions in package arules, namely, an object of class transactions.
library(arules)
Tr_object <- as(Tr_list, "transactions")
We can inspect this type of object in different ways, for instance using summary() or inspect():
summary(Tr_object)
## transactions as itemMatrix in sparse format with
## 12 rows (elements/itemsets/transactions) and
## 6 columns (items) and a density of 0.5277778
##
## most frequent items:
## DIAPERS BREAD BUTTER MILK COOKIES (Other)
## 9 8 7 6 4 4
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4
## 3 4 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.750 3.000 3.167 4.000 4.000
##
## includes extended item information - examples:
## labels
## 1 BREAD
## 2 BUTTER
## 3 COOKIES
##
## includes extended transaction information - examples:
## transactionID
## 1 Tr_ID_1
## 2 Tr_ID_2
## 3 Tr_ID_3
inspect(Tr_object)
## items transactionID
## [1] {COOKIES,JUICE} Tr_ID_1
## [2] {BREAD,BUTTER,DIAPERS,MILK} Tr_ID_2
## [3] {DIAPERS,JUICE} Tr_ID_3
## [4] {BREAD,BUTTER,DIAPERS,MILK} Tr_ID_4
## [5] {BREAD,BUTTER,DIAPERS,MILK} Tr_ID_5
## [6] {COOKIES,DIAPERS,JUICE} Tr_ID_6
## [7] {BREAD,BUTTER,COOKIES} Tr_ID_7
## [8] {BREAD,COOKIES,DIAPERS,MILK} Tr_ID_8
## [9] {BREAD,BUTTER,MILK} Tr_ID_9
## [10] {BREAD,DIAPERS,JUICE,MILK} Tr_ID_10
## [11] {BUTTER,DIAPERS} Tr_ID_11
## [12] {BREAD,BUTTER,DIAPERS} Tr_ID_12
We can also inspect it visually as a binary matrix:
image(Tr_object)
Fig. 1. Illustrative data set in Table 1: visualisation of object of class transactions
As a transactions object, data set Tr_object can be given as input to function apriori() from package arules, which can be used to extract frequent itemsets and/or association rules from the data, depending on the list of parameters passed as argument. The parameter list allows the user to set a large variety of attributes, the most common of which are:
- support (value within \([0,1]\)): the minimum relative support for itemsets or rules;
- confidence (value within \([0,1]\)): the minimum confidence for rules (therefore used only when extracting rules);
- minlen and maxlen (positive integers): the minimum and maximum number of items in a frequent itemset or rule to be returned, respectively;
- target: the type of result to be returned (e.g. “frequent itemsets” or “rules”);
- maxtime: the maximum time allowed for subset checking, in seconds.
If a parameter is not specified, default values are used (support = 0.1, confidence = 0.8, minlen = 1, maxlen = 10, maxtime = 5).
For instance, the following R code extracts from Tr_object the frequent itemsets with minimum support of 5/12, and then displays them in decreasing order of support (compare with Tables 2, 3, and 4):
FI <- apriori(Tr_object, parameter = list(support = 5/12, target = "frequent itemsets"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support
## NA 0.1 1 none FALSE TRUE 5 0.4166667
## minlen maxlen target ext
## 1 10 frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 5
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 12 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [10 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(sort(FI, by="support"))
## items support count
## [1] {DIAPERS} 0.7500000 9
## [2] {BREAD} 0.6666667 8
## [3] {BUTTER} 0.5833333 7
## [4] {MILK} 0.5000000 6
## [5] {BREAD,BUTTER} 0.5000000 6
## [6] {BREAD,MILK} 0.5000000 6
## [7] {BREAD,DIAPERS} 0.5000000 6
## [8] {BUTTER,DIAPERS} 0.4166667 5
## [9] {DIAPERS,MILK} 0.4166667 5
## [10] {BREAD,DIAPERS,MILK} 0.4166667 5
The following R code extracts from Tr_object the association rules with minimum support of 5/12 and confidence 100% (notice that, since we are interested in rules with at least one item on the left-hand-side and at least one item on the right-hand-side, i.e., \(R: I \implies J\) where both \(I\) and \(J\) are non-empty, we set minlen to 2 in this case):
AR <- apriori(Tr_object, parameter = list(support = 5/12, confidence = 1, minlen = 2, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support
## 1 0.1 1 none FALSE TRUE 5 0.4166667
## minlen maxlen target ext
## 2 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 5
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 12 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(AR)
## lhs rhs support confidence lift count
## [1] {MILK} => {BREAD} 0.5000000 1 1.5 6
## [2] {DIAPERS,MILK} => {BREAD} 0.4166667 1 1.5 5
Notice that, as expected, the resulting two rules match those obtained from the manual analysis: rule R6 above, plus the rule derived from the frequent itemset of size \(k = 3\) in the earlier exercise.
Further details on more sophisticated functionalities of apriori() and other functions from the arules package (e.g., how to enforce rule templates/constraints, such as find only rules with a specific variable on the left- or right-hand-side, etc.) can be found e.g. in the CRAN Package Documentation or in the Arules Tutorial by Michael Hahsler.
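As one small example of such constraints (a hedged sketch based on the appearance argument of apriori() as documented in arules; behaviour may vary slightly across package versions), the following restricts the search to rules with {BREAD} on the right-hand-side:

AR_bread <- apriori(Tr_object,
                    parameter  = list(support = 5/12, confidence = 1, minlen = 2, target = "rules"),
                    appearance = list(rhs = "BREAD", default = "lhs"))
# For this small example the result should be the same two rules as above,
# since both already have BREAD on the right-hand-side
inspect(sort(AR_bread, by = "confidence"))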
Exercise: Consider the following data set with 10 transactions. The data is represented as a binary matrix where each column (except for the 1st, which contains the transaction ID) stands for an item and the presence or absence of that item in each transaction is denoted by a value 1 or 0, respectively.
| ID | Milk | Coffee | Tea | Bread | Butter | Rice | Beans |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
This data set is available in a CSV file 10_Groceries_Transactions.csv. This file has a header with the variable names in the 1st line, the other 10 lines represent the transactions. Each line contains 8 values (columns) separated by comma (“,”). In this exercise you are asked to:
1. Read the CSV file into a data frame named TR_10_Frame using read.table();
2. Convert the data frame into a binary matrix named TR_Matrix;
3. Coerce this matrix into an object of class transactions using function as(), and name this object TR_obj;
4. Inspect the object using summary(), inspect() and image();
5. Use apriori() and inspect() to generate the rules with minimum relative support of 3/10 and minimum confidence of 9/10.

Additional Activity and YouTube Video: Explore the transaction data Groceries from package arules (9835 transactions of 169 items). Inspect the data and mine frequent itemsets as well as association rules, trying different values for the minimum support threshold, confidence threshold, and other parameters in the apriori() function. For instance, try to reproduce the results in this YouTube Video (which also includes extensions of the arules package for visualisation of frequent itemsets and association rules).
Recommendation is a predictive task in which one wants to predict whether or not (or to which extent) an individual will like a certain item, so that the item may possibly be recommended to that individual. Targeted individuals are typically online users (online shoppers, social media followers, streaming video/music subscribers, etc.) and recommended items are typically products (e.g. books, movies, songs, electronics) or other users (e.g. a match in a dating site or a new friend/professional contact in a social networking platform).
A recommender system … is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to an item… Recommender systems typically produce a list of recommendations in one of two ways - through collaborative filtering or through content-based filtering… Collaborative filtering approaches build a model from a user’s past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties [Wikipedia Entry on Recommender Systems]
The differences… can be demonstrated by comparing two popular music recommender systems - Last.fm and Pandora Radio. Last.fm creates a “station” of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user’s library, but are often played by other users with similar interests… it is an example of a collaborative filtering technique. Pandora uses the properties of a song or artist… to seed a “station” that plays music with similar properties… deemphasizing certain attributes when a user “dislikes” a particular song and emphasizing other attributes when a user “likes” a song. This is an example of a content-based approach. Each type of system has its strengths and weaknesses. [Wikipedia Entry on Recommender Systems]
Customer analysis and targeted advertising has been around for decades, but the boom in modern recommendation systems has been dramatically leveraged by the advent of the internet. As an example of its relevance in the online domain, in 2006, Netflix launched an open competition challenging individuals or teams worldwide to develop new strategies aimed to improve on the algorithm used by the company back then to predict user ratings for films, based on previous ratings. Predicted ratings are important because they are used by the company to recommend content to users. Among other smaller prizes, Netflix offered a grand prize of US$1,000,000 to the first submission able to improve by at least 10% on the rating prediction accuracy of the company’s algorithm. The prize, so-called Netflix Prize, was awarded in 2009. Besides publicity, the company got access to new ideas and algorithms from people worldwide. An important share of the online business in companies such as Netflix and Amazon comes from recommending products to customers.
In the following, we discuss two of the most popular methods for rating prediction, User-Based KNN and Item-Based KNN, which are both based on the idea of \(k\) Nearest Neighbours, already studied in this subject in the context of classification and outlier detection. Both these methods operate on an \(N \times n\) user-item matrix, where each row corresponds to a user, each column corresponds to an item, and an entry \((u,i)\) contains a rating \(r_{ui}\) given by user \(u\) to item \(i\). These ratings can be positive numerical values, typically integers from 1 (very low) to 5 (very high), in which case a value of 0 (zero) means that user \(u\) has not rated item \(i\); or they can be binary, in which case a value of 1 means that user \(u\) is known (or believed — e.g. based on his/her shopping track record) to like item \(i\), whereas a value of 0 means that there is no information available about the interaction of user \(u\) with item \(i\) (the user may not know the item, may not like the item, or may even like the item without this information being available in the database). The goal is to predict the missing ratings, and then recommend items \(i\) to users \(u\) whose predicted ratings \(\hat{r}_{ui}\) are high.
User-based Collaborative Filtering makes the assumption that users with similar preferences tend to rate items similarly. From this assumption, missing ratings for a user \(u\) can then be predicted by somehow aggregating the ratings of a subset of the users most similar to \(u\). The User-Based KNN method takes the \(k\) nearest neighbours of \(u\), \(NN_k(u)\), as a reference subset of the users most similar to \(u\). If the rating of user \(u\) to an item \(i\) is missing (i.e., \(r_{ui} = 0\)), the algorithm estimates it, in its simplest form, by just computing the average rating given to item \(i\) by those neighbours of \(u\) that have rated item \(i\), i.e.:
Equation 1: \[\hat{r}_{ui} = \displaystyle \frac{\sum_{u' \in NN_k(u)} r_{u'i} \cdot \delta_{u'i}}{\sum_{u' \in NN_k(u)} \delta_{u'i}}\]
where \(\delta\) is an indicator function such that \(\delta_{u'i} = 0\) if \(r_{u'i} = 0\) and \(\delta_{u'i} = 1\) otherwise.
This basic idea can be further improved in different ways. For example, instead of the arithmetic mean, a weighted average can be used where each rating \(r_{u'i}\) is weighted by the similarity between the target user \(u\) and its respective neighbour, \(u'\). This can be readily achieved by just replacing \(\delta_{u'i}\) with \(s(u,u')\) in Equation 1, where \(s(u,u')\) is a measure of similarity between users \(u\) and \(u'\) (typically the Pearson correlation or Cosine measure between the corresponding rows of the user-item matrix).
Another usual enhancement of the basic idea is to normalise the original user-item matrix to remove undesired biases, e.g., by subtracting from the non-zero ratings the (row-wise) mean ratings of the respective user, which accounts for biases caused by users who tend to give lower/higher scores overall. For more details, discussions and examples we refer the student, e.g., to (Hahsler 2017) and its vignette.
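To make these ideas concrete, the following base-R sketch (an illustration only, not the rrecsys implementation used below) predicts one missing rating with Equation 1 and with its similarity-weighted variant; the matrix R_mat and the choices of u, i and k are hypothetical:

# Minimal sketch of User-Based KNN rating prediction for a single user-item pair
set.seed(1)
R_mat <- matrix(sample(0:5, 6 * 5, replace = TRUE), nrow = 6)  # 6 users x 5 items; 0 = not rated
u <- 1; i <- 2; k <- 3                                         # illustrative choices
R_mat[u, i] <- 0                                               # make sure the target rating is missing
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
others <- setdiff(1:nrow(R_mat), u)
sims   <- sapply(others, function(v) cos_sim(R_mat[u, ], R_mat[v, ]))
nn     <- others[order(sims, decreasing = TRUE)][1:k]          # k most similar users to u
w      <- sims[order(sims, decreasing = TRUE)][1:k]            # their similarities to u
rated  <- R_mat[nn, i] > 0                                     # neighbours that actually rated item i
r_hat_plain    <- mean(R_mat[nn[rated], i])                    # Equation 1 (NaN if no neighbour rated i)
r_hat_weighted <- sum(w[rated] * R_mat[nn[rated], i]) / sum(w[rated])  # similarity-weighted variant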
User-based KNN implementations are available in different R packages. Here we will use the implementation in package rrecsys. As an illustrative example, let’s consider a user-item rating matrix with \(N = 20\) users, \(n = 10\) items, and ratings \(r_{ui} \in \{1,2,3,4,5\}\) (\(r_{ui}=0\) means that user \(u\) has not rated item \(i\)). The matrix is intentionally generated in a way that the first (last) 10 users tend to give high (low) ratings to the first 5 items yet low (high) ratings to the last 5 items. We could think of the first 5 items as horror movies and the last 5 items as comedy movies, and users who tend to prefer horror to comedy or vice versa. There is a 60% probability that a user will give a rating to an item, so we expect around 40% of empty entries.
set.seed(0)
ratings_11 <- matrix(sample(c(0:5),size=50,replace=TRUE,prob=c(.4,.02,.02,.04,.26,.26)), nrow=10, byrow=TRUE)
ratings_12 <- matrix(sample(c(0:5),size=50,replace=TRUE,prob=c(.4,.26,.26,.04,.02,.02)), nrow=10, byrow=TRUE)
ratings_21 <- matrix(sample(c(0:5),size=50,replace=TRUE,prob=c(.4,.26,.26,.04,.02,.02)), nrow=10, byrow=TRUE)
ratings_22 <- matrix(sample(c(0:5),size=50,replace=TRUE,prob=c(.4,.02,.02,.04,.26,.26)), nrow=10, byrow=TRUE)
myratings <- rbind(cbind(ratings_11,ratings_12),cbind(ratings_21,ratings_22))
myratings
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 4 0 0 5 4 1 2 1 2 0
## [2,] 0 4 3 4 5 0 0 0 2 1
## [3,] 0 0 0 4 0 2 1 0 2 0
## [4,] 4 5 4 1 0 2 0 2 1 0
## [5,] 4 3 0 5 0 1 0 1 0 0
## [6,] 0 0 0 0 4 2 1 1 0 1
## [7,] 0 5 5 5 0 5 2 1 0 0
## [8,] 4 4 4 0 4 1 0 1 0 0
## [9,] 5 4 5 4 5 0 0 0 2 1
## [10,] 5 4 0 5 4 1 1 2 2 1
## [11,] 2 2 0 0 4 4 5 5 0 5
## [12,] 2 0 0 2 3 5 0 5 0 0
## [13,] 2 5 1 0 2 0 0 4 5 4
## [14,] 0 0 1 0 2 4 5 0 0 4
## [15,] 2 4 2 2 0 0 5 4 4 0
## [16,] 1 2 2 0 0 0 4 5 4 5
## [17,] 2 2 0 0 2 4 0 0 4 5
## [18,] 3 2 2 2 4 4 0 4 4 3
## [19,] 2 1 2 0 0 5 4 0 0 3
## [20,] 1 2 0 1 0 0 5 0 4 0
Once we have the user-item matrix, we can use function defineData() from package rrecsys to convert the matrix to a form that can be used by the recommender algorithms in that package. Specifically, defineData() converts an explicit or a sparse user-item matrix representation into an object of class dataSet. By default it keeps ratings within the interval [minimum,maximum] = [0.5, 5], but changing the default value of attribute binary to TRUE and setting a threshold positiveThreshold would result in a binary matrix where ratings above the threshold become 1 and the others become 0. In our example, we will work with the ratings as they are, so we don’t change the default settings:
library(rrecsys)
my_data_object <- defineData(myratings)
Now we can call the User-Based KNN method by setting attribute alg to UBKNN in function rrecsys():
r_model <- rrecsys(my_data_object, alg = "UBKNN", simFunct = "cos", neigh = 5)
Notice that attribute neigh corresponds to parameter \(k\) of the method, i.e., the size of the neighbourhood. Different values of \(k\) will produce different models. Once a model has been obtained, predictions can be made using function predict():
pred <- predict(r_model)
In predict(), all unrated items are predicted and the entire matrix is returned with the new ratings. To focus only on the predictions for the originally missing ratings, we replace the originally present ratings with NA:
estimated_ratings <- ifelse(myratings != 0, NA, pred)
print(estimated_ratings, digits = 4)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA 3.068 3.022 NA NA NA NA NA NA 1.2190
## [2,] 3.873 NA NA NA NA 2.780 3.167 3.2680 NA NA
## [3,] 3.851 3.317 2.250 NA 3.689 NA NA 0.9529 NA 0.6217
## [4,] NA NA NA NA 3.812 NA 2.714 NA NA 1.2474
## [5,] NA NA 3.940 NA 3.970 NA 1.376 NA 1.780 2.8000
## [6,] 2.303 1.317 1.800 2.997 NA NA NA NA 1.568 NA
## [7,] 5.000 NA NA NA 5.000 NA NA NA 2.762 3.8333
## [8,] NA NA NA 4.088 NA NA 3.000 NA 2.489 1.5537
## [9,] NA NA NA NA NA 2.793 3.714 3.4552 NA NA
## [10,] NA NA 2.620 NA NA NA NA NA NA NA
## [11,] NA NA 2.385 3.643 NA NA NA NA 4.207 NA
## [12,] NA 3.493 3.643 NA NA NA 3.400 NA 2.982 3.3261
## [13,] NA NA NA 3.313 NA 3.286 3.453 NA NA NA
## [14,] 2.813 2.093 NA 3.200 NA NA NA 3.4860 3.200 NA
## [15,] NA NA NA NA 3.671 2.874 NA NA NA 3.4986
## [16,] NA NA NA 1.983 3.187 3.286 NA NA NA NA
## [17,] NA NA 2.082 3.167 NA NA 3.469 3.9250 NA NA
## [18,] NA NA NA NA NA NA 3.505 NA NA NA
## [19,] NA NA NA 2.833 2.504 NA NA 4.0793 3.646 NA
## [20,] NA NA 1.619 NA 2.600 2.600 NA 2.5978 NA 1.5413
It can be difficult to interpret real-valued predicted ratings considering that the original ratings are integers, so we can use attribute Round = TRUE in predict() to round the predictions (as the output below shows, values are rounded to the nearest half point):
pred <- predict(r_model, Round = TRUE)
estimated_ratings <- ifelse(myratings != 0, NA, pred)
print(estimated_ratings)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA 3.0 3.0 NA NA NA NA NA NA 1.0
## [2,] 4.0 NA NA NA NA 3.0 3.0 3.5 NA NA
## [3,] 4.0 3.5 2.0 NA 3.5 NA NA 1.0 NA 0.5
## [4,] NA NA NA NA 4.0 NA 2.5 NA NA 1.0
## [5,] NA NA 4.0 NA 4.0 NA 1.5 NA 2.0 3.0
## [6,] 2.5 1.5 2.0 3.0 NA NA NA NA 1.5 NA
## [7,] 5.0 NA NA NA 5.0 NA NA NA 3.0 4.0
## [8,] NA NA NA 4.0 NA NA 3.0 NA 2.5 1.5
## [9,] NA NA NA NA NA 3.0 3.5 3.5 NA NA
## [10,] NA NA 2.5 NA NA NA NA NA NA NA
## [11,] NA NA 2.5 3.5 NA NA NA NA 4.0 NA
## [12,] NA 3.5 3.5 NA NA NA 3.5 NA 3.0 3.5
## [13,] NA NA NA 3.5 NA 3.5 3.5 NA NA NA
## [14,] 3.0 2.0 NA 3.0 NA NA NA 3.5 3.0 NA
## [15,] NA NA NA NA 3.5 3.0 NA NA NA 3.5
## [16,] NA NA NA 2.0 3.0 3.5 NA NA NA NA
## [17,] NA NA 2.0 3.0 NA NA 3.5 4.0 NA NA
## [18,] NA NA NA NA NA NA 3.5 NA NA NA
## [19,] NA NA NA 3.0 2.5 NA NA 4.0 3.5 NA
## [20,] NA NA 1.5 NA 2.5 2.5 NA 2.5 NA 1.5
In recommendation, the usual goal is to recommend items to users that are likely to appreciate those items, so, typically, the focus is on user-item pairs with high predicted ratings. We can write a function top_m() that extracts the user and item indexes, \(u\) and \(i\), corresponding to the m highest estimated ratings:
top_m <- function(M, m){
  # Indexes (user, item) of all entries of M greater than or equal to the m-th largest value
  top_n_indexes <- which(M >= sort(M, decreasing=TRUE)[m], arr.ind=TRUE)
  # Order those entries by decreasing predicted rating
  top_n_order <- order(M[top_n_indexes], decreasing=TRUE)
  sorted_ind <- top_n_indexes[top_n_order,]
  sorted_val <- M[sorted_ind]
  # Return the top m (user, item, predicted rating) triples
  return(data.frame(user=sorted_ind[,1], item=sorted_ind[,2], pred_rating=sorted_val)[1:m,])
}
We can then use this function to extract, say, the top 10 ratings for recommendation:
(top_ratings <- top_m(M = estimated_ratings, m = 10))
## user item pred_rating
## 1 7 1 5
## 2 7 5 5
## 3 2 1 4
## 4 3 1 4
## 5 5 3 4
## 6 8 4 4
## 7 4 5 4
## 8 5 5 4
## 9 17 8 4
## 10 19 8 4
Notice that, in all top 10 recommendations above, users \(u \leq 10\) have been recommended items \(i \leq 5\), whereas users \(u > 10\) have been recommended items \(i > 5\).
Exercise: Modify function top_m(M,m) to get a function top_val(M,val) that returns the user and item indexes, \(u\) and \(i\), corresponding to all the estimated ratings \(\hat{r}_{ui} \geq val\), then use this function to extract all the pairs user-item with \(\hat{r}_{ui} \geq 4\) (along with their estimated ratings).
In a variety of applications the analyst may not need the estimated ratings for all user-item pairs, but rather only the top \(m\) recommendations for each user. There are ways to speed up computations when only the top \(m\) recommendations are required, which can be more efficient than generating all estimated ratings and then extracting the highest ones. Function recommendHPR() from package rrecsys recommends the highest predicted ratings for each user:
recommendHPR(r_model, topN = 1) # Recommending only the highest predicted rating of each user
## [[1]]
## [1] 2
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 5
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 4
##
## [[7]]
## [1] 1
##
## [[8]]
## [1] 4
##
## [[9]]
## [1] 7
##
## [[10]]
## [1] 3
##
## [[11]]
## [1] 9
##
## [[12]]
## [1] 3
##
## [[13]]
## [1] 7
##
## [[14]]
## [1] 8
##
## [[15]]
## [1] 5
##
## [[16]]
## [1] 6
##
## [[17]]
## [1] 8
##
## [[18]]
## [1] 7
##
## [[19]]
## [1] 8
##
## [[20]]
## [1] 5
Many other recommendation strategies that are also based on the concept of nearest neighbours can be used in practical applications. For instance, the user-item matrix can be binarised by setting the ratings greater than a user-defined threshold equal to 1 and the other entries equal to zero. Then, each user \(u\) can be recommended the most frequent item(s) in \(u\)’s neighbourhood, among those items that have not been rated by \(u\); a small sketch of this idea is shown below. We then briefly discuss an additional strategy that is also based on the idea of nearest neighbours, but that operates on items rather than on users as a reference for neighbourhood computation and prediction.
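A minimal sketch of this binarised strategy follows (illustrative only; the threshold of 3 and the neighbourhood size are arbitrary choices, and myratings is the matrix generated earlier in this section):

bin <- (myratings > 3) * 1                               # 1 = 'liked', 0 = no information
u <- 1; k <- 5                                           # illustrative choices
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
others  <- setdiff(1:nrow(bin), u)
sims    <- sapply(others, function(v) cos_sim(bin[u, ], bin[v, ]))
nn      <- others[order(sims, decreasing = TRUE)][1:k]   # the k most similar users to u
unrated <- which(myratings[u, ] == 0)                    # items not rated by user u
likes   <- colSums(bin[nn, unrated, drop = FALSE])       # how often the neighbours like each unrated item
unrated[which.max(likes)]                                # item to recommend to user u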
Item-Based KNN is structurally very similar to User-Based KNN, but it works taking items rather than users as a reference for neighbourhood computation and ratings prediction. We can think of Item-Based KNN as the User-Based KNN method operating on columns rather than on rows of the user-item matrix. The assumption behind this approach is that users will prefer items that are similar to other items that they like. For more details, discussions and examples we refer the student, e.g., to (Hahsler 2017) and its vignette.
Exercise: Item-Based KNN can be selected by setting alg = "IBKNN" in function rrecsys(). Reproduce the previous analyses for data set myratings now using Item-Based KNN instead of User-Based KNN. Use the same neighbourhood size, \(k = 5\).
In the previous examples and exercises we used a neighbourhood size \(k = 5\). Since a user-item matrix with known ratings is available, in practice we can perform model selection by varying \(k\) and choosing the value that provides the most accurate results. We can apply cross-validation in a way similar to classification (but operating on matrix entries, i.e., splitting user-item pairs across multiple folds, rather than users or items), using some quality measure for assessment of recommendation results. Two of the most usual measures in recommendation are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) (Hahsler 2017), which are standard measures of (absolute or squared) errors averaged over the predicted values, as also usual in regression analysis. The following code splits the data into 2 cross-validated folds using function evalModel() from package rrecsys, then it uses function evalPred to apply the Item-Based KNN method with \(k = 5\) to model the data using the training folds and predict the ratings in the test folds (only 2 folds have been used because the data set is very small):
set.seed(0)
CV_Model <- evalModel(data = my_data_object, folds = 2) # Cross-Validation Model with 2 Folds
IBEvaluation <- evalPred(CV_Model, "IBKNN", simFunct = "cos", neigh = 5)
## Train set 1 / 2 created with 20 users, 10 items and 62 ratings.
## Neighborhood calculated in: 0 seconds.
##
## Fold: 1 / 2 elapsed. Time: 0.00404191
##
## Train set 2 / 2 created with 20 users, 10 items and 64 ratings.
## Neighborhood calculated in: 0.001002073 seconds.
##
## Fold: 2 / 2 elapsed. Time: 0.03007817
##
## The model was trained on the dataset using IBKNN algorithm.
## The algorithm was configured with the following neighborhood width: 5
IBEvaluation["Average",c("MAE","RMSE")] # Returns MAE and RMSE averaged over the 2 test folds
## MAE RMSE
## Average 1.390523 1.498126
Notice that the result, IBEvaluation, contains both the MAE and RMSE values averaged over the test folds.
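For reference, both measures are computed over the set \(P\) of held-out (test) user-item pairs in the standard way (see e.g. Hahsler 2017):

\[MAE = \frac{1}{|P|} \sum_{(u,i) \in P} \left| r_{ui} - \hat{r}_{ui} \right|, \qquad RMSE = \sqrt{\frac{1}{|P|} \sum_{(u,i) \in P} \left( r_{ui} - \hat{r}_{ui} \right)^2}\]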
Exercise: Repeat the cross-validation process above for different values of \(k\), and plot both MAE and RMSE as a function of \(k\) (from 1 to 9). What is the best value of \(k\) according to each measure? Use set.seed(0). Notice that the smaller the MAE and RMSE values the better (more accurate) the results.
Exercise: Repeat the previous exercise but now for User-Based KNN instead of Item-Based KNN.
ADDITIONAL ACTIVITY: Repeat the previous exercise, but now with the much larger, real-world data set mlLatest100k (Movielens Latest), which is available as part of the rrecsys package. According to the official package documentation [Package rrecsys PDF Guide]:
“This dataset is a 5-star rating dataset from MovieLens, a movie recommendation service of the GroupLens research group at the University of Minnesota. It contains 100234 ratings across 8927 movies. The data was created by 718 users between March 26, 1996 and August 05, 2015… generated on August 06, 2015. Users were selected at random for inclusion. All selected users had rated at least 20 movies. The data is edited and structured as a matrix and distributed as such… Source: http://grouplens.org/datasets/movielens/latest/”
Notes:
- Use 10 cross-validation folds (set folds to 10 in function evalModel());
- Vary the neighbourhood size (neigh in evalPred()) as \(k = 1, 5, 10, 15, 20, \cdots, 45, 50\).

Complete the week 6 online quiz in Blackboard Learn Ultra.
Hahsler, Michael. 2017. Recommenderlab: Lab for Developing and Testing Recommender Algorithms. http://lyle.smu.edu/IDA/recommenderlab/.
Tan, P.-N., M. Steinbach, and V. Kumar. 2006. Introduction to Data Mining. Addison Wesley.