Association Rules
Association rule mining is the process of finding patterns, associations and correlations among sets of items in a database. The association rules generated have an antecedent and a consequent. An association rule is a pattern of the form X & Y Z [support, confidence], where X, Y, and Z are items in the dataset. The left hand side of the rule X & Y is called the antecedent of the rule and the right hand side Z is called the consequent of the rule. This means that given X and Y there is some association with Z. Within the dataset, confidence and support are two measures to determine the certainty or usefulness for each rule. Support is the probability that a set of items in the dataset contains both the antecedent and consequent of the rule or P (X Y Z). Confidence is the probability that a set of items containing the antecedent also contains the consequent or P (Z X Y). Typically an association rule is called strong if it satisfies both a minimum support threshold and a minimum confidence threshold that is determined by the user (Han).
Assume a sports equipment store would like to determine if there were any association rules between the items bought together during a single purchase. They collected this data:
Transaction

Items

1

jersey, mouth guard, pants

2

jersey, pants

3

jacket, jersey

4

hat, mouth guard, stick

Table 1: Transactions of sports equipment
A rule you might find would be that jersey pants. Support for this rule is the probability that a transaction contains jersey and pants together. Here it occurs two out of four times (2/4) which is 50%. The confidence is the probability of a transaction containing jersey also contains pants. There are three transactions containing jersey and two containing jersey and pants so the confidence is 2/3 or 66%. Therefore this rule written in complete form is jersey pants [50%, 66%].
If instead the pants were chosen to be the antecedent we would have the rule pants jersey [50%, 100%]. In this case the confidence is now 100% because in every transaction containing pants there is also the jersey with it. Setting minimums for both support and confidence we can limit the rules to only those of significance. These association rules are rather easy to establish because of the small dataset but it becomes more difficult as the dataset grows.
Some basic terms and concepts need to be understood in order to further understand data mining on large datasets. An itemset is defined as a set of items. An itemset containing k items is a kitemset. Therefore the set {A, B} is a 2itemset. The occurrence frequency of an itemset is simply the number of transactions that contain the itemset. It is sometimes referred to as the frequency, support count, or count of the itemset. The minimum support count is defined as the number of transactions required for the itemset to satisfy minimum support. The minimum support count is equal to the product of the total number of transactions in the dataset and the userdefined minimum support. Any itemsets satisfying the minimum support is considered a frequent itemset and the set of kitemsets is commonly denoted by L_{k} (Han).
Going back to the example from Figure 1 we can obtain the following values. The occurrence frequency of the itemset, { jersey}, would equal 3 because it occurs in 3 of the transactions in the database. The minimum support count is equal to the total transactions x support or in the case above it would be 4 x 50% = 2. This is because we previously wanted a minimum support threshold of 50%. Therefore the 1itemset, {jersey}, would satisfy the minimum support and therefore it would be considered a frequent itemset and contained in L_{1}.
Association rule mining on large datasets is broken down into two steps. First, find all the itemsets that occur at least as frequently as the predetermined minimum support count. This step will produce k lists of frequent itemsets from L_{1} to L_{k}. The second step is to generate strong association rules from the frequent itemsets. An association rule is considered strong if it satisfies both a minimum support and minimum confidence. The overall performance of mining association rules is determined by finding the frequent itemsets in the first step. An important but simple algorithm for finding frequent itemsets is the Apriori Algorithm (Han).
Apriori Algorithm
The Apriori algorithm is a basic algorithm for finding frequent itemsets from a set of data by using candidate generation. Apriori uses an iterative approach known as a levelwise search because the kitemsets is used to determine the (k + 1)itemsets. The search begins for the set of frequent 1itemsets denoted L_{1}. L_{1} is then used to find the set of frequent 2iemsets, L_{2}. L_{2} is then used to find L_{3} and so on. This continues until no more frequent kitemsets can be found (Han).
To improve efficiency of a levelwise generation the Apriori algorithm uses the Apriori property. The Apriori property states that all nonempty subsets of a frequent itemset are also a frequent itemsets. So, if {A, B} is a frequent itemset then subsets {A} and {B} are also frequent itemsets. The levelwise search uses this Apriori property when stepping from level to the next. If an itemset I does not satisfy the minimal support then I will not be considered a frequent itemset. If item A is added to the itemset I then the new itemset I A cannot occur more frequently then the original itemset I. If an itemset fails to be considered a frequent itemset then all supersets of that itemset will also fail that same test. The Apriori algorithm uses this property to decrease the number of itemsets in the candidate list therefore optimizing search time. As the Apriori algorithm steps from finding L_{k}_{1} to finding L_{k} it uses a twostep process consisting of the Join Step and the Prune Step (Han).
The first step is the Join Step and it is responsible for generating a set of candidate kitemsets denoted C_{k} from L_{k}_{1}. It does this by joining L_{k}_{1} with itself. Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The joining L_{k1} to L_{k1} is only performed between itemsets that have the first (k2) items in common with each other. Suppose itemsets I_{1} and I_{2} are members of L_{k}_{1}. They will be joined with each other if (I_{1}[1] = I_{2}[1] and I_{1}[2] = I_{2}[2] and … and I_{1}[k2] = I_{2}[k2] and I_{1}[k1] < I_{2}[k1]). Where I_{1}[1] is the first item in itemset I_{1} and I_{1}[k1] is the last item in I_{1} and so on for I_{2}. It checks to make sure all k2 items are equal and then lastly makes sure the last item in the itemset is unequal in order to eliminate duplicate candidate kitemsets. The new candidate kitemset generated from joining I_{1} with I_{2} would be I_{1}[1] I_{1}[2]… I_{1}[k1]I_{2}[k1] (Han).
The second step is the Prune Step and converts C_{k} to L_{k}. The candidate list C_{k} contains all of the frequent kitemsets but it also contains kitemsets that do not satisfy the minimum support count. The scan of the database will determine the occurrence frequency of every candidate kitemset to determine if it satisfies the minimum support. This will be very costly as C_{k} becomes large. To reduce the size of C_{k} the Apriori property is used. The Apriori property states that any (k1)itemset that is not frequent cannot be a subset of a frequent kitemset. Therefore if any (k1)subset of a candidate kitemset is not in L_{k}_{1} then it can be removed from C_{k} because it too cannot be frequent. This can be done rather quickly by utilizing a hash tree containing all frequent itemsets. The resulting set of frequent kitemsets is denoted L_{k} (Han).
Assume there is that same storeowner who wishes to find associations with items bought from his store. Letters will be substituted for the items from Table 1 for better understanding and clarity. Given the following data in Table 2 let us run stepbystep through the Apriori algorithm and set the minimum for confidence and support to 50%.
Transaction

Item List

1

A, C, D

2

B, C, E

3

B, E

4

A, B, C, E

Table 2: Transaction list in lexicographic order
For the first iteration of the algorithm each item becomes a member of the set of candidate 1itemsets C_{1} (Fig. 1). The algorithm then scans all of the transactions to count the number of occurrences of each item. Using the predetermined minimum support of 50% the minimum support would therefore be 50% x 4 transactions which equals 2. Only candidate 1itemsets satisfying minimum support will be kept to generate the set of frequent 1itemsets L_{1}.
On the second iteration, the algorithm is attempting to generate the set of frequent 2itemsets L_{2}. It does this by joining L_{1} with itself to generate C_{2}, a candidate set of 2itemsets. Again, the transactions in the dataset are scanned to determine the support count for each candidate 2itemset in C_{2}. All candidate 2itemsts in C_{2} that satisfy minimum support will be included in the set of frequent 2itemsets L_{2} (Han).
The next iteration begins with generating the set of candidate 3itemsets, C_{3}, by joining L_{2} to itself by joining all itemsets that have the first item similar to each other. This occurs with the itemsets {B, C} and {B, E} because B = B and C < E. The resulting candidate list is therefore {B, C, E}. The occurrence frequency of the itemset is counted and if it satisfies the minimum support count it will be included in the set of frequent 3itemsets, L_{3} (Han).
C_{1} L_{1 } C_{2}
L_{2} C_{3} L_{3}
Figure 1: Demonstrates Aprior Algorithm
The algorithm attempts another iteration but because it cannot generate any candidate 4itemets C_{4} will be an empty set. At this point the algorithm will terminate having found all of the frequent itemsets (Han).
The generation of strong association rules is now rather straightforward now that all of the frequent itemsets have been found for the database. To be considered a strong association rule it must satisfy both minimum support and minimum confidence. The minimum support is automatically satisfied because all of the association rules are generated from the frequent itemsets. The confidence must be computed for each rule to ensure it satisfies the minimum confidence. The confidence for the rule A B is the conditional probability P (B  A) which is equal to the support count of (A B) divided by the support count of (A) (Han).
Based on this equation we can generate association rules. For every frequent itemset l generate all nonempty subsets of l. Taking all the nonempty subsets of each l output the following rules where s is a subset of l. A strong association rule is defined as s (l – s) only if the support count of (l) divided by the support count of (s) satisfies the minimum confidence previously determined for the association rules. This is done for every subset (s) of l and for every frequent itemset l (Han).
Using the previous example from Figure 1 we will generate the association rules for this example. We will also compute the confidence for each rule and eliminate any rules that do not meet the minimum confidence previously set at 50%.
A C confidence = 2/2 = 100%
C A confidence = 2/3 = 66%
B C confidence = 2/3 = 66%
C B confidence = 2/3 = 66%
B E confidence = 3/3 = 100%
E B confidence = 3/3 = 100%
B & C E confidence = 2/2 = 100%
B & E C confidence = 2/3 = 60%
C & E B confidence = 2/2 = 100%
All of these association rules are considered strong because they meet the minimum confidence and support which was set at 50%.
Bibliography
Han, Jiawei and Micheline Kamber. 2001. Data Mining: Concepts and Techniques. Morgan Kaufman Publishers.
Share with your friends: 