Association Rules
Association rule mining is the process of finding patterns, associations, and correlations among sets of items in a database. Each association rule has an antecedent and a consequent. An association rule is a pattern of the form X & Y ⇒ Z [support, confidence], where X, Y, and Z are items in the dataset. The left-hand side of the rule, X & Y, is called the antecedent and the right-hand side, Z, is called the consequent. The rule states that given X and Y there is some association with Z. Confidence and support are two measures used to determine the certainty or usefulness of each rule. Support is the probability that a transaction in the dataset contains both the antecedent and the consequent of the rule, or P(X ∪ Y ∪ Z). Confidence is the probability that a transaction containing the antecedent also contains the consequent, or P(Z | X ∧ Y). Typically an association rule is called strong if it satisfies both a minimum support threshold and a minimum confidence threshold determined by the user (Han).
Assume a sports equipment store would like to determine whether there are any association rules among the items bought together during a single purchase. The store collected the data in Table 1:
Transaction | Items
----------- | ---------------------------
1 | jersey, mouth guard, pants
2 | jersey, pants
3 | jacket, jersey
4 | hat, mouth guard, stick

Table 1: Transactions of sports equipment
A rule you might find is jersey ⇒ pants. The support for this rule is the probability that a transaction contains jersey and pants together; this occurs in two of the four transactions (2/4), or 50%. The confidence is the probability that a transaction containing jersey also contains pants. Three transactions contain jersey and two contain both jersey and pants, so the confidence is 2/3, or 66%. Written in complete form, the rule is jersey ⇒ pants [50%, 66%].
If instead pants were chosen as the antecedent, we would have the rule pants ⇒ jersey [50%, 100%]. The confidence is now 100% because every transaction containing pants also contains jersey. By setting minimums for both support and confidence we can limit the rules to only those of significance. These association rules are easy to establish by hand on a small dataset, but doing so becomes far more difficult as the dataset grows.
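To make the two measures concrete, here is a minimal sketch in Python that computes them over the Table 1 transactions. The function names support and confidence are illustrative, not from any particular library.

```python
# Transactions from Table 1, each represented as a set of items.
transactions = [
    {"jersey", "mouth guard", "pants"},
    {"jersey", "pants"},
    {"jacket", "jersey"},
    {"hat", "mouth guard", "stick"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent): support of the union over support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"jersey", "pants"}, transactions))      # 0.5      -> 50%
print(confidence({"jersey"}, {"pants"}, transactions))  # 0.666... -> 66%
print(confidence({"pants"}, {"jersey"}, transactions))  # 1.0      -> 100%
```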
Some basic terms and concepts need to be understood before examining data mining on large datasets. An itemset is a set of items. An itemset containing k items is a k-itemset; the set {A, B}, for example, is a 2-itemset. The occurrence frequency of an itemset is simply the number of transactions that contain the itemset; it is sometimes referred to as the frequency, support count, or count of the itemset. The minimum support count is the number of transactions required for an itemset to satisfy minimum support, and it equals the product of the total number of transactions in the dataset and the user-defined minimum support. Any itemset satisfying minimum support is considered a frequent itemset, and the set of frequent k-itemsets is commonly denoted Lk (Han).
Going back to the example in Table 1, we can obtain the following values. The occurrence frequency of the itemset {jersey} is 3 because it occurs in 3 of the transactions in the database. The minimum support count equals total transactions × minimum support, which here is 4 × 50% = 2, since we previously chose a minimum support threshold of 50%. The 1-itemset {jersey} therefore satisfies minimum support, so it is a frequent itemset and is contained in L1.
Association rule mining on large datasets is broken down into two steps. First, find all the itemsets that occur at least as frequently as the predetermined minimum support count; this step produces the lists of frequent itemsets L1 through Lk. Second, generate strong association rules from the frequent itemsets, a rule being strong if it satisfies both minimum support and minimum confidence. The overall performance of mining association rules is dominated by the first step, finding the frequent itemsets. An important but simple algorithm for finding frequent itemsets is the Apriori algorithm (Han).
Apriori Algorithm
The Apriori algorithm is a basic algorithm for finding frequent itemsets in a set of data by using candidate generation. Apriori uses an iterative approach known as a level-wise search because the frequent k-itemsets are used to determine the frequent (k + 1)-itemsets. The search begins with the set of frequent 1-itemsets, denoted L1. L1 is then used to find the set of frequent 2-itemsets, L2, which is used to find L3, and so on. This continues until no more frequent k-itemsets can be found (Han).
To improve the efficiency of a level-wise generation, the Apriori algorithm uses the Apriori property, which states that all nonempty subsets of a frequent itemset are also frequent itemsets. So, if {A, B} is a frequent itemset, then the subsets {A} and {B} are also frequent itemsets. The level-wise search uses this property when stepping from one level to the next. If an itemset I does not satisfy minimum support, then I is not a frequent itemset. If item A is added to I, the new itemset I ∪ {A} cannot occur more frequently than the original itemset I, so if an itemset fails the frequency test then all supersets of that itemset fail the same test. The Apriori algorithm uses this property to decrease the number of itemsets in the candidate list, thereby reducing search time. As the Apriori algorithm steps from finding Lk-1 to finding Lk it uses a two-step process consisting of the Join Step and the Prune Step (Han).
The first step is the Join Step, which generates a set of candidate k-itemsets, denoted Ck, from Lk-1 by joining Lk-1 with itself. Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join of Lk-1 with Lk-1 is performed only between itemsets that have their first (k-2) items in common. Suppose itemsets I1 and I2 are members of Lk-1, where I1[1] is the first item in I1 and I1[k-1] is its last item, and likewise for I2. They are joined with each other if (I1[1] = I2[1] and I1[2] = I2[2] and … and I1[k-2] = I2[k-2] and I1[k-1] < I2[k-1]). The condition checks that the first k-2 items are equal and then that the last items are unequal, which eliminates duplicate candidate k-itemsets. The new candidate k-itemset generated from joining I1 with I2 is {I1[1], I1[2], …, I1[k-1], I2[k-1]} (Han).
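Here is a minimal sketch of the Join Step, assuming each itemset in Lk-1 is stored as a lexicographically sorted tuple; the function name apriori_join is illustrative.

```python
def apriori_join(L_prev, k):
    """Join L_{k-1} with itself to produce the candidate k-itemsets C_k."""
    candidates = []
    for i1 in L_prev:
        for i2 in L_prev:
            # First k-2 items must match, and i1's last item must precede
            # i2's, so each candidate is generated exactly once.
            if i1[:k - 2] == i2[:k - 2] and i1[-1] < i2[-1]:
                candidates.append(i1 + (i2[-1],))
    return candidates

L2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
print(apriori_join(L2, 3))  # [('B', 'C', 'E')]
```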
The second step is the Prune Step, which converts Ck to Lk. The candidate list Ck contains all of the frequent k-itemsets, but it also contains k-itemsets that do not satisfy the minimum support count. A scan of the database determines the occurrence frequency of every candidate k-itemset, which becomes very costly as Ck grows large. To reduce the size of Ck, the Apriori property is used: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Therefore, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate can be removed from Ck because it cannot be frequent. This check can be done quickly by keeping all frequent itemsets in a hash tree. The resulting set of frequent k-itemsets is denoted Lk (Han).
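The subset test of the Prune Step can be sketched as follows, again assuming sorted tuples (a plain set stands in for the hash tree here); the database scan to count the surviving candidates would still follow.

```python
from itertools import combinations

def apriori_prune(candidates, L_prev):
    """Drop any candidate with an infrequent (k-1)-subset (Apriori property)."""
    frequent_prev = set(L_prev)
    return [c for c in candidates
            if all(s in frequent_prev for s in combinations(c, len(c) - 1))]

C3 = [("B", "C", "E")]
L2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
print(apriori_prune(C3, L2))  # [('B', 'C', 'E')] -- all 2-subsets are in L2
```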
Assume the same store owner wishes to find associations among the items bought from his store. For clarity, letters stand in for item names. Given the data in Table 2, let us run step by step through the Apriori algorithm with the minimums for confidence and support both set to 50%.
Transaction | Item List
----------- | ----------
1 | A, C, D
2 | B, C, E
3 | B, E
4 | A, B, C, E

Table 2: Transaction list in lexicographic order
For the first iteration of the algorithm, each item becomes a member of the set of candidate 1-itemsets, C1 (Fig. 1). The algorithm then scans all of the transactions to count the number of occurrences of each item. With the predetermined minimum support of 50%, the minimum support count is 50% × 4 transactions = 2. Only candidate 1-itemsets satisfying minimum support are kept to form the set of frequent 1-itemsets, L1.
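This first pass can be sketched as follows; the counts shown match the walkthrough.

```python
from collections import Counter

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"B", "E"}, {"A", "B", "C", "E"}]
min_count = 2  # 50% of 4 transactions

# C1: occurrence frequency of every individual item.
C1 = Counter(item for t in transactions for item in t)
# L1: items meeting the minimum support count.
L1 = sorted(item for item, count in C1.items() if count >= min_count)
print(dict(C1))  # {'A': 2, 'B': 3, 'C': 3, 'D': 1, 'E': 3} (order may vary)
print(L1)        # ['A', 'B', 'C', 'E'] -- D occurs only once and is dropped
```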
On the second iteration, the algorithm generates the set of frequent 2-itemsets, L2. It does this by joining L1 with itself to generate C2, a candidate set of 2-itemsets. Again, the transactions in the dataset are scanned to determine the support count for each candidate 2-itemset in C2. All candidate 2-itemsets in C2 that satisfy minimum support are included in the set of frequent 2-itemsets, L2 (Han).
The next iteration begins by generating the set of candidate 3-itemsets, C3, joining L2 with itself by combining all itemsets whose first k-2 items (here, the first item) match. This occurs with the itemsets {B, C} and {B, E} because B = B and C < E, and the resulting candidate is {B, C, E}. The occurrence frequency of this itemset is counted, and because it satisfies the minimum support count it is included in the set of frequent 3-itemsets, L3 (Han).
[Figure 1: Demonstrates the Apriori algorithm's step-by-step progression C1 → L1 → C2 → L2 → C3 → L3.]
The algorithm attempts another iteration, but because no candidate 4-itemsets can be generated, C4 is an empty set. At this point the algorithm terminates, having found all of the frequent itemsets (Han).
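Putting the join, prune, and scan steps together, here is a minimal end-to-end sketch of the level-wise search on the Table 2 data. It follows the same sorted-tuple convention as the earlier sketches and is an illustration, not Han and Kamber's published pseudocode.

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_count):
    """Return every frequent itemset (as a sorted tuple) with its support count."""
    # Level 1: count individual items directly.
    counts = Counter((item,) for t in transactions for item in t)
    L = sorted(s for s, n in counts.items() if n >= min_count)
    frequent = {s: counts[s] for s in L}
    k = 2
    while L:
        # Join: combine itemsets sharing their first k-2 items.
        C = [i1 + (i2[-1],) for i1 in L for i2 in L
             if i1[:k - 2] == i2[:k - 2] and i1[-1] < i2[-1]]
        # Prune: drop candidates with an infrequent (k-1)-subset.
        C = [c for c in C if all(s in frequent for s in combinations(c, k - 1))]
        # Scan: count each surviving candidate against the database.
        counts = Counter(c for c in C for t in transactions if set(c) <= t)
        L = sorted(s for s in C if counts[s] >= min_count)
        frequent.update({s: counts[s] for s in L})
        k += 1
    return frequent

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"B", "E"}, {"A", "B", "C", "E"}]
for itemset, count in sorted(apriori(transactions, 2).items()):
    print(itemset, count)
# Nine frequent itemsets in all, including ('B', 'C', 'E') with count 2.
```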
The generation of strong association rules is rather straightforward once all of the frequent itemsets have been found for the database. To be considered strong, an association rule must satisfy both minimum support and minimum confidence. Minimum support is automatically satisfied because all of the association rules are generated from the frequent itemsets. The confidence must be computed for each rule to ensure it satisfies the minimum confidence. The confidence of the rule A ⇒ B is the conditional probability P(B | A), which equals the support count of (A ∪ B) divided by the support count of A (Han).
Based on this equation we can generate association rules. For every frequent itemset l, generate all nonempty proper subsets of l. For each such subset s, output the rule s ⇒ (l - s); it is a strong association rule only if the support count of l divided by the support count of s satisfies the minimum confidence previously determined for the association rules. This is done for every subset s of l and for every frequent itemset l (Han).
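A minimal sketch of this procedure follows, using the support counts found by the mining step above; the dictionary and the name rules_from are illustrative.

```python
from itertools import combinations

# Support counts of the frequent itemsets found for Table 2.
support_counts = {
    ("A",): 2, ("B",): 3, ("C",): 3, ("E",): 3,
    ("A", "C"): 2, ("B", "C"): 2, ("B", "E"): 3, ("C", "E"): 2,
    ("B", "C", "E"): 2,
}

def rules_from(itemset, min_conf):
    """Yield every rule s => (itemset - s) that meets the minimum confidence."""
    for r in range(1, len(itemset)):  # every nonempty proper subset of itemset
        for s in combinations(itemset, r):
            consequent = tuple(x for x in itemset if x not in s)
            conf = support_counts[itemset] / support_counts[s]
            if conf >= min_conf:
                yield s, consequent, conf

for s, c, conf in rules_from(("B", "C", "E"), 0.5):
    print(f"{s} => {c}  confidence = {conf:.0%}")
# e.g. ('B', 'C') => ('E',) at 100% and ('B', 'E') => ('C',) at 67% (2/3).
```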
Using the example from Table 2, we will now generate the association rules, compute the confidence for each rule, and eliminate any rules that do not meet the minimum confidence previously set at 50%.
A ⇒ C confidence = 2/2 = 100%
C ⇒ A confidence = 2/3 = 66%
B ⇒ C confidence = 2/3 = 66%
C ⇒ B confidence = 2/3 = 66%
B ⇒ E confidence = 3/3 = 100%
E ⇒ B confidence = 3/3 = 100%
C ⇒ E confidence = 2/3 = 66%
E ⇒ C confidence = 2/3 = 66%
B & C ⇒ E confidence = 2/2 = 100%
B & E ⇒ C confidence = 2/3 = 66%
C & E ⇒ B confidence = 2/2 = 100%
B ⇒ C & E confidence = 2/3 = 66%
C ⇒ B & E confidence = 2/3 = 66%
E ⇒ B & C confidence = 2/3 = 66%
All of these association rules are considered strong because they meet the minimum confidence and minimum support, both of which were set at 50%.
Bibliography
Han, Jiawei, and Micheline Kamber. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.