Association Rules

Download 23.96 Kb.

Date	07.08.2017
Size	23.96 Kb.
	#28699

Association Rules

Association rule mining is the process of finding patterns, associations and correlations among sets of items in a database. The association rules generated have an antecedent and a consequent. An association rule is a pattern of the form X & Y  Z [support, confidence], where X, Y, and Z are items in the dataset. The left hand side of the rule X & Y is called the antecedent of the rule and the right hand side Z is called the consequent of the rule. This means that given X and Y there is some association with Z. Within the dataset, confidence and support are two measures to determine the certainty or usefulness for each rule. Support is the probability that a set of items in the dataset contains both the antecedent and consequent of the rule or P (X  Y  Z). Confidence is the probability that a set of items containing the antecedent also contains the consequent or P (Z X  Y). Typically an association rule is called strong if it satisfies both a minimum support threshold and a minimum confidence threshold that is determined by the user (Han).

Assume a sports equipment store would like to determine if there were any association rules between the items bought together during a single purchase. They collected this data:

Transaction	Items
1	jersey, mouth guard, pants
2	jersey, pants
3	jacket, jersey
4	hat, mouth guard, stick

Table 1: Transactions of sports equipment
A rule you might find would be that jersey  pants. Support for this rule is the probability that a transaction contains jersey and pants together. Here it occurs two out of four times (2/4) which is 50%. The confidence is the probability of a transaction containing jersey also contains pants. There are three transactions containing jersey and two containing jersey and pants so the confidence is 2/3 or 66%. Therefore this rule written in complete form is jersey  pants [50%, 66%].

If instead the pants were chosen to be the antecedent we would have the rule pants  jersey [50%, 100%]. In this case the confidence is now 100% because in every transaction containing pants there is also the jersey with it. Setting minimums for both support and confidence we can limit the rules to only those of significance. These association rules are rather easy to establish because of the small dataset but it becomes more difficult as the dataset grows.

Some basic terms and concepts need to be understood in order to further understand data mining on large datasets. An itemset is defined as a set of items. An itemset containing k items is a k-itemset. Therefore the set {A, B} is a 2-itemset. The occurrence frequency of an itemset is simply the number of transactions that contain the itemset. It is sometimes referred to as the frequency, support count, or count of the itemset. The minimum support count is defined as the number of transactions required for the itemset to satisfy minimum support. The minimum support count is equal to the product of the total number of transactions in the dataset and the user-defined minimum support. Any itemsets satisfying the minimum support is considered a frequent itemset and the set of k-itemsets is commonly denoted by L_k (Han).

Going back to the example from Figure 1 we can obtain the following values. The occurrence frequency of the itemset, { jersey}, would equal 3 because it occurs in 3 of the transactions in the database. The minimum support count is equal to the total transactions x support or in the case above it would be 4 x 50% = 2. This is because we previously wanted a minimum support threshold of 50%. Therefore the 1-itemset, {jersey}, would satisfy the minimum support and therefore it would be considered a frequent itemset and contained in L₁.

Association rule mining on large datasets is broken down into two steps. First, find all the itemsets that occur at least as frequently as the pre-determined minimum support count. This step will produce k lists of frequent itemsets from L₁ to L_k. The second step is to generate strong association rules from the frequent itemsets. An association rule is considered strong if it satisfies both a minimum support and minimum confidence. The overall performance of mining association rules is determined by finding the frequent itemsets in the first step. An important but simple algorithm for finding frequent itemsets is the Apriori Algorithm (Han).
Apriori Algorithm

The Apriori algorithm is a basic algorithm for finding frequent itemsets from a set of data by using candidate generation. Apriori uses an iterative approach known as a level-wise search because the k-itemsets is used to determine the (k + 1)-itemsets. The search begins for the set of frequent 1-itemsets denoted L₁. L₁ is then used to find the set of frequent 2-iemsets, L₂. L₂ is then used to find L₃ and so on. This continues until no more frequent k-itemsets can be found (Han).

To improve efficiency of a level-wise generation the Apriori algorithm uses the Apriori property. The Apriori property states that all nonempty subsets of a frequent itemset are also a frequent itemsets. So, if {A, B} is a frequent itemset then subsets {A} and {B} are also frequent itemsets. The level-wise search uses this Apriori property when stepping from level to the next. If an itemset I does not satisfy the minimal support then I will not be considered a frequent itemset. If item A is added to the itemset I then the new itemset I  A cannot occur more frequently then the original itemset I. If an itemset fails to be considered a frequent itemset then all supersets of that itemset will also fail that same test. The Apriori algorithm uses this property to decrease the number of itemsets in the candidate list therefore optimizing search time. As the Apriori algorithm steps from finding L_k-₁ to finding L_k it uses a two-step process consisting of the Join Step and the Prune Step (Han).

The first step is the Join Step and it is responsible for generating a set of candidate k-itemsets denoted C_k from L_k_-1. It does this by joining L_k_-1 with itself. Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The joining L_k-1 to L_k-1 is only performed between itemsets that have the first (k-2) items in common with each other. Suppose itemsets I₁ and I₂ are members of L_k_-1. They will be joined with each other if (I₁[1] = I₂[1] and I₁[2] = I₂[2] and … and I₁[k-2] = I₂[k-2] and I₁[k-1] < I₂[k-1]). Where I₁[1] is the first item in itemset I₁ and I₁[k-1] is the last item in I₁ and so on for I₂. It checks to make sure all k-2 items are equal and then lastly makes sure the last item in the itemset is unequal in order to eliminate duplicate candidate k-itemsets. The new candidate k-itemset generated from joining I₁ with I₂ would be I₁[1] I₁[2]… I₁[k-1]I₂[k-1] (Han).

The second step is the Prune Step and converts C_k to L_k. The candidate list C_k contains all of the frequent k-itemsets but it also contains k-itemsets that do not satisfy the minimum support count. The scan of the database will determine the occurrence frequency of every candidate k-itemset to determine if it satisfies the minimum support. This will be very costly as C_k becomes large. To reduce the size of C_k the Apriori property is used. The Apriori property states that any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Therefore if any (k-1)-subset of a candidate k-itemset is not in L_k_-1 then it can be removed from C_k because it too cannot be frequent. This can be done rather quickly by utilizing a hash tree containing all frequent itemsets. The resulting set of frequent k-itemsets is denoted L_k (Han).

Assume there is that same storeowner who wishes to find associations with items bought from his store. Letters will be substituted for the items from Table 1 for better understanding and clarity. Given the following data in Table 2 let us run step-by-step through the Apriori algorithm and set the minimum for confidence and support to 50%.

Transaction	Item List
1	A, C, D
2	B, C, E
3	B, E
4	A, B, C, E

Table 2: Transaction list in lexicographic order
For the first iteration of the algorithm each item becomes a member of the set of candidate 1-itemsets C₁ (Fig. 1). The algorithm then scans all of the transactions to count the number of occurrences of each item. Using the predetermined minimum support of 50% the minimum support would therefore be 50% x 4 transactions which equals 2. Only candidate 1-itemsets satisfying minimum support will be kept to generate the set of frequent 1-itemsets L₁.

On the second iteration, the algorithm is attempting to generate the set of frequent 2-itemsets L₂. It does this by joining L₁ with itself to generate C₂, a candidate set of 2-itemsets. Again, the transactions in the dataset are scanned to determine the support count for each candidate 2-itemset in C₂. All candidate 2-itemsts in C₂ that satisfy minimum support will be included in the set of frequent 2-itemsets L₂ (Han).

The next iteration begins with generating the set of candidate 3-itemsets, C₃, by joining L₂ to itself by joining all itemsets that have the first item similar to each other. This occurs with the itemsets {B, C} and {B, E} because B = B and C < E. The resulting candidate list is therefore {B, C, E}. The occurrence frequency of the itemset is counted and if it satisfies the minimum support count it will be included in the set of frequent 3-itemsets, L₃ (Han).

C₁ L₁ C₂

L₂ C₃ L₃

Figure 1: Demonstrates Aprior Algorithm

The algorithm attempts another iteration but because it cannot generate any candidate 4-itemets C₄ will be an empty set. At this point the algorithm will terminate having found all of the frequent itemsets (Han).

The generation of strong association rules is now rather straightforward now that all of the frequent itemsets have been found for the database. To be considered a strong association rule it must satisfy both minimum support and minimum confidence. The minimum support is automatically satisfied because all of the association rules are generated from the frequent itemsets. The confidence must be computed for each rule to ensure it satisfies the minimum confidence. The confidence for the rule A  B is the conditional probability P (B | A) which is equal to the support count of (A  B) divided by the support count of (A) (Han).

Based on this equation we can generate association rules. For every frequent itemset l generate all nonempty subsets of l. Taking all the nonempty subsets of each l output the following rules where s is a subset of l. A strong association rule is defined as s  (l – s) only if the support count of (l) divided by the support count of (s) satisfies the minimum confidence previously determined for the association rules. This is done for every subset (s) of l and for every frequent itemset l (Han).

Using the previous example from Figure 1 we will generate the association rules for this example. We will also compute the confidence for each rule and eliminate any rules that do not meet the minimum confidence previously set at 50%.

A  C confidence = 2/2 = 100%

C  A confidence = 2/3 = 66%

B  C confidence = 2/3 = 66%

C  B confidence = 2/3 = 66%

B  E confidence = 3/3 = 100%

E  B confidence = 3/3 = 100%

B & C  E confidence = 2/2 = 100%

B & E  C confidence = 2/3 = 60%

C & E  B confidence = 2/2 = 100%
All of these association rules are considered strong because they meet the minimum confidence and support which was set at 50%.

Bibliography

Han, Jiawei and Micheline Kamber. 2001. Data Mining: Concepts and Techniques. Morgan Kaufman Publishers.

Download 23.96 Kb.

Share with your friends: