Oracle® Data Mining Concepts 11g Release 1 (11.1) Part Number B28129-03
This chapter describes association, the unsupervised mining function for discovering association rules.
See Also:
"Unsupervised Data Mining"
Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.
Association rules are often used to analyze sales transactions. For example, it might be noted that customers who buy cereal at the grocery store often buy milk at the same time. In fact, association analysis might find that 85% of the checkout sessions that include cereal also include milk. This relationship could be formulated as the following rule.
Cereal implies milk with 85% confidence
This application of association modeling is called market-basket analysis. It is valuable for direct marketing, sales promotions, and for discovering business trends. Market-basket analysis can also be used effectively for store layout, catalog design, and cross-sell.
Association modeling has important applications in other domains as well. For example, in e-commerce applications, association rules may be used for Web page personalization. An association model might find that a user who visits pages A and B is 70% likely to also visit page C in the same session. Based on this rule, a dynamic link could be created for users who are likely to be interested in page C. The association rule could be expressed as follows.
A and B imply C with 70% confidence
See Also:
"Confidence"
Unlike other data mining functions, association is transaction-based. In transaction processing, a case consists of a transaction such as a market basket or Web session. The collection of items in the transaction is an attribute of the transaction. Other attributes might be the date, time, location, or user ID associated with the transaction.
The collection of items in the transaction is a multi-record attribute. Transactional data is said to be in multi-record case format. An example is shown in Figure 8-1.
Since Oracle Data Mining requires single-record case format, the column that holds the collection must be transformed to a nested table type prior to mining for association rules. Transactional data in single-record case format is shown in Figure 8-2.
See Also:
Figure 4-3, "Sample Build Data for Regression" and Figure 7-2, "Build Data for Clustering" for examples of single-record case format
Oracle Data Mining Application Developer's Guide for information on nested table transformation
In transactional data, a collection is associated with each case. A collection defines all possible items that can potentially be present in a transaction. Typically only a tiny subset of all possible items are included. For example, the items present in a market-basket transaction represent only a small fraction of the items available for sale in the store.
Oracle Data Mining implements collections as nested tables, as shown in Figure 8-2. Each nested row specifies an item name and a value. If the item is present, its value is 1. If the item is not present, its value is null. Many of the item values may be null, since many of the items that could be in the collection are probably not present in any individual transaction.
The null values in nested data indicate sparsity. This means that a high proportion of the item definitions in the collection are not populated. The Oracle Data Mining association algorithm is optimized for processing sparse data.
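The difference between the two case formats, and the sparsity of the result, can be sketched in a few lines of Python. This is only an illustration of the idea, not Oracle's nested table implementation: the sample rows, the `to_single_record` function, and the five-item `catalog` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical multi-record data: one (transaction_id, item) row per item.
multi_record_rows = [
    (11, "B"), (11, "D"), (11, "E"),
    (12, "A"), (12, "B"), (12, "C"), (12, "E"),
]

def to_single_record(rows):
    """Collapse (transaction_id, item) rows into one record per transaction.

    Only items that are present are stored, each with the value 1.
    """
    cases = defaultdict(dict)
    for transaction_id, item in rows:
        cases[transaction_id][item] = 1
    return dict(cases)

single_record = to_single_record(multi_record_rows)

# Hypothetical list of every item that could appear in a transaction.
catalog = ["A", "B", "C", "D", "E"]

# Expanding each record against the catalog shows the sparsity:
# an absent item yields None, which plays the role of the null value.
dense_view = {
    transaction_id: [items.get(name) for name in catalog]
    for transaction_id, items in single_record.items()
}
```

With a realistic catalog of thousands of items, almost every position in `dense_view` would be `None`, which is why storing only the present items is the more practical representation.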
See Also:
Oracle Data Mining Application Developer's Guide for information about Oracle Data Mining and sparse data
The first step in association analysis is the enumeration of frequent itemsets. A frequent itemset is a combination of items in a transaction. Oracle Data Mining uses the DBMS_FREQUENT_ITEMSETS package to count frequent itemsets.
The maximum number of items in a frequent itemset is user-specified. The minimum number is two. If the minimum is used, all the item pairs will be counted. If a number greater than two is used, all the item pairs, all the item triples, and all the item combinations up to the specified maximum will be counted.
The maximum number of items in a frequent itemset is specified by the ASSO_MAX_RULE_LENGTH setting, which also applies to the rules derived from the frequent itemsets. For details, see Oracle Database PL/SQL Packages and Types Reference.
Table 8-1 shows the frequent itemsets derived from the transactions in Figure 8-1, assuming that ASSO_MAX_RULE_LENGTH is set to 3.
Table 8-1 Frequent Itemsets
Transaction | Frequent Itemsets |
---|---|
11 | (B,D) (B,E) (D,E) (B,D,E) |
12 | (A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E) |
13 | (B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E) |
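The enumeration in Table 8-1 can be reproduced with a short Python sketch. It is an illustration only and does not use the DBMS_FREQUENT_ITEMSETS package; the function name `enumerate_itemsets` is hypothetical, and the item list for transaction 11 is inferred from the first row of the table.

```python
from itertools import combinations

def enumerate_itemsets(items, max_length):
    """Return every item combination of size 2 up to max_length.

    Pairs come first, then triples, and so on.
    """
    ordered = sorted(items)
    itemsets = []
    for size in range(2, max_length + 1):
        itemsets.extend(combinations(ordered, size))
    return itemsets

# Items of transaction 11, inferred from Table 8-1.
itemsets_11 = enumerate_itemsets({"B", "D", "E"}, max_length=3)
```

For a transaction with four items and a maximum length of 3, the same function returns six pairs and four triples, which matches the ten itemsets listed for transactions 12 and 13.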
Tip:
Decrease the maximum rule length if you want to decrease the build time for the model and generate simpler rules.
Association rules are derived from the frequent itemsets generated by the model. If rules are generated from all possible frequent itemsets, there may be a very high number of rules and the rules may not be very meaningful. Also, the model may take a long time to build. Typically it is desirable to only generate rules from frequent itemsets that are well-represented in the data.
The minimum frequent itemset support is a user-specified percentage that limits the number of frequent itemsets produced by the model. A frequent itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for rules.
The ASSO_MIN_SUPPORT setting specifies the minimum frequent itemset support. It also applies to the rules derived from the frequent itemsets. For details, see Oracle Database PL/SQL Packages and Types Reference.
Table 8-2 shows the frequent itemsets that can be used to generate rules from the transactions in Figure 8-2 if the minimum support is 8%.
Table 8-2 Frequent Itemsets with Support >= 8%
Frequent Itemset | Occurrences | Percentage |
---|---|---|
(B,D) | 2 | 8% |
(B,E) | 3 | 12% |
(D,E) | 2 | 8% |
(B,D,E) | 2 | 8% |
(B,C) | 2 | 8% |
(C,E) | 2 | 8% |
(B,C,E) | 2 | 8% |
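The effect of a minimum support threshold can be sketched as follows. This is a brute-force illustration rather than the algorithm Oracle Data Mining uses; the function name `frequent_itemsets` and the four sample transactions are assumptions for this example and are not the data behind Table 8-2.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_length):
    """Return {itemset: support} for itemsets that meet min_support.

    Support is the fraction of all transactions that contain the itemset.
    """
    counts = Counter()
    for items in transactions:
        ordered = sorted(set(items))
        for size in range(2, max_length + 1):
            counts.update(combinations(ordered, size))
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }

# Hypothetical transactions, chosen only to keep the arithmetic easy.
transactions = [
    {"B", "D", "E"},
    {"A", "B", "C", "E"},
    {"B", "C", "D", "E"},
    {"A", "D"},
]

# Keep itemsets that appear in at least half of the transactions.
result = frequent_itemsets(transactions, min_support=0.5, max_length=3)
```

Raising `min_support` shrinks `result`, which is the behavior the minimum support setting is meant to control.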
Tip:
Increase the minimum support if you want to decrease the build time for the model and generate fewer rules.
Association rules express probabilistic relationships between items in a frequent itemset. Rules are formulated as IF-THEN statements. For example, a rule derived from a frequent itemset containing A, B, and C might state that if A and B are included in a transaction, then C is likely to also be included.
Association rules are different from the rules produced by decision trees or clustering algorithms. In the latter, the IF component is followed by a boolean expression; the THEN component is evaluated only if the IF component is true. For example: "IF age > 45 and frequent_buyer = true, THEN likely_prospect."
Association rules are simply sets of items that can occur in a transaction. The items in the IF component imply the likely presence of the items in the THEN component. Association rules are probabilistic in nature. The IF-THEN statements express correlation, not causation.
The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common.
Oracle Data Mining supports association rules that have one or more items in the antecedent and a single item in the consequent.
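As a minimal sketch of the antecedent and consequent structure described above, the following Python function splits a frequent itemset into rules that have a single item in the consequent and all remaining items in the antecedent. The function name `rules_from_itemset` is hypothetical, and the sketch does not compute confidence or filter the rules.

```python
def rules_from_itemset(itemset):
    """Return (antecedent, consequent) pairs with a single-item consequent.

    The antecedent holds every other item of the itemset, so the two
    parts of each rule are disjoint.
    """
    items = sorted(itemset)
    rules = []
    for consequent in items:
        antecedent = tuple(item for item in items if item != consequent)
        rules.append((antecedent, consequent))
    return rules

# The itemset (B,D,E) yields: D and E imply B, B and E imply D, B and D imply E.
rules = rules_from_itemset(("B", "D", "E"))
```

Rules with shorter antecedents, such as "If B then D", come from applying the same function to the smaller frequent itemsets.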
Rules have an associated confidence, which is the conditional probability that the consequent will occur given the occurrence of the antecedent.
The ASSO_MIN_CONFIDENCE setting specifies the minimum confidence for rules. The model eliminates any rules with confidence below the required minimum.
Table 8-3 shows the rules derived from the frequent itemsets listed in Table 8-2. The confidence for each rule is included. For example, the confidence of the rule "If B then D" is 28.5%, because there are seven rules with antecedent B and two of them have D as the consequent (2/7 = 28.5%).
Table 8-3 Frequent Itemsets and Rules
Frequent Itemset | Rules | Confidence |
---|---|---|
(B,D) | (If B then D) (If D then B) | 28.5% 50% |
(B,E) | (If B then E) (If E then B) | 43% 43% |
(D,E) | (If D then E) (If E then D) | 50% 28.5% |
(B,D,E) | (If B then D) (If D then B) (If B then E) (If E then B) (If D then E) (If E then D) (If B and D then E) (If B and E then D) (If D and E then B) | 28.5% 50% 43% 43% 50% 28.5% 100% 50% 100% |
(B,C) | (If B then C) (If C then B) | 28.5% 50% |
(C,E) | (If C then E) (If E then C) | 50% 28.5% |
(B,C,E) | (If B then C) (If C then B) (If B then E) (If E then B) (If C then E) (If E then C) (If B and C then E) (If B and E then C) (If C and E then B) | 28.5% 50% 43% 43% 50% 28.5% 100% 50% 100% |
Tip:
Increase the minimum confidence if you want to decrease the build time for the model and generate fewer rules.
Minimum support and confidence are used to influence the build of an association model. Support and confidence are also the primary metrics for evaluating the quality of the rules generated by the model. Additionally, Oracle Data Mining supports lift for association rules. These statistical measures can be used to rank the rules and hence the predictions.
The support of a rule indicates how frequently the items in the rule occur together. For example, cereal and milk might appear together in 40% of the transactions. If so, the following two rules would each have a support of 40%.
cereal implies milk
milk implies cereal
Support is the ratio of transactions that include all the items in the antecedent and consequent to the number of total transactions.
Support can be expressed in probability notation as follows.
support(A implies B) = P(A, B)
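The support calculation can be illustrated with a small Python sketch. The `support` function and the ten sample baskets are assumptions for this example; they are chosen so that cereal and milk appear together in 40% of the transactions, as in the text above.

```python
def support(transactions, items):
    """Return the fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    matching = sum(1 for basket in transactions if wanted <= set(basket))
    return matching / len(transactions)

# Hypothetical baskets: 4 of the 10 contain both cereal and milk.
baskets = (
    [{"cereal", "milk"}] * 4
    + [{"cereal"}] * 1
    + [{"milk"}] * 2
    + [{"bread"}] * 3
)

rule_support = support(baskets, {"cereal", "milk"})  # 4 / 10
```

Because support depends only on the set of items, "cereal implies milk" and "milk implies cereal" receive the same value.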
The confidence of a rule indicates how likely the consequent is to appear in a transaction that contains the antecedent. Confidence is the conditional probability of the consequent given the antecedent. For example, cereal might appear in 50 transactions; 40 of the 50 might also include milk. The rule confidence would be:
cereal implies milk with 80% confidence
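Confidence is the ratio of the number of transactions that include both the antecedent and the consequent to the number of transactions that include the antecedent. Equivalently, it is the rule support divided by the support of the antecedent.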
Confidence can be expressed in probability notation as follows.
confidence(A implies B) = P(B | A), which is equal to P(A, B) / P(A)
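The confidence calculation for the cereal example can be sketched in Python as follows. The `confidence` function and the sample baskets are assumptions for this illustration; the baskets are built so that cereal appears in 50 transactions, 40 of which also include milk.

```python
def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from a list of item sets."""
    antecedent = set(antecedent)
    rule_items = antecedent | set(consequent)
    antecedent_count = sum(1 for basket in transactions if antecedent <= basket)
    rule_count = sum(1 for basket in transactions if rule_items <= basket)
    if antecedent_count == 0:
        raise ValueError("the antecedent does not occur in any transaction")
    return rule_count / antecedent_count

# Hypothetical baskets: 50 contain cereal, and 40 of those also contain milk.
baskets = (
    [{"cereal", "milk"}] * 40
    + [{"cereal"}] * 10
    + [{"milk"}] * 30
    + [{"bread"}] * 20
)

cereal_to_milk = confidence(baskets, {"cereal"}, {"milk"})  # 40 / 50
milk_to_cereal = confidence(baskets, {"milk"}, {"cereal"})  # 40 / 70
```

Unlike support, confidence is not symmetric: the two rules share the same support but have different confidence values, because the antecedents occur in different numbers of transactions.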
Both support and confidence must be used to determine if a rule is valid. However, there are times when both of these measures may be high, and yet still produce a rule that is not useful. For example:
Convenience store customers who buy orange juice also buy milk with a 75% confidence. The combination of milk and orange juice has a support of 30%.
This at first sounds like an excellent rule, and in most cases, it would be. It has very high confidence and very high support. However, what if convenience store customers in general buy milk 90% of the time? In that case, orange juice customers are actually less likely to buy milk than customers in general.
A third measure is needed to evaluate the quality of the rule. Lift indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent, given their individual support. It provides information about the improvement, the increase in probability of the consequent given the antecedent. Lift is defined as follows.
Lift = (Rule Support) / (Support(Antecedent) * Support(Consequent))
This can also be defined as the confidence of the combination of items divided by the support of the consequent. So in our milk example, assuming that 40% of the customers buy orange juice, the improvement would be:
30% / (40% * 90%)
which is 0.83, an improvement of less than 1.
Any rule with an improvement of less than 1 does not indicate a real cross-selling opportunity, no matter how high its support and confidence, because it actually offers less ability to predict a purchase than does random chance.
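The orange juice example can be checked with a few lines of Python. The `lift` function name is an assumption for this sketch; the three support values are the ones given in the text.

```python
def lift(rule_support, antecedent_support, consequent_support):
    """Return rule support divided by the product of the individual supports."""
    return rule_support / (antecedent_support * consequent_support)

# Values from the example: juice and milk together 30%, juice 40%, milk 90%.
juice_milk_lift = lift(0.30, 0.40, 0.90)  # about 0.83

# The equivalent definition: rule confidence divided by consequent support.
rule_confidence = 0.30 / 0.40  # 75%
alternative_lift = rule_confidence / 0.90

# A lift below 1 means the antecedent makes the consequent less likely.
is_useful_rule = juice_milk_lift > 1
```

Both definitions give the same value, and `is_useful_rule` is false for this rule despite its high support and confidence.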
This example shows association rules mined from sales transactions in the SH schema. The data is stored in SH.SALES, SH.PRODUCTS, and SH.CHANNELS. The build settings and some of the products for sale are shown in Figure 8-3.
Figure 8-3 Build Settings for Association Rules
Twenty-four rules, shown in Figure 8-4, are generated with these settings for this data.
Figure 8-4 Sample Association Rules for Product Sales
You can query the rules in different ways as shown in Figure 8-5.
Figure 8-5 Selecting Rules for Examination
The rules that have "Mouse Pad" as the consequent are shown in Figure 8-6.
Oracle Data Mining uses the Apriori algorithm for association rules. Apriori works by iteratively enumerating itemsets of increasing lengths subject to the minimum support threshold.
See Also:
Chapter 10, "Apriori"
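The iterative enumeration that Apriori performs can be sketched in Python as shown here. This is a simplified textbook version under stated assumptions, not Oracle's implementation: the `apriori` function, its parameters, and the sample transactions are hypothetical, and single items are tracked only as stepping stones toward longer itemsets.

```python
from itertools import combinations

def apriori(transactions, min_support, max_length):
    """Level-wise search for frequent itemsets (simplified sketch)."""
    baskets = [frozenset(t) for t in transactions]
    total = len(baskets)

    def is_frequent(candidate):
        count = sum(1 for basket in baskets if candidate <= basket)
        return count / total >= min_support

    # Level 1: frequent single items.
    all_items = sorted({item for basket in baskets for item in basket})
    level = {frozenset([i]) for i in all_items if is_frequent(frozenset([i]))}
    frequent = set(level)

    size = 1
    while level and size < max_length:
        size += 1
        # Join step: merge frequent itemsets into candidates of the next size.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        # Count step: keep the candidates that meet the support threshold.
        level = {c for c in candidates if is_frequent(c)}
        frequent |= level
    return frequent

# Hypothetical transactions for illustration.
transactions = [
    {"B", "D", "E"},
    {"A", "B", "C", "E"},
    {"B", "C", "D", "E"},
    {"A", "D"},
]
found = apriori(transactions, min_support=0.5, max_length=3)
multi_item = {itemset for itemset in found if len(itemset) >= 2}
```

The prune step is what keeps the search tractable: a candidate such as (B,C,D) is discarded without counting, because its subset (C,D) is already known to fall below the minimum support.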