10 Apriori

This chapter describes Apriori, the algorithm used by Oracle Data Mining for calculating association rules.

About Apriori

An association mining problem can be decomposed into two subproblems:

1. Find all combinations of items, called frequent itemsets, whose support is greater than the specified minimum support.
2. Use the frequent itemsets to generate the desired rules.

A rule consists of an antecedent and a consequent. The antecedent describes a condition. The consequent describes the result implied by the condition. For example, in the rule "ABC implies D," ABC is the antecedent, and D is the consequent. The Oracle Data Mining association function supports single-consequent rules only.
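The following Python sketch illustrates how single-consequent rules can be derived from frequent itemsets. It is an illustration only, not the Oracle Data Mining implementation. The function name single_consequent_rules, the hand-written supports table, and the use of "greater than or equal to" for the confidence threshold are assumptions made for the example. Confidence is computed in the usual way, as the support of the whole itemset divided by the support of the antecedent.

```python
def single_consequent_rules(supports, min_confidence):
    """Derive rules "antecedent implies consequent" with exactly one consequent item.

    supports maps each frequent itemset (a frozenset) to its support and is
    assumed to contain every subset of every itemset it lists.
    """
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one antecedent item
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = support / supports[antecedent]
            # Assumption: the threshold is inclusive.
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical supports for two items, A and B.
supports = {
    frozenset({"A"}): 0.8,
    frozenset({"B"}): 0.6,
    frozenset({"A", "B"}): 0.6,
}

rules = single_consequent_rules(supports, min_confidence=0.8)
# Only "B implies A" survives: confidence 0.6 / 0.6 = 1.0.
# "A implies B" has confidence 0.6 / 0.8 = 0.75 and is dropped.
```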

The Apriori algorithm works by iteratively enumerating itemsets of increasing length, keeping at each pass only the itemsets that satisfy the minimum support threshold.
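The level-wise enumeration can be sketched in Python as follows. This is a minimal textbook-style version, not the Oracle Data Mining implementation. The function name frequent_itemsets, its parameters, and the sample transactions are assumptions made for the example. Support is taken to be the fraction of transactions that contain the itemset, and an itemset is kept only if its support is strictly greater than the minimum.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_length=None):
    """Return {itemset: support} for itemsets with support > min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(candidate):
        # Fraction of transactions that contain every item of the candidate.
        return sum(1 for b in baskets if candidate <= b) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s > min_support}

    # Pass 1: itemsets of length 1.
    items = {item for b in baskets for item in b}
    current = keep_frequent(frozenset([item]) for item in items)
    result = dict(current)

    k = 2
    while current and (max_length is None or k <= max_length):
        # Join: combine frequent (k-1)-itemsets into candidates of length k.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    return result


# Hypothetical sample data: five transactions over the items A, B, C, D.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
found = frequent_itemsets(transactions, min_support=0.5)
# Frequent: A, B, C (support 0.8) and AB, AC, BC (support 0.6).
# ABC is generated as a candidate but has support 0.4, so it is discarded.
```

The prune step is what keeps the enumeration tractable: a candidate is counted against the data only if all of its shorter subsets were already found to be frequent.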

You can use model settings to specify the maximum length and the minimum support and confidence for rules. These settings apply to the association mining function. See Chapter 8, "Association".

Association rule mining is not recommended for finding associations involving rare events in problem domains with a large number of items. Classification models may be more suitable in such problem domains.

Apriori discovers patterns with frequency above the minimum support threshold. Therefore, to find associations involving rare events, the algorithm must run with very low minimum support values. However, doing so can cause the number of enumerated itemsets to grow explosively, especially when the number of items is large, and this can increase the execution time significantly.
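To give a sense of scale, the following Python sketch counts how many itemsets exist up to a given length. This is a worst-case upper bound that assumes a minimum support so low that no itemset is pruned; real data normally prunes far more. The function name worst_case_itemsets is an assumption made for the example.

```python
from math import comb


def worst_case_itemsets(n_items, max_length):
    """Number of distinct itemsets of length 1..max_length over n_items items."""
    return sum(comb(n_items, k) for k in range(1, max_length + 1))


small = worst_case_itemsets(100, 3)    # 100 + 4,950 + 161,700 = 166,750
large = worst_case_itemsets(1000, 3)   # 1,000 + 499,500 + 166,167,000 = 166,667,500
```

A tenfold increase in the number of items raises this bound roughly a thousandfold for itemsets of length up to 3, which is why very low minimum support values are costly in domains with many items.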

Data for Association Rules

Association models are designed to use transactional data. Nulls in transactional data are assumed to represent values that are known but not present in the transaction. For example, three items out of hundreds of possible items might be purchased in a single transaction. The items that were not purchased are known but not present in the transaction.

Transactional data, by its nature, is sparse. Only a small fraction of the attributes are nonzero or non-null in any given row. Apriori interprets all null values as indications of sparsity.
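The following Python sketch illustrates this interpretation. Transactional rows list only the items that are present, and any item missing from a transaction is treated as absent rather than unknown. The function name to_baskets, the (transaction ID, item) row layout, and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def to_baskets(rows):
    """Group (transaction_id, item) rows into one set of items per transaction."""
    baskets = defaultdict(set)
    for transaction_id, item in rows:
        baskets[transaction_id].add(item)
    return dict(baskets)


# Hypothetical transactional rows: only purchased items are recorded.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "milk"), (2, "eggs"), (2, "butter"),
]
baskets = to_baskets(rows)

# Expanding to one column per catalog item shows the sparsity explicitly:
# an item with no row in a transaction becomes 0 (known, but not present).
catalog = ["bread", "butter", "eggs", "milk"]
dense = {
    tid: [1 if item in basket else 0 for item in catalog]
    for tid, basket in baskets.items()
}
# dense[1] is [1, 0, 0, 1] and dense[2] is [0, 1, 1, 1]
```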

Examples of sparse data include market basket and text mining data. In a market basket problem, there might be 1,000 products in the company's catalog, and the average size of a basket (the collection of items that a customer purchases in a typical transaction) might be 20 products. In this example, a transaction (case or record) has, on average, 20 non-null attributes out of 1,000. The fraction of non-null attributes in the table (the density) is therefore 20/1,000, or 2%. This density is typical for market basket and text processing problems. Data with a significantly higher density can require extremely large amounts of temporary space to build associations.
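The density calculation in this example can be reproduced with a short Python sketch. The function name density and the synthetic baskets are assumptions made for the example; the figures of 20 items per basket and 1,000 catalog items come from the text above.

```python
def density(baskets, n_items):
    """Fraction of (transaction, item) cells that are non-null."""
    present = sum(len(basket) for basket in baskets)
    return present / (len(baskets) * n_items)


# Three synthetic baskets of 20 items each, drawn from a catalog of 1,000 items.
sample_baskets = [set(range(start, start + 20)) for start in (0, 100, 500)]
sample_density = density(sample_baskets, n_items=1000)   # 60 / 3000 = 0.02, or 2%
```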