9 Feature Selection and Extraction

This chapter describes the feature selection and extraction mining functions. Oracle Data Mining supports a supervised form of feature selection and an unsupervised form of feature extraction.

See Also:

"Supervised Data Mining"

"Unsupervised Data Mining"

This chapter contains the following sections:

Finding the Best Attributes
Feature Selection
Feature Extraction
Algorithms for Feature Selection and Extraction

Finding the Best Attributes

Sometimes too much information can reduce the effectiveness of data mining. Some of the attributes (columns of data) assembled for building and testing a model may not contribute meaningful information to the model. Some may actually detract from the quality and accuracy of the model.

For example, you might collect a great deal of data about a given population because you want to predict the likelihood of a certain illness within this group. Some of this information, perhaps much of it, will have little or no effect on susceptibility to the illness. Attributes such as the number of cars per household will probably have no effect whatsoever.

Irrelevant attributes add noise to the data and reduce model accuracy. Noise also increases the size of the model and the time and system resources needed for model building and scoring.

Moreover, data sets with many attributes may contain groups of attributes that are correlated. These attributes may actually be measuring the same underlying feature. Their presence together in the build data can skew the logic of the algorithm and affect the accuracy of the model.
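
One rough way to spot attributes that may be measuring the same underlying feature is to compute the correlation between pairs of numeric columns. The sketch below is illustrative only: the column names and values are hypothetical, and Pearson correlation is an assumed diagnostic, not something Oracle Data Mining is stated to use.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: the same height recorded in two units, plus an
# unrelated attribute.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]
cars = [2, 0, 1, 3, 1]

r_redundant = pearson(height_cm, height_in)  # very close to 1: same feature twice
r_unrelated = pearson(height_cm, cars)       # much closer to 0

print(f"height_cm vs height_in: {r_redundant:.3f}")
print(f"height_cm vs cars:      {r_unrelated:.3f}")
```

A coefficient near 1 or -1 suggests that one of the two columns adds little new information to the build data.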

Wide data (many attributes) generally presents processing challenges for data mining algorithms. Model attributes are the dimensions of the processing space used by the algorithm. The higher the dimensionality of the processing space, the higher the degree of complexity involved in algorithmic processing.

Note:

Oracle Data Mining supports two algorithms that are uniquely optimized for processing high-dimensional data: Support Vector Machine and Generalized Linear Models. See Chapter 18 and Chapter 12, respectively.

To minimize the effects of noise, correlation, and high dimensionality, some form of dimension reduction is often a desirable preprocessing step for data mining. Feature selection and extraction are two approaches to dimension reduction.

Feature selection — Selecting the most relevant attributes
Feature extraction — Combining attributes into a new reduced set of features

Feature Selection

Oracle Data Mining supports feature selection in the attribute importance mining function. Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target.

Finding the most significant predictors is the goal of some data mining projects. For example, a model might seek to find the principal characteristics of clients who pose a high credit risk.

Attribute importance is also useful as a preprocessing step in classification modeling, especially for models that use Naive Bayes or Support Vector Machine. By contrast, the Decision Tree algorithm includes components that rank attributes as part of the model build.

Oracle Data Mining does not support the scoring operation for attribute importance. The results of attribute importance are the attributes of the build data ranked according to their predictive influence.

Data for Attribute Importance

Figure 9-1 shows six columns and ten rows from the case table used to build the Oracle Data Mining sample attribute importance model, ai_sh_sample. A target value of 1 has been assigned to customers who increased spending with an affinity card; a value of 0 has been assigned to customers who did not increase spending.

Figure 9-1 Sample Build Data for Attribute Importance

Example: Attribute Importance

Figure 9-2 shows the results returned by the model ai_sh_sample. The attributes used to build the model are ranked in order of their significance in predicting the target. The results show that household size and marital status have the most effect on whether or not the customer will increase spending with an affinity card.

Figure 9-2 Attribute Importance in Oracle Data Miner

Negative rankings indicate noise. Attributes ranked at zero or less do not contribute to the prediction and should probably be removed from the data.

It is important to keep in mind that the model ranks the attributes in relation to each other. Given the collection of attributes shown in Figure 9-2, household size is approximately twice as important as length of residence in predicting spending with an affinity card. This does not mean that these factors would have the same weight if additional factors (new attributes) were taken into consideration or if some existing attributes were removed.

See Also:

Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample programs

Example: Predictive Analytics EXPLAIN

The predictive analytics EXPLAIN operation also implements attribute importance. Figure 9-3 shows EXPLAIN results in Microsoft Excel, using the Oracle Data Mining Spreadsheet Add-In for Predictive Analytics.

The results show that the attribute RELATIONSHIP is the most important in predicting the target designated for this data set. Attributes ranked 13 or greater are noise.

Figure 9-3 EXPLAIN in the Spreadsheet Add-In for Predictive Analytics

See Also:

Chapter 3 for information about Oracle predictive analytics

Feature Extraction

Feature extraction is an attribute reduction process. Unlike feature selection, which ranks the existing attributes according to their predictive significance, feature extraction actually transforms the attributes. The transformed attributes, or features, are linear combinations of the original attributes.

The feature extraction process results in a much smaller and richer set of attributes. The maximum number of features is controlled by the FEAT_NUM_FEATURES build setting for feature extraction models.

Models built on extracted features may be of higher quality, because the data is described by fewer, more meaningful attributes.

Feature extraction projects a data set with high dimensionality onto a smaller number of dimensions. As such, it is useful for data visualization, since a complex data set can be effectively visualized when it is reduced to two or three dimensions.

Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.

For example, feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.
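
To make the document representation concrete, the following sketch builds a keyword-frequency matrix of the kind a feature extraction algorithm would take as input. The three documents and their keywords are hypothetical, and the code does not use any Oracle Data Mining interface.

```python
from collections import Counter

# Hypothetical three-document collection, already reduced to keywords.
documents = {
    "doc1": "loan credit risk credit",
    "doc2": "credit loan default",
    "doc3": "vaccine illness vaccine symptoms",
}

# Count keyword frequencies per document.
frequencies = {name: Counter(text.split()) for name, text in documents.items()}

# A fixed, sorted vocabulary gives every document the same column order.
vocabulary = sorted({word for counts in frequencies.values() for word in counts})

# One row per document, one column per keyword. Counter returns 0 for
# keywords that a document does not contain.
matrix = [[frequencies[name][word] for word in vocabulary] for name in documents]

print(vocabulary)
for name, row in zip(documents, matrix):
    print(name, row)
```

In this toy collection, a theme about lending would weight the keywords credit and loan heavily, while a theme about health would weight vaccine and illness.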

Data for Feature Extraction

Figure 9-4 shows the columns in the case table used to build the Oracle Data Mining sample feature extraction model, nmf_sh_sample. The CUST_ID column holds the case identifier. No column is designated as a target for feature extraction since the algorithm is unsupervised.

Figure 9-4 Sample Build Data for Feature Extraction

Example: Features Created from Build Data

Figure 9-5 shows information about a feature in Oracle Data Miner. It shows the attribute values with coefficients between 0.05 and 0.09 for feature 5 extracted from the build data. Each coefficient is the weight of that attribute value in the linear combination of attributes that constitutes the feature. A higher coefficient indicates a higher influence of the attribute on the feature.

Figure 9-5 Feature Details

Typically these features would be used as input to another model. For an example, see "Sample Text Mining Problem".

Example: Scored Features

Figure 9-6 shows some of the results when the model is applied to a different set of customer data. Cases are assigned to features with a value that indicates the importance of that case in that feature.
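
This chapter does not describe how the scored values are computed. As one plausible illustration, the sketch below keeps a set of already extracted features fixed and finds non-negative weights that best reconstruct a new case from them, using a multiplicative update of the kind used in non-negative matrix factorization. The function name `score_case`, the feature vectors, and the cases are all hypothetical.

```python
def score_case(case, features, iterations=100, eps=1e-12):
    """Find non-negative weights w so that sum(w[k] * features[k]) approximates case."""
    k = len(features)
    weights = [1.0] * k
    # Dot product of the case with each feature, and dot products between features.
    numer = [sum(c * f for c, f in zip(case, feat)) for feat in features]
    gram = [[sum(a * b for a, b in zip(fa, fb)) for fb in features] for fa in features]
    for _ in range(iterations):
        denom = [sum(weights[j] * gram[j][i] for j in range(k)) for i in range(k)]
        # Multiplicative update: weights stay non-negative at every step.
        weights = [weights[i] * numer[i] / (denom[i] + eps) for i in range(k)]
    return weights

# Two hypothetical features over four attributes.
features = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

weights_a = score_case([2.0, 2.0, 0.0, 0.0], features)  # close to [2, 0]
weights_b = score_case([1.0, 1.0, 3.0, 3.0], features)  # close to [1, 3]

# The feature with the largest weight is the one the case expresses most strongly.
strongest_b = max(range(len(weights_b)), key=lambda i: weights_b[i])
print(weights_a, weights_b, strongest_b)
```

Each weight plays the role of the value reported for a case and feature pair: the larger the weight, the more strongly that feature is expressed in the case.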

Figure 9-6 Features in Scored Data

See Also:

Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample models

Algorithms for Feature Selection and Extraction

Oracle Data Mining uses the Minimum Description Length (MDL) algorithm for feature selection (attribute importance).
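
The general idea behind a description-length ranking can be sketched as follows. Each attribute is treated as a simple model of the target, and its importance is the number of bits saved when the target is encoded with the attribute's help rather than without it. This is a simplified two-part code, assuming a penalty of half of log2(N) bits per parameter. It is not Oracle's implementation, and the data is hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(labels):
    """Shannon entropy of a list of labels, in bits per case."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def description_length(groups, num_classes, n):
    """Two-part code: bits for the model parameters plus bits for the targets."""
    model_bits = 0.5 * len(groups) * (num_classes - 1) * math.log2(n)
    data_bits = sum(len(g) * entropy_bits(g) for g in groups)
    return model_bits + data_bits

def mdl_importance(column, target):
    """Bits saved by encoding the target with the attribute versus without it."""
    n = len(target)
    num_classes = len(set(target))
    by_value = defaultdict(list)
    for value, label in zip(column, target):
        by_value[value].append(label)
    baseline = description_length([list(target)], num_classes, n)
    with_attribute = description_length(list(by_value.values()), num_classes, n)
    return baseline - with_attribute

# Hypothetical build data: 16 cases with a binary target.
target = [1] * 8 + [0] * 8
attributes = {
    "household_size": ["3+"] * 8 + ["1-2"] * 8,  # separates the target perfectly
    "cars": ["one", "two"] * 8,                  # unrelated to the target
}

scores = {name: mdl_importance(col, target) for name, col in attributes.items()}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, bits in ranking:
    print(f"{name}: {bits:+.1f} bits")
# household_size saves 14.0 bits; cars costs 2.0 bits more than the baseline.
```

In this sketch an uninformative attribute receives a negative score because its parameters cost bits without shortening the encoding of the target, which is in line with the treatment of attributes ranked at zero or less as noise.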

Oracle Data Mining uses the Non-Negative Matrix Factorization (NMF) algorithm for feature extraction.
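
As a minimal sketch of the technique, NMF approximates a non-negative data matrix V (cases by attributes) as the product of two smaller non-negative matrices: W (cases by features) and H (features by attributes). The code below uses the widely known multiplicative update rules for squared error. It is an illustration under stated assumptions, not Oracle's implementation: the data is hypothetical, and the `num_features` parameter only plays a role similar to the FEAT_NUM_FEATURES setting.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, num_features, iterations=500, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(num_features)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(num_features)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element by element
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps) for j in range(m)]
             for a in range(num_features)]
        # W <- W * (V H^T) / (W H H^T), element by element
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps) for a in range(num_features)]
             for i in range(n)]
    return w, h

# Hypothetical data: four cases, four attributes, two underlying patterns.
v = [
    [2.0, 2.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
]
w, h = nmf(v, num_features=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

Each row of H is a feature expressed as non-negative coefficients over the original attributes, and each row of W gives the weights of those features for one case.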

See Oracle Data Mining Application Developer's Guide for information about feature extraction for text mining.