Creating a Data Subset Definition

The procedure described in this section enables you to create a subset database, after which you can perform other tasks, such as editing the properties of the subset definition or exporting a subset definition.

The interface also allows you to perform inline, or at the source, masking while creating the subset definition.

Before proceeding, ensure that you have the following privileges:

EM_ALL_OPERATOR for Enterprise Manager Cloud Control users
SELECT_CATALOG_ROLE for database users
SELECT_ANY_DICTIONARY privilege for database users
Additionally, to perform an in-place delete operation, the DBA user must be granted the EXECUTE_ANY_TYPE privilege

To create a data subset definition:

From the Enterprise menu, select Quality Management, then Data Subset Definitions.
Open the Actions menu in the Data Subset Definitions page, then select Create, or just click the Create icon.
Define the data subset definition properties:
1. Provide the requisite information in the General pop-up that appears, then click Continue.
  
  You can select any source database associated with the Application Data Model.
  
  If you are performing masking within the subset definition, you must select the same ADM and target used in creating the masking definition.
2. Provide a job name, credentials, and specify a schedule in the Schedule Application Detail Collection pop-up that appears, then click Submit.
  
  If you want to use new credentials, choose the New Credentials option. Otherwise, choose the Preferred Credentials or Named Credentials option.
The space estimate collection job runs, and then displays the Data Subset Definitions page. Your definition appears in the table, and the Most Recent Job Status column should indicate Scheduled, Running, or Succeeded, depending on the schedule option selected and time required to complete the job.
Select the definition within the table, open the Actions menu, then select Edit.

The Database Login page appears.
Select either Named Credentials or New Credentials if you have not already set preferred credentials, then click Login.
In the Applications subpage of the Edit page, move applications from the Available list to the Selected list as follows:
- If you intend only to mask the data (no subsetting), select all applications.
- If you intend only to subset the data (no masking), select specific applications as appropriate.
- If you intend both to subset and mask the data, the applications selected must include those that the masking definitions require.
The names of application suites, applications, or application modules are maintained in the Application Data Model.
Click the Table Rules tab.

Note:

If you are masking only, set the Default Table Rules option to include all rows and skip to Step 13. The Column Rules tab, Rule Parameters tab, and additional features on the Table Rules tab pertain specifically to subsetting.

You can add rules here to define the data to include in the subset.
Select Actions, then Create to display the Table Rule pop-up, or just click the Create icon.
1. Select the application for which you want to provide a rule.
  
  Associate the rule with all tables, a specific table, or a category of tables.
2. In the Rows to Include section, select the option that best suits your needs for a representative sample of production data. If you do not want to include all rows, you can include some rows by specifying a percentage portion of the rows. For finer granularity, you could specify a Where clause, such as where region_id=6.
  
  For more information on specifying Where clauses, see Step 8.e.
3. In the Include Related Rows section, do one of the following:
  - Select Ancestor and Descendant Tables
    
    This rule impacts the parent and child columns, and ensures that referential integrity is maintained, and that child columns are also selected as part of the subset.
  - Select Ancestor Tables Only
    
    This rule only impacts the parent columns, and ensures that referential integrity is maintained.
  If you disable the Include Related Rows check box, referential integrity may not be maintained. However, you can subsequently provide additional rules to restore referential integrity. You can disable this check box whether or not you specify a Where clause.
4. If you want to specify a Where clause, go to the next step. Otherwise, skip to Step 9.
5. Provide a rule parameter, if desired, for the clause.
  
  For instance, if you specify a particular value for an employee ID as employee_id=:emp_id, you could enter query values for the default of 100:
  - Select the Rows Where button and enter employee_id=:emp_id.
  - Click OK to save the rule and return to the Table Rules tab.
    
    If this is a new rule, a warning appears stating that "Rule parameters corresponding to the bind variables 'emp_id' should be created before generating subset."
  - Select the table rule, click the Rule Parameters tab, then click Create.
    
    The Rule Parameter Properties pop-up appears.
  - Enter emp_id for the Name and 100 for the Value.
    
    Note:
    
    The colon ( : ) preceding emp_id is only present in the Where clause, and not required when creating a new rule parameter.
  - Click OK to save the properties, which now appear in the Rule Parameters tab.
  - Skip to Step 10.
Click OK to save the rule and return to the Table Rules tab.

The new rule is displayed in the list. The related tables are displayed in the table below. Related rows from the tables are included in the subset to provide referential integrity in the subset database.
In the Related Tables section of the Table Rules tab, you can manage the size of the subset by controlling the levels of ancestors and descendants within the subset. Notice that each node in the table has a check box. By default, all nodes are included in the subset, as indicated by the check mark. Deselect the check box to exclude a node from the subset. The deselection option is disabled for parent rows (the join columns to the right identify parent and child rows). In addition, you can make these other refinements to subset content:
- Click Allow Excluding Parent Tables. This enables the check marks that were grayed out. You can now selectively exclude parent rows from the subset by deselecting the check box.
- Select a node within the table and click Add Descendants to include related rows. In the dialog that opens, make appropriate selections and click OK.
As you make these refinements, columns on the right reflect the effect on space estimates for the subset. The Related Tables section also denotes the processing order of the ancestor and descendant tables, including the detailed impact of including each table. When you are done with the refinements, go to the Space Estimates tab to see a finer granularity of the impact on the overall size of the subset.
In the Default Table Rows section of the Table Rules tab, choose whether you want to include or exclude the tables not affected by the defined rules in the subset.

When you select the Include All Rows option, all of the rows for the table are selected as part of the subset.

This is a global rule and applies to the entire subset. You can only select the Include All Rows option when all of the rules have a scope of None. A scope of None is established when you uncheck the Include Related Rows option in the Table Rule pop-up.

Note:

For a subset definition that has column rules (see Step 12), be sure to use table rules to include the corresponding tables. You can use the Default Table Rules option to include all tables not affected by table rules, if required.
Optional: Click the Column Rules tab to perform inline masking as part of the subset definition.
1. Click Create and enter search criteria to filter on columns within the schema. These would typically be vertical columns such as CLOB AND BLOB columns.
  
  Note:
  
  If you are using column rules instead of masking definitions (see Step 13), you can select no more than 10 columns in a given table. This restriction applies to the export method but not to the in-place delete method.
  
  Click OK.
2. Select a row or rows in the column search results and click Manage Masking Formats.
3. In the pop-up dialog, select a masking format and value to apply to the columns. For multiselection, the same format must be appropriate for all columns. If you select multiple columns, ensure that the column rule format you choose is applicable to the selected columns. Use the columns (flags) not null and unique to enforce compliance.
  
  Click OK to apply the masking format to the columns.
Optional: Click the Data Masking tab to include masking definitions as part of the subsetting operation or to perform at the source data masking only.
1. Click Add.
2. In the pop-up dialog, enter search criteria to retrieve appropriate definitions. Be sure to select the desired radio button (All or Any). All formats except compound masking are supported for inline masking.
  
  Note:
  
  No single table within a masking definition can have more than 10 masked columns if you are using the export method. The restriction does not apply to the in-place delete method.
  
  Click OK.
The search results appear in the data masking table.
Click the Space Estimates tab.
- Note the value in the Estimated Subset Size MB column. The space estimates depend on optimizer statistics, and the actual distribution of data can only be calculated if histogram statistics are present.
- Whenever you add new rules, recheck the space estimates for updated values.
- Data in the Space Estimates subpage is sorted with the largest applications appearing at the top.
Note:

Space estimates do not reflect the effect of data masking, if used.

If you provide a Where clause and subsequent rule parameter properties, the Space Estimates subpage is updated with the value contained in the Rule Parameters tab.
Optional: click the Pre/Post Subset Scripts tab.
- You can specify a pre-subset script to run on the subset database before you select subset data.
- You can specify a post-subset script to run on the subset database after you assemble the subset data.
- Either script type runs on the source database.
Click Return.

The definition is complete and displayed in the Data Subset Definitions table.

You can now proceed with script generation. Alternatively, you may want to save the script for future use. In either case, you must decide whether to export data to a dump file or delete data from a target database.

Tip:

If you have a very large database of 4 terabytes, for instance, and you want to export a small percentage of the rows, such as 10%, it is more advantageous to use the export method. Using the in-place delete method would require 3.6 terabytes of data, which would not perform as quickly as the export method.

The in-place delete method is recommended when the amount of data being deleted is a small percentage of the overall data size.

There is an EMCLI verb if you want to perform an in-place delete remotely or script it.