9 Managing Content Categorizer

This section covers the following topics:

"About Content Categorizer"
"Setting Up Content Categorizer"
"Search Rules"
"Sample doc_config.htm Page"
"XSLT Transformation"

9.1 About Content Categorizer

Content Categorizer (CC) is an optional component that is automatically installed with Content Server. When enabled, Content Categorizer suggests metadata values for new documents being checked into Content Server, and for existing documents in Content Server that may or may not already have metadata values. These metadata values are determined according to search rules provided by the System Administrator.

The Batch utility that is included with the component can search a large number of files and create a Batch Loader control file containing appropriate metadata field values. The Batch utility can be used to recategorize existing content (already checked into the content server repository).

This section covers the following topics:

"Search Rules"
"XML Conversion"
"Operating Requirements"
"Operating Modes"

9.1.1 Search Rules

Content Categorizer executes its search rules depending on the type of rule defined:

Pattern Matching and Abstract Rules: Content Categorizer scans a content document looking for "landmarks". A landmark can be specific text, or it can be based on structural properties of the source document, such as styles, fonts, and formatting.
Option List Rule: Content Categorizer searches for keywords whose cumulative score determines which option of an option list is selected. It does not look for either landmarks or specific XML tags.
Categorization Engine Rule: Content Categorizer invokes a third-party categorizer engine and taxonomy to categorize a content item.
Filetype Rule: Content Categorizer looks for the document file type (the filename extension).

Search Rule Override

Normally, a user-entered value on the Content Check In Form prevents Content Categorizer from applying the search rules for that field. This is also true for option list fields that have a default value, such as the Type field.

Important:

It is important to instruct contributors to leave blank any fields they want to have filled by search rules.

The current version of Content Server automatically inserts a blank value as the default value in a custom option list field. In this case, the first value (by default, a blank value) will not be considered a user-entered value, and the Option List search rule will be applied. If you do not want the Option List search rule to override the first value in a custom option list field, you must provide a default value for that option list on the Configuration Manager Applet.
To have Content Categorizer ignore the default value and apply the search rules to the Type field, you can edit the Content Server configuration file. See "Applying Rules to the Type Field".

9.1.2 XML Conversion

For Content Categorizer to recognize structural properties, the content must be converted to XML (eXtensible Markup Language). The conversion method is a user-defined runtime configuration setting. The variable name is sccXMLConversion and its possible values include Flexiondoc, SearchML and None. The None value is used for files that are already XML.

The CC_Sample directory that was installed with Content Categorizer includes a sample source file named Wellington_WordStyle.doc that is "artificially rich" with document properties and styles. The directory also contains sample XML files (Wellington_WordStyle_flexion.xml and Wellington_WordStyle_searchml.xml) that demonstrate the XML that results when the source document is converted by each of the available XML converters.

Important:

There is a problem with the XSLT transformation used to post-process PDF content converted using the Flexiondoc schema. When the Flexiondoc schema is used, single words are assigned to individual XML elements, making the final XML unusable. It is therefore necessary to use SearchML for categorizing PDF content.

Regardless of which XML converter is specified, the XML intermediate files are used only by Content Categorizer, so they are discarded after use, and documents are checked into Content Server in their original source form. The only exception is content that is already in XML format, which is not subjected to the translation process.

This section covers the following topics:

"Flexiondoc XML Converter"
"SearchML XML Converter"

9.1.2.1 Flexiondoc XML Converter

The OutsideIn XML Export technology is used in combination with a custom XSLT style sheet (flexiondoc_to_scc.xsl) to produce XML in a two-stage process. In the first stage, the native document is converted to Flexiondoc-formatted XML. In the second stage, the style sheet is used to further refine the XML so that it is searchable by Content Categorizer. Native document properties and text segments are isolated in XML elements, which are named after the corresponding document property, paragraph style, or character style.

9.1.2.2 SearchML XML Converter

The OutsideIn technology is used in combination with a custom XSLT style sheet (searchml_to_scc.xsl) to produce XML in a two-stage process. In the first stage, the native document is converted to SearchML-formatted XML. In the second stage, the style sheet is used to further refine the XML so that it is searchable by Content Categorizer. Native document properties and text segments are isolated in XML elements, which are named after the corresponding document property or paragraph style. Character styles are not supported by SearchML.

9.1.3 Operating Requirements

To run Content Categorizer, these settings are required:

Define the XML Conversion method in Content Categorizer as one of these:
- sccXMLConversion=Flexiondoc
- sccXMLConversion=SearchML
See "Setting XML Conversion Method".
Define search rules in the Content Categorizer Admin applet. See "Defining Search Rules".
Optional: Define field properties, including default values for the metadata fields in the Content Categorizer Admin Applet. See "Defining Field Properties (Optional)" for more information.

Important:
To use the CATEGORY search rule, you must install, set up and register a categorizer engine before you can define the CATEGORY rule for any metadata fields.

9.1.4 Operating Modes

Content Categorizer can operate in either Interactive mode or Batch mode. Both modes require conversion of the source documents into XML intermediate form. However, the process flows of the two modes are distinctly different.

Interactive mode integrates Content Categorizer with the Content Check In Form and Info Update Form in Content Server. Users click the Categorize button on the form to run Content Categorizer on a single content item. Any value that is returned by Content Categorizer is a "suggested" value, because the contributor can edit or replace the returned value.
Batch mode is used when recategorizing large numbers of documents that are already in the content server repository. The system administrator uses a stand-alone Batch Categorizer utility to run Content Categorizer, and then either performs a "live update" of content metadata or uses the output file from Batch Categorizer as input to the Batch Loader.

This section covers the following topics:

"Interactive Mode Process"
"Batch Mode: Process"

9.1.4.1 Interactive Mode Process

The following steps occur during the checkin process:

A contributor displays the Content Check In Form or the Info Update Form, selects a primary file (only on Content Check In Form), and clicks the Categorize button.
The Content Check In Form copies the primary file to the Content Server host and calls the Content Categorizer service.
Content Categorizer locates the source content.
If the content is already in XML format, no translation occurs, and the process continues at step 6.
If the content is not already in XML format, it will be converted using the specified conversion method:

Flexiondoc
- The content is converted into Flexiondoc-formatted XML.
- The XML is translated into "Content Categorizer-friendly" XML, using flexiondoc_to_scc.xsl.
SearchML
- The content is converted into SearchML-formatted XML.
- The XML is translated into "Content Categorizer-friendly" XML, using searchml_to_scc.xsl.
Content Categorizer applies the search rules to the XML and obtains suggested values for the specified metadata fields.
Content Categorizer inserts the suggested metadata values into the Content Check In Form or Info Update Form, and returns the form to the contributor.
The contributor can check in or submit the document with the suggested values, revise the metadata values, or cancel the checkin or update.

If the optional AddCCToNewCheckin component is installed and enabled, clicking Check In on the Content Check In Form performs steps 2 through 6, above, and automatically completes the check in process, provided the properties for dDocTitle are set to Override Contents. If the properties of dDocTitle are not set to Override Contents, then an alert is displayed requesting that the required field be completed. Field properties are set using the CC Admin Applet. See "Defining Field Properties (Optional)".

9.1.4.2 Batch Mode: Process

The system administrator performs the following steps during this process:

Run the Batch Categorizer application. The application may also be started in Windows by clicking Start, Programs, Content Server, <instance_name>, and Batch Categorizer. See the Oracle Fusion Middleware System Administrator's Guide for Content Server for details about running applications on UNIX systems.

The Batch Categorizer Screen is displayed.
On the Batch Categorizer screen, define filters and release date information to display a list of content to be categorized. Options on the Batch Categorizer screen (as described in the following steps) enable you to further define the exact content to be categorized.
Click Categorize.

The Categorize Existing Screen is displayed.
Select Live Update or Batch Loader.
- Use Live Update option to update the data in the repository immediately.
- Use Batch Loader option to create a control file, which is the output of the Content Categorizer process. The file contains an entry for each source document, and contains the values for each metadata field based on the search rules defined in Content Categorizer.
  
  Tip:
  You can edit/filter this file before submitting it to Batch Loader, or submit it directly to Batch Loader; see next step.
To run the Batch Loader utility automatically after the Content Categorizer process is complete, select the Run Batch Loader check box.
Enter the location and file name for the log file. The log file will contain error information about the Content Categorizer process.
Choose Categorize All or Categorize Selected.
- Use the Categorize All option to categorize all the content items displayed in the content list.
- Use the Categorize Selected option to categorize only the selected (highlighted) content items displayed in the content list.
Choose to categorize Latest Revision or All Revisions.
- Use the Latest Revision option to categorize only the most recent revision of the content items displayed in the content list.
- Use the All Revisions option to categorize all revisions of the content items displayed in the content list.
Choose to continue or discontinue the categorization process when Batch Categorizer encounters an error.
Click OK. The Progress bar shows the progress as the batch process moves through its steps:
1. Content Categorizer locates the source content.
2. If the content is already in XML format, no translation occurs, and the process continues at step 4.
3. If the content is not already in XML format, conversion into XML occurs using the selected XML conversion method: Flexiondoc or SearchML.
4. Content Categorizer applies the search rules to the XML and obtains values for the specified metadata fields.
5. If Live Update was specified, database records are updated immediately. If Batch Loader was specified, an output control file is created, and the Batch Loader utility is run, if the option to do so after processing was specified.
When the batch process is complete, review the error logs. Errors encountered by Batch Categorizer are displayed on the console and also recorded in the batch categorizer log (if specified). Errors encountered by Batch Loader are displayed on the console and also recorded in the Content Server system log.

If the optional AddCCToArchiveCheckin component is installed and enabled, all content loaded into Content Server using the Batch Loader utility is categorized automatically, based on predefined rule sets. For more information about defining rule sets, see.

9.2 Setting Up Content Categorizer

Before using Content Categorizer, you must install and configure the necessary software.

This section covers the following topics:

"Setting XML Conversion Method"
"Defining Field Properties (Optional)"
"Configuration Variable"

9.2.1 Setting XML Conversion Method

When operating in Interactive mode or Batch mode, the method that Content Categorizer uses to convert native documents into XML is set as a runtime configuration parameter.

To set the XML conversion method in Content Categorizer:

Log into the Content Server as the system administrator.
Click the Administration link.
Click the Content Categorizer Administration link (under Administration Pages for instance_name).

The Content Categorizer Admin Applet Page is displayed.
On the Configuration tab, select the sccXMLConversion property and click Edit, or double-click the property.

The Property Config Screen is displayed.
From the drop-down list, select the desired XML conversion method:
- Flexiondoc
- SearchML
Click OK.
Click Apply to save the changes, or click OK to save the changes and close the CC Admin Applet.

9.2.2 Defining Field Properties (Optional)

When any rule for a field succeeds, the found value is used (in either Batch Loader operations or Live Update operations). However, if Override is set to false for the field, the found value does not replace an existing value.

When all rules for a field fail, no value is assigned to the field. This is applicable unless a default value is defined for the field and Use Default is set to true.

Important:

The Content Server Batch Loader utility will fail any insert action that does not have a value for a required field.
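
The interaction between a found value, an existing value, Override, and Use Default can be sketched in a few lines of Python. This is an illustrative model only, not Content Categorizer code: the function name and parameters are hypothetical, and treating the default value as subject to the Override setting is an assumption that the text above does not confirm.

```python
from typing import Optional


def value_to_assign(found: Optional[str],
                    existing: Optional[str],
                    override: bool,
                    default: Optional[str],
                    use_default: bool) -> Optional[str]:
    """Return the value to write to a field, or None when nothing is assigned.

    Hypothetical model of the Override / Use Default field properties.
    """
    # A successful rule supplies the candidate; otherwise fall back to the
    # default value, but only when Use Default is set.
    candidate = found if found else (default if use_default else None)
    if not candidate:
        return None  # all rules failed and no usable default
    # Assumption: an existing value is kept unless Override is set.
    if existing and not override:
        return None
    return candidate


print(value_to_assign("Invoice", "Memo", override=True,
                      default=None, use_default=False))
```

A None result for a required field corresponds to the case flagged in the note above, where Batch Loader would fail the insert.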

To define field properties for the metadata fields in your system:

Open the Content Categorizer Admin Applet.
Click the Field Properties tab.
Select a metadata field to be edited and click Edit, or double-click the field.

The Field Properties Screen is displayed.
Enter a default value for the field.

The default value for an option list field must match one of the values available for that field.
Select the Override check box if you want the value returned by the categorization process to override an existing value for the field.
Select the Use Default check box if you want the field's default value to be used if all rules fail (or are not defined) when the categorization process runs.
Click OK.
Repeat steps 3 through 7 for each field to be edited.
Click Save Settings to save the changes.

9.2.3 Configuration Variable

The MaxQueryRows variable is a Content Server configuration variable and is used to specify the maximum number of documents that can be included in a single batch load process. As such, it affects how many documents a user will see in Batch Categorizer.

The default setting for this configuration variable is 200 but can be decreased or increased as necessary. Increasing the value will slow the response time for loading a large list of documents. Although the impact of setting MaxQueryRows to something in the range of 1000 to 2000 is minor, setting it in the area of 100,000 would probably produce an unacceptable performance level.

The format for this variable is as follows:

MaxQueryRows=2000

9.3 Search Rules

This section covers the following topics:

"Understanding Search Rules"
"Pattern Matching Search Rules"
"Abstract Search Rules"
"Option List Search Rule"
"Categorization Engine Search Rule"
"Filetype Search Rule"
"Defining Search Rules"

9.3.1 Understanding Search Rules

Search rules define how Content Categorizer determines metadata values to return to the Content Check In Form or Info Update Form (for Interactive mode) or the batch file (for Batch mode).

Every search rule is defined by:

A rule type, which determines the method that Content Categorizer uses to search the XML document.
A key, which defines the XML element, phrase, or keyword that Content Categorizer looks for in the document, or the categorization engine/taxonomy that Content Categorizer uses to classify the document.
A count, which is used to refine the search criteria.

Keys and counts are explained in more detail in the Help topics for each search rule type.

This section covers the following topics:

"Search Rule Types"
"Search Rule Guidelines"

9.3.1.1 Search Rule Types

Metadata values can be derived using these methods:

A Pattern Matching search rule looks for specific text or a specific XML element and returns an associated value.
An Abstract search rule looks for an XML element and returns a descriptive sentence or paragraph from that element.
An Option List search rule looks for keywords within the source document, applies a score for each keyword found, and returns the option list value that has the highest keyword score.
A Categorization Engine search rule uses a third-party categorization engine and a defined taxonomy to determine appropriate metadata values.
A Filetype search rule examines the filename extension of the primary file and returns a term associated with that filename extension.

9.3.1.2 Search Rule Guidelines

Search rules can be applied to any custom metadata field.
Search rules can be applied to the Title, Comments, and Type standard metadata fields. Search rules cannot be defined for any other standard metadata fields (such as Author, Security Group, and Account).
Multiple search rules can be defined for a metadata field. (For a single metadata field, however, multiple CATEGORY rules that refer to different taxonomies are not supported.)
Multiple search rules are run in the order specified, so that if a search rule does not result in a suggested value, the next rule is run. The list should be arranged from most to least specific.
Search rule types can be mixed within a metadata field. For example, you can define an Option List rule, a Pattern Matching rule, and an Abstract rule for the same metadata field.
If none of the search rules specified for a metadata field can be satisfied, the field is left blank.
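
The ordering guideline above amounts to a first-match-wins chain. The following Python sketch illustrates that behavior under simple assumptions: each rule is modeled as a function that returns a value or None, and the two sample rules (`invoice_label` and `first_word`) are hypothetical stand-ins, not actual Content Categorizer rule types.

```python
from typing import Callable, Optional, Sequence

# A rule takes the document text and returns a suggested value or None.
Rule = Callable[[str], Optional[str]]


def suggest_value(rules: Sequence[Rule], document: str) -> Optional[str]:
    """Run rules in order and return the first non-empty result."""
    for rule in rules:
        value = rule(document)
        if value:
            return value
    return None  # no rule was satisfied, so the field is left blank


def invoice_label(document: str) -> Optional[str]:
    """Hypothetical specific rule: the word following an 'Invoice:' label."""
    marker = "Invoice:"
    position = document.find(marker)
    if position == -1:
        return None
    rest = document[position + len(marker):].split()
    return rest[0] if rest else None


def first_word(document: str) -> Optional[str]:
    """Hypothetical generic rule: the first word of the document."""
    words = document.split()
    return words[0] if words else None


# The most specific rule is listed first, the most generic last.
print(suggest_value([invoice_label, first_word], "Memo Invoice: 1234"))
```

Placing the generic rule first would hide the specific one, which is why the list should run from most to least specific.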

9.3.2 Pattern Matching Search Rules

Pattern Matching search rules look for specific text or a specific XML element and return an associated value. For example, the Invoice # metadata field can be filled by the value that follows an Invoice: or Invoice Number: label in the source document, or it can be filled by the value that is within the <Invoice> tag in the XML document.

This section covers the following topics:

"Rule Types"
"Key"
"Count"
"Examples"

9.3.2.1 Rule Types

There are two general types of Pattern Matching rules: Tag Search and Text Search.

Tag Search searches for an XML element that exactly matches the key. If such an element is found, the text contained in the element is returned as the result.
Text Search searches for text that matches the key. If such text is found, the text near or following the key is returned as the result.

Tag Searches are case sensitive. Text Searches are not case sensitive.

Sub-Types

Within each of the two general types of Pattern Matching search rules, there are several sub-types. These sub-types are explained in more detail in the Examples section below.

Tag Search

TAG_TEXT
TAG_ALLTEXT

Text Search

TEXT_REMAINDER
TEXT_ALLREMAINDER
TEXT_FULL
TEXT_ALLFULL
TEXT_NEXT
TEXT_ALLNEXT

9.3.2.2 Key

The key for a Pattern Matching search rule is either an XML element (for a Tag Search) or a text phrase (for a Text Search).

9.3.2.3 Count

The count for a Pattern Matching search rule defines the number of tags or text phrases that must be matched before the rule returns results. For example, a count of 4 will look for the fourth occurrence of the key. If only three occurrences of the key are found in the document, the rule fails.

The default count of 1 returns the first occurrence of the key.
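
As a rough illustration of how the count selects an occurrence, the sketch below uses Python's standard xml.etree.ElementTree module on a small hypothetical document. The function name and the sample XML are assumptions made for this example, and the sketch covers only the Tag Search case.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def nth_element_text(xml_text: str, key: str, count: int = 1) -> Optional[str]:
    """Return the text of the count-th element named key, or None on failure."""
    if count < 1:
        raise ValueError("count must be 1 or greater")
    root = ET.fromstring(xml_text)
    # Element names are compared exactly, so the match is case sensitive.
    matches = list(root.iter(key))
    if len(matches) < count:
        return None  # fewer occurrences than the count: the rule fails
    return "".join(matches[count - 1].itertext())


# Hypothetical document with three occurrences of the key.
doc = ("<doc><Invoice>A-1</Invoice><Invoice>A-2</Invoice>"
       "<Invoice>A-3</Invoice></doc>")
print(nth_element_text(doc, "Invoice", 3))
```

With this document, a count of 4 returns None because only three occurrences exist.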

9.3.2.4 Examples

The following examples illustrate the use of the Pattern Matching search rules.

This section covers the following topics:

"TAG_TEXT"
"TAG_ALLTEXT"
"TEXT_REMAINDER"
"TEXT_ALLREMAINDER"
"TEXT_FULL"
"TEXT_ALLFULL"
"TEXT_NEXT"
"TEXT_ALLNEXT"

9.3.2.4.1 TAG_TEXT

TAG_TEXT searches for an XML element name that matches the key exactly (including case). If such an element is found, all text that belongs to the element is concatenated and returned as the result.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TAG_TEXT
Key: TAG_A
Returns: Title: The Big Wolf

9.3.2.4.2 TAG_ALLTEXT

TAG_ALLTEXT searches for an XML element name that matches the key exactly (including case). If such an element is found, all text that belongs to the element, and to all children of the element, is concatenated and returned as the result.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TAG_ALLTEXT
Key: TAG_A
Returns: Title: The Big Bad Wolf
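
The difference between TAG_TEXT and TAG_ALLTEXT can be reproduced with Python's standard xml.etree.ElementTree module. This is a sketch of the behavior described above, not the product's implementation: the `<doc>` wrapper element, the function names, and the whitespace normalization are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The sample content, wrapped in a hypothetical <doc> root so it parses.
SAMPLE = ("<doc><TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>"
          "<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C></doc>")


def squash(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def tag_text(xml_text: str, key: str) -> Optional[str]:
    """Text that belongs to the element itself, excluding its children."""
    element = next(ET.fromstring(xml_text).iter(key), None)
    if element is None:
        return None
    # Own text is the leading text plus the tail that follows each child.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return squash("".join(parts))


def tag_alltext(xml_text: str, key: str) -> Optional[str]:
    """Text of the element together with the text of all its children."""
    element = next(ET.fromstring(xml_text).iter(key), None)
    if element is None:
        return None
    return squash("".join(element.itertext()))


print(tag_text(SAMPLE, "TAG_A"))     # Title: The Big Wolf
print(tag_alltext(SAMPLE, "TAG_A"))  # Title: The Big Bad Wolf
```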

9.3.2.4.3 TEXT_REMAINDER

TEXT_REMAINDER searches for text that matches the key exactly (except for case). If such text is found, any text following the key that belongs to the same XML element is returned as the result.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_REMAINDER
Key: Title:
Returns: The Big Wolf

9.3.2.4.4 TEXT_ALLREMAINDER

TEXT_ALLREMAINDER searches for text that matches the key exactly (except for case). If such text is found, any text following the key that belongs to the same XML element, and to all children of the element, is returned as the result.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_ALLREMAINDER
Key: Title:
Returns: The Big Bad Wolf
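
The TEXT_REMAINDER and TEXT_ALLREMAINDER behavior can be approximated in Python as follows. This is a sketch under stated assumptions, not the product's algorithm: the `<doc>` wrapper, the function names, the `include_children` flag, and the choice to locate the key in the element's own text are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The sample content, wrapped in a hypothetical <doc> root so it parses.
SAMPLE = ("<doc><TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>"
          "<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C></doc>")


def squash(text: str) -> str:
    return " ".join(text.split())


def own_text(element: ET.Element) -> str:
    """Text of the element itself, excluding text inside child elements."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return squash("".join(parts))


def all_text(element: ET.Element) -> str:
    """Text of the element and of all its children."""
    return squash("".join(element.itertext()))


def text_remainder(xml_text: str, key: str,
                   include_children: bool = False) -> Optional[str]:
    """Return the text after the key; include child text for the ALL variant."""
    needle = key.lower()  # Text Searches are not case sensitive
    for element in ET.fromstring(xml_text).iter():
        if needle not in own_text(element).lower():
            continue
        text = all_text(element) if include_children else own_text(element)
        position = text.lower().find(needle)
        if position != -1:
            return text[position + len(key):].strip()
    return None


print(text_remainder(SAMPLE, "Title:"))                         # TEXT_REMAINDER
print(text_remainder(SAMPLE, "Title:", include_children=True))  # TEXT_ALLREMAINDER
```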

9.3.2.4.5 TEXT_FULL

TEXT_FULL searches for text that matches the key exactly (except for case). If such text is found, any text that belongs to the same XML element, including the key text, is returned as the result.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_FULL
Key: Title:
Returns: Title: The Big Wolf

9.3.2.4.6 TEXT_ALLFULL

TEXT_ALLFULL searches for text that matches the key exactly (except for case). If such text is found, any text that belongs to the same XML element, including the key text and any text belonging to children of the element, is returned as the result.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_ALLFULL
Key: Title:
Returns: Title: The Big Bad Wolf

9.3.2.4.7 TEXT_NEXT

TEXT_NEXT searches for text that matches the key exactly (except for case). If such text is found, any text that belongs to the next non-blank XML element is returned as the result. Blank elements and elements composed of non-printing characters will not be selected as the return value.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_NEXT
Key: Title:
Returns: Subtitle: A Play

9.3.2.4.8 TEXT_ALLNEXT

TEXT_ALLNEXT searches for text that matches the key exactly (except for case). If such text is found, any text that belongs to the next non-blank XML element, and to all children of the element, is returned as the result. Blank elements and elements composed of non-printing characters will not be selected as the return value.

Content: <TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>

<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C>
Rule: TEXT_ALLNEXT
Key: Title:
Returns: Subtitle: A Morality Play
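
A possible reading of TEXT_NEXT and TEXT_ALLNEXT is sketched below in Python. The text does not define "next element" precisely, so this sketch assumes it means the next sibling under the same parent, which is consistent with the examples above. The `<doc>` wrapper, function names, and `include_children` flag are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The sample content, wrapped in a hypothetical <doc> root so it parses.
SAMPLE = ("<doc><TAG_A>Title: The Big <TAG_B>Bad</TAG_B> Wolf</TAG_A>"
          "<TAG_C>Subtitle: A <TAG_D>Morality</TAG_D> Play</TAG_C></doc>")


def squash(text: str) -> str:
    return " ".join(text.split())


def own_text(element: ET.Element) -> str:
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return squash("".join(parts))


def all_text(element: ET.Element) -> str:
    return squash("".join(element.itertext()))


def text_next(xml_text: str, key: str,
              include_children: bool = False) -> Optional[str]:
    """Return text of the next non-blank sibling after the element with the key."""
    needle = key.lower()  # Text Searches are not case sensitive
    for parent in ET.fromstring(xml_text).iter():
        children = list(parent)
        for index, child in enumerate(children):
            if needle not in own_text(child).lower():
                continue
            # Assumption: "next element" means a following sibling.
            for sibling in children[index + 1:]:
                if all_text(sibling):  # skip blank elements
                    return (all_text(sibling) if include_children
                            else own_text(sibling))
            return None  # key found, but nothing non-blank follows it
    return None


print(text_next(SAMPLE, "Title:"))                         # TEXT_NEXT
print(text_next(SAMPLE, "Title:", include_children=True))  # TEXT_ALLNEXT
```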

9.3.3 Abstract Search Rules

Abstract search rules look for an XML element and return a descriptive sentence or paragraph from that element. For example, the Summary metadata field could be filled by a returned value of "Germany is a large country in size, culture, and worldwide economics. One of Germany's largest industries includes the manufacturing of world class automobiles like BMW, Mercedes, and Audi."

The Abstract rule type is useful where there is no readily identifiable or explicitly tagged block of text in the content item. Typically, these rules are used to suggest summary or topic information about the document.

This section covers the following topics:

"Rule Types"
"Key"
"Count"
"Examples"

9.3.3.1 Rule Types

There are two Abstract search rules: First Paragraph and First Sentence.

First Paragraph searches for an XML element that exactly matches the key. The entire paragraph of the first such element that meets the size criteria (specified by the count) is returned as the result.
First Sentence searches for an XML element that exactly matches the key. If such an element is found, the first sentence of the element is returned as the result.

9.3.3.2 Key

The key for an Abstract search rule is an XML element.

9.3.3.3 Count

The count is interpreted differently for the First Paragraph and First Sentence search rules.

First Paragraph

For a First Paragraph search rule, the count is a size threshold measured in percent:

The rule searches the document for all paragraphs that match the key.
The rule calculates the average size (based on character count) of the paragraphs that match the key.
The rule multiplies the average size by the count percentage (0 = 0%, 100 = 100%).
The rule looks for the first paragraph larger than the resulting number.

For example, if the count is set to 75 and the average paragraph size is 100 characters, the rule returns the first paragraph larger than 75 characters that matches the key.

If the count is set to the default of 1, the rule is likely to return the first paragraph that matches the key.
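
The threshold calculation for First Paragraph can be written out as a short Python sketch. It assumes that paragraph size is the plain character count given by `len()`; how the product measures size internally (for example, whether whitespace is counted) is not stated here, so results on real documents may differ. The function name and the sample paragraphs are hypothetical.

```python
from typing import Optional, Sequence


def first_paragraph(paragraphs: Sequence[str], count: int = 1) -> Optional[str]:
    """Return the first paragraph larger than count percent of the average size."""
    if not paragraphs:
        return None
    # Average size of the matching paragraphs, by character count (assumption).
    average = sum(len(p) for p in paragraphs) / len(paragraphs)
    # The count is a percentage of that average.
    threshold = average * count / 100
    for paragraph in paragraphs:
        if len(paragraph) > threshold:
            return paragraph
    return None


# Hypothetical paragraphs of 40, 80, and 180 characters (average 100).
paragraphs = ["x" * 40, "y" * 80, "z" * 180]
print(len(first_paragraph(paragraphs, 75)))  # 80
print(len(first_paragraph(paragraphs)))      # 40
```

With an average of 100 characters and a count of 75, the 40-character paragraph is skipped and the 80-character paragraph is returned. With the default count of 1, the threshold is so low that the first paragraph is returned.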

First Sentence

For a First Sentence search rule, the count is the number of elements that have their first sentences returned.

For example, if the count is set to 3, the rule returns the first sentence from each of the first three elements that match the key.

9.3.3.4 Examples

The following examples illustrate the use of the Abstract search rules:

"FIRST_PARAGRAPH"
"FIRST_SENTENCE"

9.3.3.4.1 FIRST_PARAGRAPH

This example returns the first <Text> element that exceeds one-half the average <Text> element paragraph size. Note that the <Title> element does not match the key value, so it is ignored for both the search and for the average length calculation.

Content: <Title>Poem</Title>

<Text>Mary had</Text>

<Text>a little Lamb</Text>

<Text>The fleece was white as snow</Text>

<Text>And everywhere that Mary went the lamb was sure to go</Text>
Rule: FIRST_PARAGRAPH
Key: Text
Count: 50
Returns: The fleece was white as snow.

9.3.3.4.2 FIRST_SENTENCE

This example returns the first sentence of the first two <Text> elements. Note that the <Title> element does not match the key value, so it is excluded from the search.

Content: <Title>Barefoot in the Park</Title>

<Text>See Dick run. See Jane run. See Dick and Jane.</Text>

<Text>See Spot run. See Puff chase Spot.</Text>

<Text>See Dick chase Spot and Puff.</Text>
Rule: FIRST_SENTENCE
Key: Text
Count: 2
Returns: See Dick run. See Spot run.
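
The FIRST_SENTENCE example above can be reproduced with a short Python sketch. The sentence boundary used here (the first period, exclamation point, or question mark followed by whitespace or the end of the text) is an assumption, as are the `<doc>` wrapper and the function name. The product's actual sentence detection is not described in this section.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# The sample content, wrapped in a hypothetical <doc> root so it parses.
SAMPLE = ("<doc><Title>Barefoot in the Park</Title>"
          "<Text>See Dick run. See Jane run. See Dick and Jane.</Text>"
          "<Text>See Spot run. See Puff chase Spot.</Text>"
          "<Text>See Dick chase Spot and Puff.</Text></doc>")


def first_sentences(xml_text: str, key: str, count: int = 1) -> Optional[str]:
    """Join the first sentence of each of the first count elements named key."""
    root = ET.fromstring(xml_text)
    sentences = []
    for element in list(root.iter(key))[:count]:
        text = "".join(element.itertext()).strip()
        # Assumed sentence boundary: terminal punctuation before space or end.
        match = re.match(r".*?[.!?](?=\s|$)", text, flags=re.DOTALL)
        sentences.append(match.group(0) if match else text)
    return " ".join(s for s in sentences if s) or None


print(first_sentences(SAMPLE, "Text", 2))  # See Dick run. See Spot run.
```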

9.3.4 Option List Search Rule

The Option List search rule, named OPTION_LIST, looks for keywords within the source document, applies a score for each keyword found, and returns the option list value that has the highest keyword score. For example, if the keywords margin, SEC filing, or invoice were found in a document, the suggested value for the Department field would be Accounting, while the keywords tolerance, assembly, or inventory would return Manufacturing as the suggested value.

The Option List search rule will usually be applied to metadata fields that have an option list defined in the Configuration Manager. See the Content Server online help for information on creating option lists for custom metadata fields.
Option list names and values (called categories in Content Categorizer) appear in Content Categorizer as specified in the Configuration Manager. If you create or change a custom option list field while the CC Admin Applet is open, you will need to close and reopen the applet to see the changes.
The current version of Content Server automatically inserts a blank value as the default value in a custom option list field. In this case, the first value (by default, a blank value) will not be considered a user-entered value, and the Option List search rule will be applied. If you do not want the Option List search rule to override the first value in a custom option list field, you must provide a default value for that option list on the Configuration Manager Applet.

This section covers the following topics:

"Rule Types"
"Key"
"Count"
"Examples"

9.3.4.1 Rule Types

There is one type of Option List search rule, which searches for keywords (single words or phrases) that exactly match the keywords defined in the key.

Keywords may be single words (for example, dog) or multiple-word phrases (for example, black dog).
Keywords may use the following defined set of operators to further refine a search:
- $$AND$$
- $$OR$$
- $$AND_NOT$$
- $$NEAR$$
Keywords are pre-assigned to each category (value) in the option list, and each keyword has a weight assigned to it. See "Defining Option List Keywords".
The number of occurrences of each keyword found in the document is multiplied by its weight, resulting in a keyword score.
The keyword scores for each category are added together, resulting in a category score.
The category with the highest score is returned as the suggested value.
If there is a tie between categories, the category earliest in the option list is returned as the suggested value.
The weights Always and Never can be used to override the scores and count threshold.
- An occurrence of a keyword with the Always weight forces the category to be returned as the suggested value, regardless of score.
- An occurrence of a keyword with the Never weight disqualifies the category from being returned as the suggested value, regardless of score.
- If two categories have keywords that are assigned the Always weight, and both keywords occur in the document, the keyword first found in the document takes precedence.
  
  Important:
  Option List searches are case sensitive and must match exactly. For example, Invoice, Invoices, invoice, and invoices must be defined to retrieve all instances of this keyword.

9.3.4.2 Key

The key for an Option List search rule is the Option List name, as shown on the Option Lists tab of the CC Admin Applet.

9.3.4.3 Count

The count for an Option List search rule sets a minimum threshold score for the rule to return results. For example, if the count is set to 50, and the highest accumulated keyword score is 45, the rule fails.
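
The scoring steps and count threshold described above can be modeled with a short Python sketch. It is a simplified illustration, not the product's implementation: the function name and data layout are hypothetical, keywords are matched as whole words (an assumption based on the exact-match note), a score equal to the count is treated as passing (an assumption), and the Always and Never weights and the keyword operators are omitted.

```python
import re
from typing import Dict, Optional


def option_list_value(text: str,
                      categories: Dict[str, Dict[str, int]],
                      count: int = 0) -> Optional[str]:
    """Return the highest-scoring category, or None if the rule fails."""
    best_name, best_score = None, 0
    # Categories are visited in option list order, so the strict comparison
    # below lets the earliest category win a tie.
    for name, keywords in categories.items():
        score = 0
        for keyword, weight in keywords.items():
            # Case-sensitive, whole-word occurrences of the keyword.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            score += len(re.findall(pattern, text)) * weight
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < count:
        return None  # below the minimum threshold: the rule fails
    return best_name


TEXT = ("Barefoot in the Park See Dick run. See Jane run. See Dick and Jane. "
        "See Spot run. See Puff chase Spot. See Dick chase Spot and Puff.")
CATEGORIES = {
    "Dick": {"Dick": 10, "boy": 5, "Richard": 2},
    "Jane": {"Jane": 10, "girl": 5, "Janie": 2},
    "Spot": {"Spot": 10, "dog": 5},
    "Puff": {"Puff": 10, "cat": 5},
}
print(option_list_value(TEXT, CATEGORIES, count=10))  # Dick
```

Here Dick and Spot both score 30, and Dick is returned because it appears earlier in the option list. Raising the count to 50 makes the rule fail.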

9.3.4.4 Examples

This section provides examples about the Option List search rule.

Example 1

In this example, Dick and Spot each score 30 (3 occurrences x 10), and Jane and Puff each score 20 (2 occurrences x 10). Dick is returned as the suggested value because it is earlier in the option list than Spot:

Content: <Title>Barefoot in the Park</Title>

<Text>See Dick run. See Jane run. See Dick and Jane.</Text>

<Text>See Spot run. See Puff chase Spot.</Text>

<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacter
Count: 10
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2

Jane: Jane=10, girl=5, Janie=2

Spot: Spot=10, dog=5

Puff: Puff=10, cat=5
Returns: Dick

Example 2

In this example, Spot is returned as the suggested value because its score of 60 (3 occurrences x 20) is higher than the other categories:

Content: <Title>Barefoot in the Park</Title>

<Text>See Dick run. See Jane run. See Dick and Jane.</Text>

<Text>See Spot run. See Puff chase Spot.</Text>

<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacter
Count: 10
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2

Jane: Jane=10, girl=5, Janie=2

Spot: Spot=20, dog=10

Puff: Puff=10, cat=5
Returns: Spot

Example 3

In this example, the rule fails because none of the scores is above the Count threshold of 50:

Content: <Title>Barefoot in the Park</Title>

<Text>See Dick run. See Jane run. See Dick and Jane.</Text>

<Text>See Spot run. See Puff chase Spot.</Text>

<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacter
Count: 50
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2

Jane: Jane=10, girl=5, Janie=2

Spot: Spot=10, dog=5

Puff: Puff=10, cat=5
Returns: Fail

Example 4

In this example, Puff is returned as the suggested value because the keyword "Puff" has a weight of Always:

Content: <Title>Barefoot in the Park</Title>

<Text>See Dick run. See Jane run. See Dick and Jane.</Text>

<Text>See Spot run. See Puff chase Spot.</Text>

<Text>See Dick chase Spot and Puff.</Text>
Rule: OPTION_LIST
Key: MainCharacter
Count: 10
Option List Categories, Keywords, and Weight: Dick: Dick=10, boy=5, Richard=2

Jane: Jane=10, girl=5, Janie=2

Spot: Spot=10, dog=5

Puff: Puff=Always, cat=5
Returns: Puff

9.3.5 Categorization Engine Search Rule

The Categorization Engine search rule, named CATEGORY, uses a 3rd-party categorizer engine and defined taxonomy to determine and return a value that represents a category within the specified taxonomy, for example, News/Technology/Computers.

This section covers the following topics:

"Rule Types"
"Key"
"Count"

9.3.5.1 Rule Types

There is one type of Categorization Engine search rule, which uses the categorizer engine and taxonomy specified in the Key to return a value for the field.

9.3.5.2 Key

The key for a Categorization Engine search rule is the name of the categorizer engine followed by the name of the taxonomy. For example, EngineName/TaxonomyName.

If you do not specify an engine name in the Key field, Content Categorizer defaults to the first engine displayed in the Categorizer Engines list. Therefore, if you have defined only one engine, you would only need to enter the taxonomy name in the Key field.
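
The defaulting behavior described above can be sketched as follows. The function name `resolve_category_key` and the `engines` list are hypothetical, and splitting on the first slash is an assumption about how the key is parsed.

```python
def resolve_category_key(key, engines):
    """Split a CATEGORY rule key into (engine, taxonomy).

    engines is the ordered Categorizer Engines list; its first entry is
    the default when the key names no engine.
    """
    engine, sep, taxonomy = key.partition("/")
    if not sep:
        # No engine in the key: the whole key is the taxonomy name.
        if not engines:
            raise ValueError("no categorizer engine is defined")
        return engines[0], key
    return engine, taxonomy
```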

9.3.5.3 Count

The count for a Categorization Engine search rule sets a minimum confidence level threshold for the returned results.

When a categorization engine returns a category (or set of categories) for a given query, a confidence level is also returned, which is often expressed as a percentage for each category. The CATEGORY rule always accepts the highest-confidence category, unless the confidence level is below the count value specified for the rule, in which case the rule fails. For example, if the count is set to 50, and the highest confidence level returned is 45, the rule fails.

The default count of 1 would always accept the highest-confidence category returned by the categorizer engine.

The actual range for the Count value depends on the categorizer engine that is being used.
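
As a rough illustration of the threshold check, the following sketch accepts the highest-confidence category only when its confidence is not below the count. The function name `accept_category` and the list-of-pairs result format are assumptions; real engines return results in their own formats and ranges.

```python
def accept_category(results, count=1):
    """Return the accepted category, or None if the rule fails.

    results is a list of (category, confidence) pairs from the engine.
    """
    if not results:
        return None
    category, confidence = max(results, key=lambda r: r[1])
    # The rule fails only when the best confidence is below the count.
    return category if confidence >= count else None
```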

9.3.6 Filetype Search Rule

The Filetype search rule, named FILETYPE, looks at the filename extension of a document and returns a term, usually a file type description associated with the filename extension.

This section covers the following topics:

"Rule Types"
"Key"
"Count"
"Examples"

9.3.6.1 Rule Types

There is one type of Filetype search rule, which uses the filename extension of the primary (native) file to return a value for the field.

When the Filetype search rule is defined for a metadata field, the filename extension of the content item is matched against all values in the Content Server's DocFormatsWizard table. This table is found in the file doc_config.htm, which is located in the IntradocDir/shared/config/resources/ directory.

If a match is found, the associated value in the Description column is extracted and translated. The resulting string is returned as the suggested metadata value for the field.

If the primary file path has no extension, or if the extension does not match any of the "extensions" values in the DocFormatsWizard table, the rule fails and the next rule in the list for the metadata field is executed.
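
The lookup can be sketched as follows, using three rows from the DocFormatsWizard table in "Sample doc_config.htm Page". This is an illustration only: the `TRANSLATIONS` dictionary is a hypothetical stand-in for Content Server's string translation (its two entries are taken from the examples in this section), and lowercasing the extension before matching is an assumption.

```python
import os

# A few rows from the DocFormatsWizard table: (extensions, description key).
DOC_FORMATS = [
    ("wpd", "apWordPerfectDesc"),
    ("doc, dot", "apMicrosoftWordDesc"),
    ("xls", "apMicrosoftExcelDesc"),
]

# Hypothetical stand-in for Content Server's string translation.
TRANSLATIONS = {
    "apWordPerfectDesc": "Corel WordPerfect Document",
    "apMicrosoftWordDesc": "Microsoft Word Document",
}

def filetype_rule(primary_file):
    """Return the translated description, or None if the rule fails."""
    ext = os.path.splitext(primary_file)[1].lstrip(".").lower()
    if not ext:
        return None  # no extension: the rule fails
    for extensions, description_key in DOC_FORMATS:
        if ext in [e.strip() for e in extensions.split(",")]:
            # Fall back to the raw key when no translation is known.
            return TRANSLATIONS.get(description_key, description_key)
    return None  # unknown extension: the rule fails
```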

9.3.6.2 Key

The key for a FILETYPE search rule is not used when determining a metadata value. The Key field should be left blank.

9.3.6.3 Count

The count for a FILETYPE search rule is not used when determining a metadata value. The Count field should be left blank.

If a FILETYPE rule is created with non-blank Key or Count fields, a warning message is displayed indicating that these fields are not supported by the rule.

9.3.6.4 Examples

This section provides examples of the Filetype search rule.

Example 1

Primary File: policies.doc
Rule: FILETYPE
Key: blank
Count: blank
Returns: Microsoft Word Document

Example 2

Primary File: procedures.wpd
Rule: FILETYPE
Key: blank
Count: blank
Returns: Corel WordPerfect Document

9.3.7 Defining Search Rules

This section covers the following topics:

"Defining Search Rules"
"Defining Option List Keywords"
"Applying Rules to the Type Field"

9.3.7.1 Defining Search Rules

During Content Server startup, Content Categorizer takes a snapshot of the current metadata field configuration including field names and lengths. If your metadata field configuration changes, you must restart Content Server before running the Content Categorizer Admin Applet to add or modify any search rules.

To define search rules for any metadata field:

Log into the Content Server as the system administrator.
Click the Administration link.
Click the Content Categorizer Administration link (under Administration Pages for instance_name).

The Content Categorizer Admin Applet Page is displayed.
Click the Rule Sets tab.
Click on the Ruleset drop-down list and select the desired ruleset, or click Add to add a new ruleset.
Select a metadata field from the Field choice list.
Click Add.

The Add/Edit Rule for Field Screen is displayed.
Select the rule type from the Rule choice list.
Enter the search rule key in the Key field. For an OPTION_LIST search rule, keywords for the option list must be defined on the Option List tab. See "Defining Option List Keywords".
Enter the count in the Count field.
Click OK.
Add search rules to each metadata field as desired.
- To delete a rule, select the rule in the Rules List and click Delete.
- To edit a rule, select the rule in the Rules List and click Edit.
- To adjust the order of the rules, select the rule in the Rules List and click Move Up or Move Down. Rules are applied in the order listed. If the first rule succeeds, no other rules are applied. If the first rule fails, then the next rule is applied, and so forth.
  
  Important:
  If you have added, edited, or deleted a CATEGORY rule, a dialog will prompt you to apply the changes and build, rebuild, or check for orphaned query trees for this rule on the Query Trees tab.
Click Apply to save the changes, or click OK to save the changes and close the CC Admin Applet screen.
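
The rule ordering described in this procedure (rules are applied in the order listed, and the first rule that succeeds ends the search) can be sketched as follows. Modeling each rule as a callable that returns `None` on failure is an assumption made for illustration.

```python
def apply_rules(rules, document):
    """Run rules in list order and return the first successful value.

    Each rule is a callable that returns a suggested value, or None
    when it fails.
    """
    for rule in rules:
        value = rule(document)
        if value is not None:
            return value  # later rules are not applied
    return None  # every rule failed
```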

9.3.7.2 Defining Option List Keywords

To define the keywords and weights for an option list:

Log into the Content Server as the system administrator.
Click the Administration link.
Click the Content Categorizer Administration link (under Administration Pages for instance_name).

The Content Categorizer Admin Applet Page is displayed.
Click the Option Lists tab.
Select an option list from the Option List choice list.

Caution:
When an option list metadata field is deleted from the Configuration Manager, the field is removed from the Rule Sets tab, but it still appears in the Option List choice list on the Option Lists tab. Be careful not to select an obsolete option list.
Select a value from the Category choice list.
Enter a keyword or phrase in the Keyword field. Option List searches are case sensitive and must match exactly.
- Keywords may be single words or multiple-word phrases.
- Keywords may include Boolean-type expressions, where the following set of binary operators are valid: $$AND$$, $$OR$$, $$AND_NOT$$, $$NEAR$$
Select a weight for the keyword.
- Always = If the keyword is found, the selected category will be returned as the suggested value, regardless of the score.
- Weight = This number multiplied by the number of occurrences of the keyword is the category's score. The category with the highest score is returned as the suggested value for the option list field.
- Never = If the keyword is found, the selected category will not be returned as the suggested value, regardless of the score.
Click Add.
Enter keywords for each category in the selected option list.
- To delete a keyword, select the keyword in the Keywords list and click Delete.
- To edit a keyword, select the keyword in the Keywords list, click Edit, edit the keyword and/or the weight, and click Update.
Click Apply to save the changes, or click OK to save the changes and close the CC Admin Applet screen.

9.3.7.3 Applying Rules to the Type Field

You can edit the content server configuration file so that Content Categorizer ignores the Type default value and applies search rules to the Type field.

This procedure applies only to the Type (dDocType) field. Search rules cannot be applied to the other standard option list fields (Security Group, Author, and Account).

To apply search rules to the Type field:

Open the config.cfg file located in the IntradocDir/config/ directory in a text-only editor such as WordPad.
Add the following line to the file:
```
ForceDocTypeChoice=true
```
Save and close the file.
Stop and restart the Content Server.
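
If the edit needs to be scripted, a minimal sketch such as the following appends the setting only when it is not already present. The helper name `ensure_setting` is hypothetical, and the sketch deliberately leaves an existing line alone even if it carries a different value, so check the file by hand in that case.

```python
from pathlib import Path

def ensure_setting(config_path, name, value):
    """Append name=value to the file unless name is already set.

    Returns True if the file was changed.
    """
    path = Path(config_path)
    lines = path.read_text().splitlines()
    for line in lines:
        if line.strip().startswith(name + "="):
            return False  # already set; leave the file untouched
    lines.append(f"{name}={value}")
    path.write_text("\n".join(lines) + "\n")
    return True
```

For example, `ensure_setting("config.cfg", "ForceDocTypeChoice", "true")` adds the line on the first call and does nothing on later calls.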

9.4 Sample doc_config.htm Page

The following is a sample doc_config.htm page.

<@table DocFormatsWizard@>

dFormat	Extensions	dConversion	dDescription
application/corel-wordperfect, application/wordperfect	wpd	WordPerfect	apWordPerfectDesc
application/vnd.framemaker	fm	FrameMaker	apFramemakerDesc
application/vnd.framebook	bk, book	FrameMaker	apFrameMakerDesc
application/vnd.mif	mif	FrameMaker	apFrameMakerInterchangeDesc
application/lotus-1-2-3	123, wk3, wk4	123	apLotus123Desc
application/lotus-freelance	prz	Freelance	apLotusFreelanceDesc
application/lotus-wordpro	lwp	WordPro	apLotusWordProDesc
application/msword, application/ms-word	doc, dot	Word	apMicrosoftWordDesc
application/vnd.ms-excel, application/ms-excel	xls	Excel	apMicrosoftExcelDesc
application/vnd.ms-powerpoint, application/ms-powerpoint	ppt	PowerPoint	apMicrosoftPowerPointDesc
application/vnd.ms-project, application/ms-project	mpp	MSProject	apMicrosoftProjectDesc
application/ms-publisher	pub	MSPub	apMicrosoftPublisherDesc
application/write	wri	Word	apMicrosoftWriteDesc
application/rtf	rtf	Word	apRtfDesc
application/vnd.visio	vsd	Visio	apVisioDesc
application/vnd.illustrator	ai	Illustrator	apIllustratorDesc
application/vnd.photoshop	psd	PhotoShop	apPhotoshopDesc
application/vnd.pagemaker	p65	PageMaker	apPageMakerDesc
image/gif	drw, igx, flo, abc, igt	iGrafx	apiGrafxDesc
text/postscript	ps	Distiller	apDistillerDesc
application/hangul	hwp	Hangul97	apHangul97Desc
application/ichitaro	jtd, jtt	Ichitaro	apIchitaroDesc
image/graphic	gif, jpeg, jpg, png, bmp, tiff, tif	ImageThumbnail	apThumbnailsDesc
image/application	txt, eml, msg	NativeThumbnail	apNativeThumbnailsDesc

<@end@>
<@table PdfConversions@>

dFormat	Extensions	dConversion	dDescription
application/pdf	pdf	PDFOptimization	apPdfOptimization
application/pdf	pdf	ImageThumbnail	apPdfThumbnailsDesc

<@end@>

9.5 XSLT Transformation

Content Server uses a two-step process for categorizing content. The first step translates content into an XML format, and the second step transforms that XML file into another XML file useful to Content Categorizer. The process is transparent in that the original content is not modified, and both the translated and transformed XML files are discarded after use.

This section covers the following topics:

"Translation"
"Transformation Using XSLT Stylesheets"
"SearchML Transformation"
"Flexiondoc Transformation"
"Example Files"

9.5.1 Translation

The translation step uses the OutsideIn XML Export filters to output the XML in either SearchML or Flexiondoc XML format, depending on the type of content being translated and whether the format is available for the platform being used. This translation process enables Categorizer to support a large number of different source document formats.

The transformation step uses eXtensible Stylesheet Language Transformations (XSLT) to transform the initial XML output into an XML equivalent that can be easily searched and analyzed by Content Categorizer, based on search rules defined by the user.

An overview of the transformation process may be useful to anyone interested in the categorization process, and serve as a starting point for users who would like to define their own XSLT stylesheets to accommodate their specific document processing needs.

Translation Using OutsideIn XML Export Filters

A runtime version of the OutsideIn XML Export product is integrated and installed with Content Server, and it filters content checked in for categorization. The Export filters convert content to XML for transformation using Categorizer's XSLT stylesheets. The transformation is necessary because the Export XML schemas, Flexiondoc and SearchML, are not in a form easily searched by Content Categorizer rules.

9.5.2 Transformation Using XSLT Stylesheets

Two stylesheets are included with Content Categorizer and applied based on the initial translation format provided by the OutsideIn XML Export filter. The stylesheets are located in the following directory.

/<cs_root>/data/contentcategorizer/stylesheets/

For content items output in SearchML, searchml_to_scc.xsl is applied. For content items output in Flexiondoc, flexiondoc_to_scc.xsl is applied. SearchML and Flexiondoc both reproduce style designations found in the source content, but they do so differently, in ways not detectable by Content Categorizer rules. The appropriate stylesheet can recognize the necessary style information in each format and use that information as the basis for transforming the final output tags into an XML document useful to Content Categorizer.

The similarity between SearchML and Flexiondoc depends on the degree to which internal styles or metadata are used in the content. When working with content that uses named styles, such as Microsoft Word documents, the resulting output is similar. When working with content in formats such as PDF or text, the results come out with more generic tagging.

Important:

There is a problem with the XSLT transformation used to post-process PDF content that is output in Flexiondoc format. When Flexiondoc is used, single words are assigned to individual XML elements, making the final XML unsuitable for most Categorizer search rules. It is therefore recommended that you use SearchML for categorizing PDF content.

9.5.3 SearchML Transformation

When the OutsideIn XML Export filter translates content into SearchML XML format, it identifies the properties of the content item, such as title, subject, and author, and tags them as a <doc_property> element. It distinguishes the properties by a type attribute. It also identifies document text and tags it as a <p> element. It distinguishes styles within text by an s attribute.

Document Properties and Text Style Examples

For example, using the Wellington_WordStyle.doc example found in the IntradocDir/custom/ContentCategorizer/CC_Sample/ directory, the file's author property, "Duke of Wellington," is tagged in the SearchML XML output as:

<doc_property type="author">Duke of Wellington</doc_property>

The first paragraph of the item, listing the date, would be tagged as:

<p>Date: August 24, 1812</p>

Note that no style attribute is defined.

Applying the searchml_to_scc.xsl stylesheet to the translated XML file searches the XML for all <doc_property> tags and uses the type attribute as the suffix for the transformed output tag used as a key in a Content Categorizer rule.

For example, the following code in the searchml_to_scc.xsl stylesheet would take the tag:

<doc_property type="author">Duke of Wellington</doc_property>

and output

<scc_author>Duke of Wellington</scc_author>:

<xsl:template match="sml:doc_property[@type]">
    <xsl:variable name="typeValue">
        <xsl:value-of select="@type"/>
    </xsl:variable>
    <xsl:element name="scc_{translate($typeValue, $translateFrom, $translateTo)}">
        <xsl:value-of select="."/>
    </xsl:element>
</xsl:template>

Similarly, the searchml_to_scc.xsl stylesheet also causes the XML file to be searched for all <p> tags and uses the s attribute as the suffix for the transformed output tag used as a key in a Content Categorizer rule. Where no style attribute is defined, the transformation passes the <p> tag through.
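
To make the tag renaming concrete, the following Python sketch mimics the effect of the transformation on a simplified input. It is not the stylesheet itself and rests on several assumptions: the input carries no XML namespace, the `scc` wrapper element is hypothetical, and spaces in attribute values are mapped to underscores because the actual `$translateFrom` and `$translateTo` values are not shown in this section.

```python
import xml.etree.ElementTree as ET

def to_scc_name(value):
    # Assumption: spaces become underscores; the stylesheet's actual
    # $translateFrom/$translateTo values are not shown in this section.
    return "scc_" + value.replace(" ", "_")

def transform_searchml(root):
    """Return a new <scc> element holding the renamed children of root."""
    out = ET.Element("scc")  # hypothetical wrapper element
    for node in root.iter():
        if node.tag == "doc_property" and node.get("type"):
            tag = to_scc_name(node.get("type"))
        elif node.tag == "p" and node.get("s"):
            tag = to_scc_name(node.get("s"))
        elif node.tag == "p":
            tag = "p"  # no style attribute: pass the paragraph through
        else:
            continue
        ET.SubElement(out, tag).text = "".join(node.itertext())
    return out
```

Applied to the two Wellington fragments shown earlier, the sketch produces an `scc_author` element for the author property and leaves the unstyled date paragraph as a `p` element.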

9.5.4 Flexiondoc Transformation

When the OutsideIn XML Export filter translates content into Flexiondoc XML format, it identifies the properties of the content item, such as title, subject, and author, and tags them as a <doc_property> element, just like SearchML. However, it distinguishes the properties by a name attribute, instead of type.

This section covers the following topics:

"Document Properties Example"
"Text Style Example"

9.5.4.1 Document Properties Example

<doc_property name="author">Duke of Wellington</doc_property>

Applying the flexiondoc_to_scc.xsl stylesheet to the translated XML file searches the XML for all <doc_property> tags and uses the name attribute as the suffix for the transformed output tag used as a key in a Content Categorizer rule.

For example, the following code in the flexiondoc_to_scc.xsl stylesheet would take the tag,

<doc_property name="author">Duke of Wellington</doc_property>

and output

<scc_author>Duke of Wellington</scc_author>:

<xsl:template match="fld:doc_property">
  <xsl:variable name="propName">
    <xsl:choose>
      <xsl:when test="@name">
        <xsl:value-of select="@name" />
      </xsl:when>
      <xsl:when test="@user_defined_name">
        <xsl:value-of select="@user_defined_name" />
      </xsl:when>
      <xsl:otherwise>NAMELESS_DOC_PROPERTY_WITH_ID_<xsl:value-of select="@id" /></xsl:otherwise>
    </xsl:choose>
  </xsl:variable>
  <xsl:element name="scc_{translate($propName, $translateFrom, $translateTo)}">
    <xsl:value-of select="." />
  </xsl:element>
</xsl:template>

9.5.4.2 Text Style Example

Where Flexiondoc differs from SearchML is in how it identifies styles. Paragraph styles are tagged with <tx.p> tags, and character styles are tagged with <tx.r> tags, but each has an attribute based on a unique style id, in addition to a name attribute.

All styles are defined in child elements of the <style_tables> element of the Flexiondoc XML file, and given an id attribute, which is called when referencing the style, and which the template file uses to define a style key with a name attribute.

For example, in the Flexiondoc XML output of the Wellington_WordStyle.doc example, character styles are defined in the <tx.char_style_table> child of the <style_tables> parent element. Notice the id attribute:

<tx.char_style_table>
    <tx.char_style id="ID16d" auto_kern_above="0.1111in" auto_kerning="false" back_brush="ID168" font="ID16f" kerning="0in" text_brush="ID16e" text_effect="normal" text_hidden="false" text_position="normal" text_protected="false" text_strikethrough="none" underline="ID170"/>
    <tx.char_style id="ID178" back_brush="ID176" font="ID177" text_brush="ID176"/>
    <tx.char_style id="ID187" font="ID186"/>
    <tx.char_style id="ID1d6" font="ID1d5"/>
    <tx.char_style id="ID1e5" font="ID1e4"/>
    <tx.char_style id="ID1e8" name="Default Paragraph Font" predefined="default"/>
    <tx.char_style id="ID1ec" font="ID1eb"/>
</tx.char_style_table>

When the flexiondoc_to_scc.xsl stylesheet is applied, it declares an <xsl:key> that indexes all character styles, <tx.char_style>, by their id attribute. For each <tx.r> element, the template uses the element's style attribute to look up the matching style and read its name attribute. If the style has a name, the name becomes the suffix of the output tag; otherwise, the text is passed through without a tag:

<xsl:key name="charStyleKey" match="fld:tx.char_style" use="@id" />
<xsl:template match="fld:tx.r[@style]">
  <xsl:variable name="charStyleName">
    <xsl:value-of select="key('charStyleKey', @style)/@name" />
  </xsl:variable>
  <xsl:choose>
    <xsl:when test="string-length($charStyleName) &gt; 0">
      <xsl:element name="scc_{translate($charStyleName, $translateFrom, $translateTo)}">
        <xsl:apply-templates />
      </xsl:element>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="." />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

Similarly, the stylesheet declares an <xsl:key> that indexes all paragraph styles, <tx.para_style>, by their id attribute. For each <tx.p> element, the template looks up the referenced style and uses its name attribute as the suffix of the output tag. Where the style has no name, the paragraph is output as a plain <p> element:

<xsl:key name="paraStyleKey" match="fld:tx.para_style" use="@id" />
<xsl:template match="fld:tx.p[@style]">
  <xsl:variable name="styleValue">
    <xsl:value-of select="@style" />
  </xsl:variable>
  <xsl:variable name="paraStyleName">
    <xsl:value-of select="key('paraStyleKey', $styleValue)/@name" />
  </xsl:variable>
  <xsl:choose>
    <xsl:when test="string-length($paraStyleName) &gt; 0">
      <xsl:element name="scc_{translate($paraStyleName, $translateFrom, $translateTo)}">
        <xsl:apply-templates />
      </xsl:element>
    </xsl:when>
    <xsl:otherwise>
      <xsl:element name="p">
        <xsl:value-of select="." />
      </xsl:element>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
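
The id-to-name lookup that the keys perform can be illustrated with a short Python sketch. It is not the stylesheet's implementation: the function names are hypothetical, the input is assumed to carry no XML namespace, and spaces in style names are mapped to underscores because the actual `$translateFrom` and `$translateTo` values are not shown in this section.

```python
import xml.etree.ElementTree as ET

def style_names(root, style_tag):
    """Map style id -> name for every style_tag element, like an xsl:key."""
    return {el.get("id"): el.get("name", "") for el in root.iter(style_tag)}

def output_tag_for_run(run, char_styles):
    """Return the scc_ tag for a <tx.r> element, or None for plain text."""
    name = char_styles.get(run.get("style"), "")
    if not name:
        return None  # unnamed style: the text is emitted without a tag
    # Assumption: spaces become underscores (translate values not shown).
    return "scc_" + name.replace(" ", "_")
```

With the character style table shown earlier, a run that references ID1e8 resolves to the name "Default Paragraph Font", while a run that references ID1ec has no name and is emitted as plain text.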

9.5.5 Example Files

Sample files for more detailed study are located in the following directory:

IntradocDir/data/components/ContentCategorizer/CC_Sample

The XSLT stylesheets are located in the following directory:

IntradocDir/data/components/ContentCategorizer/stylesheets

The stylesheets in this directory are used by Content Categorizer. To study or experiment with them, make copies rather than working with the originals.