Integrate InfoSphere Guardium Data Redaction with IBM Classification Module

InfoSphere Guardium Data Redaction is a product aimed at achieving a balance between openness and privacy. Often, the same regulations require organizations to share their documents with regulators, business partners, or customers, and at the same time to protect sensitive information that may be buried in these documents. With thousands of documents in Enterprise Content Management systems such as IBM FileNet® and IBM Content Manager®, automation combined with a well-structured workflow is essential for controlling access to private information in documents at a fine grain.
For example, in eDiscovery, lawyers must share documents with opposing counsel. But lawyers do not want to release any information they don't need to, and attorney-client privileged information must be carefully protected. Similarly, the Freedom of Information Act (FOIA) is intended to hold government organizations more accountable for their actions by making information about those actions available on demand. However, the same regulation requires that those requesting the documents must not see any sensitive personal or national security information embedded in documents that might be made public.
The InfoSphere Guardium Data Redaction product automatically finds and deletes sensitive text within a document, redacting the document. It then outputs the redacted document in a format such as PDF. Alternatively, the product includes a web-based Secure Viewer for even more control over the release of private information. Each user sees just what they are allowed to see. In some cases, even if a user is allowed to see some information, it is withheld unless they ask for it, specifying the reason for their need to know.
Within an organization, not all documents contain sensitive data. For the redaction to be effective, it is critical that relevant documents be identified. InfoSphere Guardium Data Redaction is capable of identifying and redacting many types of personally identifiable information, but not all occurrences constitute sensitive data. The sensitivity of the entities often depends on context. For example, names of medical procedures in an administrative document catalog are not sensitive, but in patient records they are. IBM Classification Module is capable of identifying sensitive documents containing data that requires redaction.
The level of sensitivity varies across documents of different types. A group of documents from one department within the organization may require a customized redaction policy. Other groups of documents may have been created for public consumption, and it can be assumed that these documents contain no sensitive data. These document groupings may or may not be part of a formalized classification system.
Below is an example of a sensitive document, and its redacted version. Personal names, addresses, account and telephone numbers have been removed.

Figure 1. An overview of the redaction process
Shows original document, sensitive information removed, and resulting document


There are different formats available for the redacted version of the document. In addition to the usual formats (PDF, Microsoft Word document, TIFF, text, and so on), a proprietary format is available that can be viewed by the Secure Viewer (an application shipped with InfoSphere Guardium Data Redaction).
IBM Classification Module is capable of identifying documents according to a large range of criteria, including statistical classification and rule-based decisions. The implementation involves these stages:

1. Create a knowledge base and train it using user-defined groups of sample documents.
2. Create a decision plan that will:
   - Categorize new documents based on the knowledge base results.
   - Move documents to relevant folders.
3. Run the Classification Module Classification Center using the created decision plan. Documents are moved to relevant folders.
4. Run redaction batch processes on the repository folders. Redacted versions of the documents are created; original copies are kept.

The implementation described here involves documents stored in a file system. Both Classification Module and InfoSphere Guardium Data Redaction are capable of accessing and processing documents on IBM FileNet and IBM Content Manager systems.
The workflow described here uses IBM Classification Module's Classification Center to classify documents into a taxonomy tree.
For information on how to create a knowledge base and decision plan, and to set up Classification Center for classifying documents into folders, see the IBM Classification Module Information Center.
Guardium Data Redaction then redacts documents in two different category folders, nested within the Repository Folders, according to two different redaction policies. Guardium Data Redaction uses a specific folder structure (repository folders) which serves as the basis for its data processors.
The workflow described here involves these steps:

1. Set the configuration for redaction: Configure two processors in InfoSphere Guardium Data Redaction.
2. Start the Data Redaction server in order to create the relevant processors and their repository folders.
3. Create the Classification Module knowledge base and decision plan.
4. Run the Classification Module Classification Center to move the documents to the redaction in folders.
5. Restart the InfoSphere Guardium Data Redaction server to redact documents and move them to the appropriate folders for further processing.

Set the configuration for redaction
Before running the Classification Module Classification Center or InfoSphere Guardium Data Redaction, the processors should be set up.
Configure two repositories
Two separate processors (Legal and IBM Global Financing) are defined in two processor configuration files found in the IBM\GuardiumDataRedaction\server\conf folder.
Each processor has one configuration file named in the IBM\GuardiumDataRedaction\server\conf\plugins.xml file:

Listing 1. Sample processor setup in plugins.xml


com.ibm.nex.redaction.docrepository.SimpleFilesDocumentRepository
batchFileSystemProcessorIBM_Legal.xml

com.ibm.nex.redaction.docrepository.SimpleFilesDocumentRepository
batchFileSystemProcessorIBM_Finance.xml

Each XML configuration file contains the following settings:

- The base folder for the repository. This folder should match the directory used by Classification Center, for example:
  c:/data/IBM Products CC Output Folder
- Repository folder name. The folder name should exactly match the associated category name in the Classification Module knowledge base.
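Because the folder names must match the category names exactly, a quick check before starting the server can catch a mismatch early. The following is a minimal Python sketch and is not part of either product; the function name and the example folder and category names are hypothetical, and it assumes every folder directly under the base folder is named after a category.

```python
from pathlib import Path


def find_unmatched_repositories(base_folder, category_names):
    """Return folder names under base_folder that match no category name.

    The comparison is exact and case-sensitive, mirroring the requirement
    that a folder name match its knowledge base category name exactly.
    """
    categories = set(category_names)
    folders = sorted(p.name for p in Path(base_folder).iterdir() if p.is_dir())
    return [name for name in folders if name not in categories]
```

An empty result means every folder has a matching category; any name returned points to a folder that would need to be renamed or a category that is missing.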

Setting different data policies
We will set two policies:

Legal role: US dollar amounts are redacted.
Financial role: Organization names are redacted.

These profiles are configured in the XmlPolicyModel.xml file in IBM\GuardiumDataRedaction\server\conf.
Each ns21:permission element maps one role to one category. The ns21:redact element sets this as a redacted category. The categories are mapped within the same file.
Below, each user has one redacted category. Each mapping maps a single user to a single category. The user role (userRoleID) and category (semanticCategoryId) are configured elsewhere in the same file. Here, each category is set to redacted.
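The effect of the two policies can be illustrated with a small Python sketch. This is an assumption-laden toy, not the product's policy engine: the role and category names, the `redact_for_role` function, and the regular expressions are all hypothetical stand-ins for the permissions in the policy file and for the product's own entity recognition.

```python
import re

# Hypothetical stand-ins for the role-to-category permissions in the policy file.
REDACTED_CATEGORIES = {
    "Legal": {"US_DOLLAR_AMOUNT"},
    "Financial": {"ORGANIZATION"},
}

# Toy detectors for illustration only; the real product uses its own
# entity recognition rather than these patterns.
DETECTORS = {
    "US_DOLLAR_AMOUNT": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "ORGANIZATION": re.compile(r"\b(?:IBM|Acme Corp)\b"),
}


def redact_for_role(text, role):
    """Replace every span whose category is redacted for the given role."""
    for category in sorted(REDACTED_CATEGORIES.get(role, ())):
        text = DETECTORS[category].sub("[REDACTED]", text)
    return text
```

With these assumptions, the same sentence yields a different redacted version for each role, which is the behavior the one-role-to-one-category mapping is meant to produce.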

Listing 2. Legal role

Listing 3. Financial role

Start the InfoSphere Guardium Data Redaction server
From the IBM InfoSphere Guardium Data Redaction Windows menu, choose Start server. This will start the server and create the configured repositories. You can optionally stop the server in order to prevent it from processing the files created by the Classification Center before you have checked them. If the in folder becomes populated while the Data Redaction server is running, these files will be picked up for processing.

Create the Classification Module knowledge base and decision plan
Classification Module Classification Center is capable of copying or moving files within a file system and of reading or modifying metadata associated with a document within a full content management system. These actions are based on a series of decisions made within a decision plan running on the Classification Module server. Although the decision plan takes actions based on rule triggers, its rules can also consider results from statistical analysis of the document content returned by the knowledge base (also running on the server). The knowledge base typically assigns a category to the document, based on statistical similarities.
For details on how to create a knowledge base and decision plan, see the Classification Module InfoCenter Workbench topic in the Information Center, accessible from the Resources section.
Create the knowledge base
Classification Module Workbench is shipped with a project called IBM Products. This project contains the basis for the knowledge base used here. The following figure shows the list of categories.

Figure 2. The IBM Products knowledge base
Explorer view of the knowledge base The IBM Products Knowledge Base


The knowledge base structure mimics the target folder structure. The following figure shows the folder structure, each folder named after a category.

Figure 3. The folder structure
opened up explorer view of the folder structure for organizing classified documents


Create the decision plan
The decision plan includes a set of rules. Below is an example of a rule that moves documents to the target folders based on the highest category match (for an example of such rules, see the Rules for File System project in Classification Module Workbench).

Figure 4. The decision plan (first rule)
The rule for matching the document against the knowledge base.


The folders that will be redacted are a special case. The figure below shows an action for moving the document to the in subfolder within a redaction repository:

Figure 5. The decision plan (second rule)
The rule for moving files to the correct repository folder.
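The logic of the two rules can be sketched in Python as follows. This is only an illustration under stated assumptions, not decision plan syntax: the `target_folder` function, the score dictionary, and the category names in the usage note are hypothetical.

```python
from pathlib import PurePosixPath


def target_folder(scores, base_folder, redaction_categories):
    """Pick the destination folder for one document.

    scores maps a category name to the match score returned by the
    knowledge base. The document goes to the folder named after the
    highest-scoring category; if that category is a redaction
    repository, it goes to the repository's in subfolder instead.
    """
    if not scores:
        raise ValueError("no category scores for this document")
    best = max(scores, key=scores.get)
    folder = PurePosixPath(base_folder) / best
    if best in redaction_categories:
        folder = folder / "in"
    return str(folder)
```

For example, a document whose best match is Legal would be routed to the Legal/in folder under the base folder, while a document whose best match is an ordinary category would go directly to the folder named after that category.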


Run Classification Module Classification Center
For details on how to set up Classification Center for classifying documents into folders, see InfoSphere Classification Module InfoCenter Classification Center topic.
Once Classification Center is run, the documents for redaction should be moved to the redaction in folders; non-redacted documents should be moved to the Products subcategories within this structure. The figure below shows the in folder for two repositories and other non-repository folders named after categories.

Figure 6. The redaction repository file structure

The redaction repository file structure; Classification Center inserts documents into the input directories.

Check to see that the above folders were populated by Classification Center.
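This check can also be scripted. The following Python sketch counts the files waiting in each repository's in subfolder; the function name is hypothetical, and the repository names you pass in are whatever you configured.

```python
from pathlib import Path


def count_pending_documents(base_folder, repository_names):
    """Return {repository name: number of files in its in subfolder}.

    A repository whose in subfolder does not exist is reported as 0.
    """
    counts = {}
    for name in repository_names:
        in_folder = Path(base_folder) / name / "in"
        if in_folder.is_dir():
            counts[name] = sum(1 for p in in_folder.iterdir() if p.is_file())
        else:
            counts[name] = 0
    return counts
```

A count of zero for a repository suggests that Classification Center did not route any documents to it, or that the folder name does not match the category name.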
The following figure shows two folders (Financial and Legal) that will also serve as Data Repository folders:

Figure 7. The Financial and Legal repository folders
Explorer view of the Financial and Legal repository folders


Here, Classification Center moves files to the in subfolder of each repository folder.

Restart the InfoSphere Guardium Data Redaction server
From the IBM InfoSphere Guardium Data Redaction Windows menu, choose Start server. Because the in folders of the two new repositories now contain the documents moved there by Classification Center, Data Redaction will now process these files.
The figure below shows the orig and out folders within each repository structure.

Figure 8. The out folder now contains redacted documents. The orig folder contains the original copies.
The out folder now contains redacted documents. The orig folder contains the original copies.


Data Redaction processes documents from the in folder and creates redacted and non-redacted versions in the respective folders:
- orig folders: original documents
- out folders: redacted copies
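To confirm that every original has a redacted counterpart, the two folders can be compared. The Python sketch below is hypothetical and assumes that a redacted copy keeps the base file name of its original (the extension may differ, for example when the output format is PDF); verify that assumption against your own output before relying on it.

```python
from pathlib import Path


def originals_without_redacted_copy(repository_folder):
    """List files in orig that have no same-named file (any extension) in out."""
    repo = Path(repository_folder)
    redacted_stems = {p.stem for p in (repo / "out").iterdir() if p.is_file()}
    return sorted(
        p.name
        for p in (repo / "orig").iterdir()
        if p.is_file() and p.stem not in redacted_stems
    )
```

An empty list means each original document has a matching redacted copy under that naming assumption.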
The percentage of files that are sent for review depends on the percentage set in the relevant repository file (such as batchFileSystemProcessorIBM_Legal.xml above):
0
We now have various versions, redacted and non-redacted, of our original documents classified into folders. There are various aspects of this model that can be adapted according to business needs.

Some ideas for varying the model
Finding sensitive documents for redaction without subject classification
Where the only goal is to locate sensitive data, there is no need for conventional content classification. In this case, a Classification Module knowledge base can be created that recognizes the nature of the sensitive documents, and the decision plan can be used to move only those documents to the Redaction repository folder. There is no need for a folder dedicated to Classification Center output. Because the two-category knowledge base is often used for finding a few relevant items within a large content set, this method is often called "pinpointing." However, it can also be used for finding a large group of similar documents among non-relevant documents.

Figure 9. The pinpointing knowledge base
A two-category knowledge base for finding sensitive material within a large collection of documents.


To create such a knowledge base, choose a number of sensitive documents and an equal number of non-sensitive documents.
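Assembling such a balanced sample can be sketched in Python. The function below is a hypothetical helper, not part of Classification Module; the category labels "Sensitive" and "Other" are placeholders for whatever names you give the two categories.

```python
import random


def balanced_training_set(sensitive_docs, other_docs, seed=0):
    """Return (document, label) pairs with equal numbers from each group.

    The larger group is randomly down-sampled to the size of the smaller
    one; the fixed seed makes the selection repeatable.
    """
    rng = random.Random(seed)
    n = min(len(sensitive_docs), len(other_docs))
    sample = [(doc, "Sensitive") for doc in rng.sample(list(sensitive_docs), n)]
    sample += [(doc, "Other") for doc in rng.sample(list(other_docs), n)]
    return sample
```

Down-sampling the larger group is only one way to reach equal counts; collecting more examples for the smaller group is an alternative when enough documents are available.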
Adding manual review of the Classification Center output before and/or after redaction
The Classification Center can be used to manually review documents before they are sent for redaction.
This method is useful when the system is first put into production and knowledge base confidence may be low. In addition, feedback can be submitted to improve the knowledge base.
The Redaction Manager can be used to review documents after they are classified and redacted. The document redaction can be edited or removed, and the document can be sent to another Repository Folder for redaction according to a different policy.
Using multiple pinpointing knowledge bases
Multiple knowledge bases could be set up for pinpointing specific documents for redaction. One or more processes could be implemented consecutively according to need, until all documents are moved to a folder for redaction. This would be helpful, for example, in the case where new sensitive documents of a different nature need to be located for redaction, or where the nature of new documents changes.
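The idea of applying several pinpointing knowledge bases one after another can be sketched as a simple cascade. In this hypothetical Python sketch, each knowledge base is represented by a plain function that returns True when a document matches; in practice that decision would come from Classification Module, not from code like this.

```python
def route_documents(documents, pinpointers):
    """Apply pinpointing checks in order; the first match decides the folder.

    pinpointers is a list of (folder_name, is_match) pairs, where is_match
    is a callable standing in for one pinpointing knowledge base.
    Returns (routed, remaining): a dict of folder name to documents, and
    the documents that no check matched.
    """
    routed = {}
    remaining = []
    for doc in documents:
        for folder_name, is_match in pinpointers:
            if is_match(doc):
                routed.setdefault(folder_name, []).append(doc)
                break
        else:
            # No pinpointer matched this document.
            remaining.append(doc)
    return routed, remaining
```

Adding a new pinpointing step then amounts to appending one more pair to the list, which matches the case where documents of a new nature need to be located for redaction.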

Resources
Learn

Learn more about InfoSphere Classification Module in the InfoSphere Classification Module Information Center.
For details on how to create a knowledge base and decision plan, see the InfoSphere Classification Module InfoCenter Workbench topic.
For details on how to set up Classification Center for classifying documents into folders, see the InfoSphere Classification Module InfoCenter Classification Center topic.
Learn more about Guardium and the redaction process in the article "Integrate a document data redaction process in your business workflow using IBM InfoSphere Guardium Data Redaction" (developerWorks, Sep 2011).
Get the resources you need to advance your skills on IBM InfoSphere products in the InfoSphere section on developerWorks.
Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
Follow developerWorks on Twitter.
Learn more about Information Management at the developerWorks Information Management zone. Find technical documentation, how-to articles, education, downloads, product information, and more.

Get products and technologies

Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Participate in the discussion forum.
Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
Check out the developerWorks blogs and get involved in the developerWorks community.

About the author

Jane Singer is on the QA teams for both InfoSphere Guardium Data Redaction and InfoSphere Classification Module at the IBM Israel Software Lab. In addition, she leads L3 and presales support for InfoSphere Classification Module.
