Data Categorization

Once a baseline identity inventory has been created, the next step is to categorize the data. We do that using three different techniques. The first is called the identity data audit and asks additional questions about the data in the inventory. The second technique creates an identity map that specializes the identity lifecycle for each data source. The final technique is a process-to-identity matrix that helps us to easily see which identities support which processes.

The purpose of the identity data audit is to answer additional questions about the baseline identity inventory. Identity data audits should be done periodically, say, once a year, as part of maintaining the IMA. During these audits, the inventory is updated and additional data is gathered that is useful in managing the identity data, doing risk assessments for security and privacy purposes, and protecting against loss.

Someone other than the owner or custodian of the identity data should do the audit, although it might be done on the owner's behalf. In a large organization, the CIO's office would be responsible for performing the audit. The audit consists of gathering information through a series of questions and then evaluating the information gathered. The questions we ask are very similar to the kinds of questions we discussed for privacy audits in Chapter 4. A privacy audit can easily be folded into the identity data audit, but be sure to include a specific evaluation of privacy in the results.

Here are some of the questions you might want to ask:

  • What is the purpose of collecting this data?

  • How is this data being collected?

  • Were special conditions on its use, such as privacy policies or non-disclosure agreements, established at any time?

  • Who are the data owner and custodian, if not recorded in the inventory?

  • Who else can make changes to or administer the data?

  • Who uses the data, why, and how do they usually access it (e.g., remotely, via the Web, or from home)?

  • Where and how is it stored?

  • Is the data dependent on other sources of identity data (such as another directory)? If so, who owns those stores?

  • Is this data record the canonical or authoritative source for this information? If not, which data record is?

  • What is the schema for the data?

  • Is there a related domain model?

  • How large are the records?

  • How many records are stored? What is the maximum expected?

  • What are the data transactions (SQL or other queries) routinely performed on the data store?

  • Is any of the data stored on devices that are routinely transported off-site, such as a laptop or PDA?

  • Are there backups? If so, you need to answer these same questions about the backups.

  • How critical is the data to the business?

  • What are the tolerances for data loss or corruption?

  • Where and how is encryption used in the record?

  • Is the data synchronized to another repository or regularly accessed through a bulk data transfer? If so, you need to answer these same questions about any other repository for this data.

  • Are there access logs for the data?

  • Where are the logs stored?

  • Are the logs protected, and who has access and administrative rights to them?

  • What other security measures (firewalls, intrusion detection systems, and so on) are used to protect the data?

  • Who administers those systems?

These are only example questions. You may think of others that need to be asked that are specific to your organization or the data in the inventory. Using the results of the audit, you should evaluate the maturity level of this identity data. Don't hesitate to assign different maturity levels to different aspects of the identity data. The results of the audit should be shared with the owner and custodian of the identity data so that they can offer feedback and correct misperceptions or mistakes.
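
One way to keep the answers organized is to record them in a simple structure, one per inventory entry. The following is a minimal sketch in Python, not a prescribed format. The class name, the field names, and the numeric maturity values are all assumptions chosen for illustration, and they cover only a handful of the questions above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class AuditRecord:
    """Answers gathered for one inventory entry (hypothetical field names)."""
    name: str
    purpose: Optional[str] = None
    owner: Optional[str] = None
    custodian: Optional[str] = None
    is_canonical: Optional[bool] = None
    record_count: Optional[int] = None
    has_backups: Optional[bool] = None
    has_access_logs: Optional[bool] = None
    # Maturity can differ by aspect, e.g. {"storage": 3, "logging": 1}.
    # The numeric scale here is an assumption, not a defined standard.
    maturity: dict = field(default_factory=dict)

    def unanswered(self):
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example entry with only some of the questions answered so far.
employee = AuditRecord(
    name="employee record",
    purpose="HR and payroll",
    owner="HR",
    is_canonical=True,
)
print(employee.unanswered())
```

Leaving unanswered questions as None, rather than guessing, makes it easy to see what remains to be gathered before the results go back to the owner and custodian for review.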

As you conduct the audits, you should be on the lookout for the following kinds of issues:

  • Identity data that is never used

  • Identity data that is owned by someone who doesn't use it

  • Owners at the wrong level (e.g., identity data that has strategic importance but is controlled by a small tactical organization)

  • Incomplete relationships between identity records or relationships that are enforced outside the system

  • Identity data that is not located on the same system where it is used

  • Identity data for which there are multiple copies and no canonical source
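
Some of these issues can be flagged mechanically once the audit answers are recorded. The sketch below assumes each audited entry is a plain Python dictionary; the keys (users, owner, copies, canonical_source, stored_on, used_on) are hypothetical names invented for this example. Judgment calls, such as whether an owner sits at the wrong level, still require a human reviewer.

```python
def find_issues(entry):
    """Return a list of audit warnings for one entry (hypothetical keys)."""
    issues = []
    users = entry.get("users", [])
    if not users:
        issues.append("never used")
    elif entry.get("owner") not in users:
        issues.append("owner does not use the data")
    # More than one copy is a problem only when no copy is authoritative.
    if entry.get("copies", 1) > 1 and not entry.get("canonical_source"):
        issues.append("multiple copies, no canonical source")
    if entry.get("stored_on") != entry.get("used_on"):
        issues.append("stored on a different system than where it is used")
    return issues


# Illustrative entry; every value here is made up for the example.
badge = {
    "name": "badge record",
    "owner": "Facilities",
    "users": ["Security"],
    "copies": 2,
    "canonical_source": None,
    "stored_on": "badge-db",
    "used_on": "badge-db",
}
print(find_issues(badge))
```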

Audit data should be stored in a data repository to which multiple people can be given access. For example, some of the information in the audit will be useful for capacity planning, while other information will be useful for security audits. Sharing the data as widely as possible will build support for the process.

In Chapter 5, we discussed the lifecycle for digital identities. The purpose of creating an identity map is to instantiate that general model for each identity record. This should result in a flowchart that shows how and where the identity record is created, how it is used, and where it ends up. Figure 16-2 shows an example identity map for an employee record.

The map in Figure 16-2 is greatly simplified from what would really happen to an employee record in even a small organization. However, it illustrates the point that the identity map is an instance of the general lifecycle. The important parts of the diagram show:

  • Details about how the record is created

  • Specific instances of how the record is used

  • The process by which the record is updated

  • How the record is eventually deleted

All of the dependencies between this record and other systems are recorded in the map, as well as the critical systems that can modify the record. Note that in this simple diagram, the HR system is the only system shown that updates the record, but that's not likely to be the typical case. For example, the payroll system may have access to and update portions of the employee record.


Identity maps are especially useful when you are trying to find the interactions between identity systems and rearchitect them as part of system or infrastructure upgrades. You need not complete identity maps for every identity record in your organization at one time; instead, build them as process improvement plans dictate.
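
Although an identity map is usually drawn as a flowchart, the same information can be kept in a form that is easy to query. The following sketch assumes a map is a list of steps, each tagged with one of the four phases named above (create, use, update, delete). The steps themselves are illustrative only and are not a transcription of Figure 16-2; the email system in particular is a hypothetical example.

```python
from collections import namedtuple

PHASES = ["create", "use", "update", "delete"]

Step = namedtuple("Step", ["phase", "system", "description"])

# Illustrative map for an employee record (assumed steps, not Figure 16-2).
employee_map = [
    Step("create", "HR system", "record created when a person is hired"),
    Step("use", "email system", "mail account set up from the record"),
    Step("update", "HR system", "title or department changes"),
    Step("update", "payroll system", "payroll portion of the record changes"),
    Step("delete", "HR system", "record removed after termination"),
]


def modifying_systems(identity_map):
    """Systems that create, update, or delete the record."""
    return sorted({s.system for s in identity_map if s.phase != "use"})


def missing_phases(identity_map):
    """Lifecycle phases that the map does not yet describe."""
    seen = {s.phase for s in identity_map}
    return [p for p in PHASES if p not in seen]


print(modifying_systems(employee_map))
print(missing_phases(employee_map))
```

Listing the systems that modify a record is a quick way to find the dependencies that matter most when a system is replaced, and an empty result from the second function suggests the map covers the whole lifecycle.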

The purpose of the process-to-identity matrix is to relate identities to the processes that they support in a visual way. Using the matrix, we can easily identify records that are used by multiple processes, as well as areas where different records serve the same purpose in different processes and might be combined.

Table 16-1 shows an example process-to-identity matrix drawn from the employee provisioning process that we discussed earlier in this chapter. In this matrix, we can see the identity records used by the employee provisioning process as well as how some of those same records are used by the asset management process, the financial audit process, and the enterprise sales process.


Notice that the enterprise sales process doesn't use the employee record or phone record. At the same time, if we were to dig into the inventory and audit data for the salesman performance record, we might find that it contains employee and phone information for the sales team. This illustrates how we can use the matrix to improve processes by normalizing data so that the enterprise sales process derives important employee data from canonical sources.
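
The matrix can also be held as data so that shared records are found automatically. In the sketch below, the process and record names come from the discussion above, but the individual cell entries are assumptions made for illustration and should not be read as a copy of Table 16-1. The function names are likewise hypothetical.

```python
# Each process maps to the set of identity records it uses.
# Cell entries are illustrative assumptions, not a copy of Table 16-1.
matrix = {
    "employee provisioning": {"employee record", "phone record"},
    "asset management": {"employee record", "phone record"},
    "financial audit": {"employee record"},
    "enterprise sales": {"salesman performance record"},
}


def processes_by_record(matrix):
    """Invert the matrix: for each record, the processes that use it."""
    result = {}
    for process, records in matrix.items():
        for record in records:
            result.setdefault(record, set()).add(process)
    return result


def shared_records(matrix):
    """Records that support more than one process."""
    inverted = processes_by_record(matrix)
    return sorted(r for r, procs in inverted.items() if len(procs) > 1)


print(shared_records(matrix))
```

A record that appears under only one process, such as the salesman performance record here, is a natural candidate for the closer look described above, since it may duplicate data that other processes draw from canonical sources.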