by Orryn Sledge
A properly designed database can increase data integrity and simplify data maintenance. To help you better understand how to design a database, the following concepts are discussed in this chapter:
The following problems can occur because of an improperly designed database:
As you can see in the sample table in Figure 20.1, several names and descriptions
are continuously repeated. This increases the amount of physical storage required
to track training data.
Figure 20.1.
Redundant data.
The table design in Figure 20.2 is limited to tracking three training courses
per employee. Additional columns must be added to the table if you want to track
more than three classes.
Figure 20.2.
Limited data tracking.
Consider the likelihood of a training class being misspelled when a new record
is added to the employee_training table (see Figure 20.3). As more records
are added to the table, the potential for inconsistent data from typing errors increases.
Figure 20.3.
Inconsistent data.
Suppose that you just realized that the SQL Server Admin class should be
named SQL Server Administration (see Figure 20.4). To change the class name,
you must update it in five different places. Wouldn't it be easier if you could change
the name in one place and have it automatically reflected throughout the application?
Figure 20.4.
Update anomalies.
Suppose that you are no longer interested in tracking the Introduction to COBOL
training class, so you delete matching records (see Figure 20.5). But wait...you
just realized that you deleted other important information. The removal of more than
one type of information from a table is considered a delete anomaly.
Figure 20.5.
Delete anomalies.
Suppose that you want to track a new training course titled Database Design
and you designate the code DD for training_id1. What values will
you use for employee_name and manager_name when you insert the
record into the sample table (see Figure 20.6)? Do you leave the values blank? Do
you insert a special code such as unknown for employee_name and
manager_name?
Figure 20.6.
Insert anomalies.
Normalization is a set of standard rules that test the soundness of database design. It can help prevent the problems described in the first part of this chapter. By applying these standard rules, you can pinpoint design flaws that may jeopardize data integrity and complicate data maintenance.
There are three standard normalization rules. After a design successfully passes a rule, it is said to be in # normal form (where the # represents 1st, 2nd, or 3rd). Rules are cumulative. For example, for a design to be in 3rd normal form, it must satisfy the requirements of the 3rd normal form as well as the requirements for 2nd and 1st normal forms.
Technically speaking, there are other types of normalization rules beyond 3rd normal form. However, for most database designs, the first three normal forms are sufficient. You will seldom need to apply the other types of normalization. Therefore, this section concentrates only on the 1st, 2nd, and 3rd normal forms of database design.
Now that you know the rules regarding normalization, apply them to a sample application.
For this application example, suppose that you are tracking training classes taken
by each employee. Figure 20.7 contains a denormalized listing of the data tracked
by this application. Each employee may have taken 0 or N (zero or many)
classes.
Figure 20.7.
A denormalized database design.
1st Normal Form Look at the training_id, training_description,
and training_date attributes in Figure 20.8. See how they are repeated?
This violates the concept of 1st normal form: no repeating groups.
Figure 20.8.
Repeating groups.
Move the training information into a separate table called employee_training
and create a relationship between the employee table and the employee_training
table. Now the table design meets the requirements of 1st normal form (see Figure
20.9).
Figure 20.9.
Tables that meet 1st normal form.
2nd Normal Form In Figure 20.10, notice how the training_description attribute
depends only on the training_id attribute and not on the employee_id
attribute in the employee_training table. This violates 2nd normal form:
no nonkey attributes depend on a portion of the primary key (the primary key for
this table is employee_id + training_id). This rule is applied
only to entities that have compound primary keys (a primary key consisting of more
than one attribute).
Figure 20.10.
A nonkey attribute depends on a portion of the primary key.
Move the training_description attribute into a separate table called training.
Relate the training table to the employee_training table through
the training_id attribute. Now the design satisfies 2nd normal form (see
Figure 20.11).
Figure 20.11.
Tables that meet 2nd normal form.
3rd Normal Form Look at the manager_name attribute for the employee
table in Figure 20.12. The primary key for the employee table is the employee_id
attribute. Does the manager_name attribute depend on the employee_id
attribute? No! This violates 3rd normal form: no attributes can depend on other nonkey
attributes.
Figure 20.12.
An attribute depends on a nonkey attribute.
Move the manager_name attribute into a separate table called manager.
The manager table can be related to the employee table through
the manager_id attribute. By making this change, the design meets the requirements
of 3rd normal form (see Figure 20.13).
Figure 20.13.
Tables that meet 3rd normal form.
Now you have completed the normalization process (see Figure 20.14). This process
helps isolate design flaws that would have led to an awkward and inefficient database
design.
Figure 20.14.
A normalized database design.
Denormalization means that you are purposely designing your database so
that it is not in 3rd normal form. This is done to maximize performance or
to simplify end-user reporting. Whenever you denormalize a database, you must be
willing to forego the benefits gained from the 3rd normal form.
NOTE: I recommend that you start your initial database design in 3rd normal form. If you find that performance problems exist, selectively step back to 2nd or 1st normal form. Keep in mind that when you denor-malize a database, you do so for a specific set of application requirements. Future requirements may not need or benefit from past denormalization decisions. Only denormalize when you have to.
A database design in 3rd normal form may require more table joins to process a query than a design in 2nd or 1st normal form. These additional table joins can be expensive in terms of CPU and disk I/O.
Suppose that you need a report that lists the training classes taken by each employee
(see Figure 20.15).
Figure 20.15.
A sample report.
NOTE: The examples used in this section are overly simplistic; however, they do explain how multitable joins can complicate data processing.
To retrieve the data from your fully normalized database, you create the following query, which is a sample query for a fully normalized database:
SELECT a.employee_name, d.manager_name, c.training_description,b.training_date
FROM employee a, employee_training b, training c, manager d
WHERE a.emp_id = b.emp_id
AND b.training_id = c. training_id
AND a.manager_id = d.manager_id
As you can see, this simple report requires four tables to be joined. Assume that each table contains one million rows. Can you imagine the work involved to join four tables, each containing one million rows? You can be assured that performance will suffer.
To maximize performance, you sometimes have to step back to 2nd or 1st normal form. If you denormalize your data into a single table, you can use the following query, which is a sample query for a denormalized database:
SELECT employee_name, manager_name, training_description,
training_date
FROM training_summary
Another reason to denormalize a database is to simplify ad-hoc reporting. Ad-hoc reporting is the unstructured reporting and querying performed by end users. End users are often confused when they have to join a significant number of tables. To avoid the confusion, DBAs can create a special set of tables designed for ad-hoc reporting. If the data is used for reporting and not online processing, you can avoid some of the problems associated with a denormalized design.
NOTE: Views can sometimes be used as an alternative to denormalization. Views can present your data in a denormalized manner, which can simplify ad-hoc reporting.
Following is a brief summary of the various techniques you can use to denormalize a database:
Figure 20.16.
Horizontal partitioning.
Figure 20.17.
Vertical partitioning.
Following are important notes to remember when designing databases:
You can gain tangible benefits by understanding and following the rules of normalization.
When a normalized design is not feasible, you should selectively denormalize. Keep
in mind the concepts discussed in this chapter when you read the next
chapter--your normalization strategy might influence your indexing strategy.
To order books from QUE, call us at 800-716-0044
or 317-361-5400.
For comments or technical support for our books and software, select Talk to Us.
© 1997, QUE Corporation, an imprint of Macmillan Publishing USA, a Simon and Schuster
Company.