Page 23
FIG. 2.5
Revised many-to-many
(M:N)relationship
using two one-to-many
(1:M and 1:N)
relationships.
An ERD folds nicely into the relational model because it was created for that purpose. Essentially, entities become tables and attributes become columns. Identifier attributes become primary keys. Relationships don't really materialize except through intersection tables. Foreign keys are created by always placing the primary keys from a table on a one side into the table on a many side. For example, a relationship of one state to many cities would call for you to place the state primary key in the city table to create a foreign key there, thus forging the relationship between the two.
Many automatic Computer Assisted Software Engineering (CASE) tools exist in the current market to help you accomplish just this mapping. Examples include LogicWorks' ERwin and Oracle's own Designer/2000. These tools permit you to not only draw the ERDs, but also specify primary and foreign keys, indexes, and constraints, and even generate standard Data Definition Language (DDL) SQL code to help you create your tables and indexes. For Oracle, you can run those scripts directly, but frequently you need to modify them, for example, to change datatypes or add storage parameters. These tools can also help you reverse engineer the logical model from an existing database which has none documented! This is especially useful when attempting to integrate databases, or when you must assume DBA responsibilities of an already built database. So, CASE tools not only help you to design and build database systems, but they also can help you to document as well.
Normalization is a refinement, or extension, of the relational model. Normalization is also a process that acts upon the first draft relational model and improves upon it in certain concrete ways that we'll discuss soon. The foundation of normalization is mathematical, like the relational model. It is based on a concept known as functional dependency (FD).
Page 24
Although it isn't necessary to get bogged down in the mathematics of functional dependency, it is useful and instructive to at least define it for our context, the relational model. A column or set of columns, Y, is said to be functionally dependent on another column or set of columns, X, if a given set of values for X determine a unique set of values for Y. To say that Y is functionally dependent on X is the same as saying X determines Y, usually written as X -> Y. Of course, the most obvious example is the primary key of a relational table uniquely determining the values of a row in that table. However, other dependencies may exist that are not the result of the primary key. The main purpose of normalization is to rid relational tables of all functional dependencies that are not the result of the primary key.
Here are the three major reasons for normalization that are usually always cited in any database analysis and design text:
In the last section, we passed quickly over how the atomic data requirement (the information principle) is tantamount to First Normal Form (1NF). But let's reemphasize it:
First Normal Form (1NF): No repeating groups. This is the same as saying that the data stored in a cell must be of a single, simple value and cannot hold more than one piece of information.
Table 2.1 lists states with cities whose populations increased at least five percent over the previous year. Because all the city information is stored in a repeating group, this table is non-normal, or not 1NF. First of all, how do we know for sure that the populations and percentages in the columns to the right of the cities belong to those cities? We could assume an order to them, of course, but this violates the fundamental relational rule that columns have no order. Worse, arrays would have to be used, and this requires end-users to know about and use a physical data structure such as an array. This surely can't make for a good user interface.
Page 25
Table 2.1 States with Cities Having >= 5% Population Increases
STATE | ABBREV | SPOP | CITY | LPOP | CPOP | PCTINC |
North Carolina | NC | 5M | Burlington, | 40K | 44K | 10% |
Raleigh | 200K | 222K | 11% | |||
Vermont | VT | 4M | Burlington | 60K | 67.2K | 12% |
New York | NY | 17M | Albany, | 500K | 540K | 8% |
New York City, | 14M | 14.7M | 5% | |||
White Plains | 100K | 106K | 6% |
To make it 1NF, move repeating groups from across columns to down rows. Table 2.2 shows the same table in 1NF, with STATE as the primary key. However, this table still suffers from problems. To update or delete state information, we must access many rows and programmatically guarantee their consistency (the DBMS won't do it). To insert city info, we must add state info along with it. If we delete the last city of a state, the state info goes with it, and so on. What's the problem? We'll see in a moment.
Table 2.2 States and Cities in First Normal Form (1NF)
STATE | ABBREV | SPOP | CITY | LPOP | CPOP | PCTINC |
North Carolina | NC | 5M | Burlington | 40K | 44 K | 10% |
North Carolina | NC | 5M | Raleigh | 200K | 222K | 11% |
Vermont | VT | 4M | Burlington | 60K | 67.2K | 12% |
New York | NY | 17M | Albany | 500K | 540K | 8 % |
New York | NY | 17M | New York City | 14M | 14.7M | 5% |
New York | NY | 17M | White Plains | 100K | 106K | 6% |
To tackle the higher normalization levels, we need a nonkey column. The strict definition of a nonkey column is simply one, which is not part of the primary key. The broader definition of a nonkey column is one that is not part of any candidate key. For our purposes, we'll take the strict definition. Essentially, the set of columns of a table can be thought of as having a primary key and the remainder. Any column that is part of the remainder is a nonkey column.
Second Normal Form (2NF): No partial dependencies. Every nonkey column depends on the full primary key, including all of its columns if it is composite. Our Table 2.2 does not currently comply with this criterion. City information does not depend on state information. Namely, all the city columns (CITY, LPOP, CPOP, and PCTINC) do not depend on the state name (STATE). Hence, we break them up into 2 tables (Tables 2.3 and 2.4). It only makes sense that states and cities are separate entities, although related, and therefore should be separate tables.