by Mark Spenik
You may be asking yourself, "What in the world is a chapter on indexes and index
selection doing in a book about database administration?" My experience as a
DBA has shown me that it is important to understand indexes and how SQL Server uses
indexes to help developers when they are stuck and come to the DBA looking for some
words of wisdom and advice. Not only can you seem like a SQL Server guru by helping
a developer quickly tune a query by adding an index, you need to be aware of the
disk space requirements of indexes and the performance impact of too many indexes
on a single table by an overzealous developer.
This chapter gives you a very basic understanding of an index, the type of structures used by indexes, and how to help developers select the proper indexes for their applications.
To get started, you need to know what an index is. An index is a separate physical database structure created on a table that facilitates faster data retrieval when you search on an indexed column. SQL Server also uses indexes to enforce uniqueness on a row or column in a table or to spread out the data on various data pages to help prevent page contention.
Take a high-level look at how indexing can help speed up data retrieval. Figure 21.1 shows a single table called School Employee that lists the name and occupation of each employee in the school.
Figure 21.1.
The School Employee table.
Using the table shown in Figure 21.1, what would you do if you wanted to select the names of all the people in the School Employee table who were janitors? You would have to read every row in the table and display only the names where the occupation in the row is Janitor. The process of reading every row or record in a table to satisfy a query is called a table scan. Now add an index to the Occupation column. Fig- ure 21.2 shows the index on the Occupation column of the School Employee table.
Figure 21.2.
An index on the Occupation column of the School Employee table.
The type of index shown in Figure 21.2 contains a pointer to the data. Using the index shown in Figure 21.2, walk through the same query to find the names of all employees who are janitors. Rather than performing a table scan on the School Employee table, you read the first row of the index and check the occupation until you find Janitor. When a row contains Janitor, you use the value in the row pointer column to find the exact row number in the School Employee table of a Janitor. You continue to read the index as long as the occupation is Janitor. When the occupation is no longer Janitor, you stop reading the index. Pretty simple right?
Apply to SQL Server this little bit of knowledge of tables and indexes just described. For starters, SQL Server stores data and index information on a data page (see Figure 21.3).
Figure 21.3.
The SQL Server data page.
A data page is 2048 bytes in size with a 32-byte header. The remaining 2016 bytes are used for data (that is, table or index information).
TIP: SQL Server allocates space for tables and indexes eight pages at a time, a grouping called an extent. When the extent is filled, another extent (8 pages) is allocated. Remember this tip if you take the Microsoft SQL Server certification test. The odds are that you will be asked a question on index and table space allocation.
Suppose that the School Employee table, used earlier, contained other information, such as the employee's home address, phone number, spouse's name, education, and number of years of service. The size of a single row of information for the School Employee table would be around 480 bytes; the maximum size of the Occupation column is 25 bytes. Using these row and column sizes, place the School Employee table (shown in Figure 21.1) and the index (shown in Figure 21.2) on SQL Server data pages (see Figure 21.4).
NOTE: The layouts of the table information and index information on the data pages shown in Figure 21.4 are not the actual index and table layouts used by SQL Server. They are used here to help you understand the general idea behind data pages, table scans, and indexes.
Figure 21.4.
The School Employee table and index on SQL Server data pages.
Using the diagram shown in Figure 21.4, how would SQL Server find all the employee
names whose occupation is Assistant Principal without the index? First,
SQL Server would read data page 5 and search each record for an employee with an
occupation equal to Assistant Principal. No records are found on the first
page. SQL Server reads the second data page (page 10), searches each record, and
displays the fourth record. Because SQL Server has no way of knowing how many records
there are with the occupation of Assistant Principal, SQL Server reads and
searches the third and final page, page 12.
What happened? SQL Server performed a table scan, reading all the data pages. In this example, a table scan did not seem all that bad because SQL Server only had to read three data pages. But what if the School Employee table had 1,000 times more records for a total of 9,000 records (a small amount for SQL Server)? SQL Server would have to read 3,000 data pages rather than 3 data pages to find all the Assistant Principals, even if there was only one employee who was an Assistant Principal and that record was located on the first data page.
Walk through the same query using the index. First, SQL Server reads the index page and begins to search for Assistant Principal. The first row read is Assistant Principal. SQL Server then checks the pointer, which tells SQL Server that the record is located on data page 10, row 4. SQL Server reads data page 10, goes to the fourth row, and displays the name.
The next row in the index page is checked; because the occupation is not Assistant Principal, SQL Server stops. The number of pages read using the index is two pages as opposed to the three pages read in the table scan example.
What about the Janitor query used earlier? Performing a table scan requires SQL Server to read all three data pages. Using the index requires SQL Server to read all three data pages plus the index page, for a total of four data pages--one more than a table scan! In some cases, a table scan may be faster than using the index. It's the job of the SQL Server query optimizer to determine which index to select and when to perform a table scan.
TIP: You will read many recommendations in this book about keeping indexes small and the row width of a table small for maximum performance. All too often, the reasoning behind small row and index width is left out. It boils down to data pages and how many data pages SQL Server has to read to fulfill a query. Suppose that you have a table with 500,000 rows; the size of a row (with overhead bytes) is 250 bytes or (2048 bytes in a data page - 32 bytes of overhead) / 250 bytes = 8) 8 records per data page. The number of data pages required for all 500,000 records is (500,000 / 8 = 62,500) 62,500 data pages. Suppose that you look at your overall table design and decide that you can shrink the size of the maximum row width just over 10 percent so that 9 rather than 8 records fit on a data page. The number of data pages is reduced by almost 7,000 data pages! An index data page is the same. Indexing a 20-character field (with overhead) and a 6-byte field (with overhead), for example, is the difference between 100 keys per data page and 336 keys per data page. The larger index requires SQL Server to read three times as many data pages to access the same number of keys. (This does not take into account the added B-Tree levels caused by a larger index key!) Normalize your tables and select smart indexes.
SQL Server maintains indexes with a B-Tree structure (see Figure 21.5). B-Trees are multilevel self-maintaining structures.
Figure 21.5.
The B-Tree structure.
A B-Tree structure consists of a top level, called the root; a bottom level, called the leaf (always level 0); and zero to many intermediate levels (the B-Tree in Figure 21.5 has one intermediate level). In SQL Server terms, each square shown in Figure 21.5 represents an index page (or data page). The greater the number of levels in your index, the more index pages you must read to retrieve the records you are searching for (that is, performance degrades as the number of levels increases). SQL Server maintains two different types of indexes: a clustered index and a nonclustered index.
A clustered index is a B-Tree structure where level 0, the leaf, contains the actual data pages of the table and the data is physically stored in the logical order of the index.
NOTE: When a clustered index is created, a lot of disk I/O occurs, the data pages are ordered, the index pages are created, and the nonordered data pages are deleted. Creating a clustered index requires you to have free space in the database that amounts to approximately 1.2 times the amount of data in the table.
Figure 21.6 shows a clustered index on the Name column in the School Employee table. Notice that the data pages are the leaf pages of the clustered index and that the data is stored in logical order on the data pages.
Figure 21.6.
A clustered index on the Name column of the School Employee table.
NOTE: Because data is physically ordered on the data pages, you can have only one clustered index per table. Select the clustered index wisely.
With a nonclustered index, the leaf level pages contain pointers to the data pages and rows, not the actual data (as does the clustered index). A nonclustered index does not reorder the physical data pages of the table. Therefore, creating a nonclustered index does not require the large amounts of free disk space associated with creating a clustered index. Figure 21.7 shows a nonclustered index on the School Employee table. Notice that the data in the data pages is in the order in which the data was inserted, not in the order of the index key. Also note that the nonclustered index adds one more level by always arriving at the leaf and then having to read the data page.
Figure 21.7.
A nonclustered index on the Name column of the School Employee table.
TIP: You can have up to 249 nonclustered indexes on a table, although you would never want to create anywhere near 249 indexes on a single table. A large number of indexes on a single table affects the performance of other operations, such as UPDATE, DELETE, and INSERT. In SQL Server 6.5, a single-column index cannot exceed 256 bytes in width; a composite index cannot exceed 900 bytes in width (in SQL Server versions before 6.5, the limit was 256 bytes, regardless of index type). Any index cannot exceed 16 columns. Again, you would never want an index that is 900 bytes in width or that contains 16 columns. Remember to use narrow-width indexes to maximize the number of index keys on a data page. This improves performance by requiring less disk I/O to scan the index. Try not to exceed 4 columns when creating indexes. A table can have both clustered and nonclustered indexes. Because you are allowed to have only a single clustered index on a table, you can meet your other indexing needs with nonclustered indexes. Try not to over-index; in many cases, a clustered index and two to four nonclustered indexes are more than sufficient.
It is widely known that an index can help speed data retrievals; from time to time, you may hear someone say that indexes slow down other operations, such as inserts, updates, and deletes--which is true. It has been mentioned that B-Tree data structures are, for the most part, self-maintaining data structures, meaning that as rows are added, deleted, or updated, the indexes also are updated to reflect the changes. All this updating requires extra I/O to update the index pages.
What happens when a new row is added to a table without a clustered index? The data is added at the end of the last data page. What happens when a new row is added to a clustered index? The data is inserted into the correct physical and logical order in the table and other rows may be moved up or down, depending on where the data is placed, causing additional disk I/O to maintain the index.
As the index pages and data pages grow, they may be required to split (a situation beyond the scope of this book), requiring slightly more disk I/O.
In general, you should not worry about the time required to maintain indexes during inserts, deletes, and updates. Be aware that extra time is required to update the indexes during data modification and that performance can become an issue if you over-index a table. On tables that are frequently modified, I try to restrict the tables to a clustered index and no more than three or four nonclustered indexes. Tables involved in heavy transaction processing should be restricted to from zero to two indexes. If you find the need to index beyond these numbers, run some benchmark tests to check for performance degradation.
You can create an index using the Enterprise Manager.
NOTE: You cannot create an index on the following data types:bit
text
image
Indexes cannot be created on a view.
Using the Enterprise Manager, select a database (the following example uses the pubs database). Then follow these steps to create an index:
Figure 21.8.
The Manage Indexes dialog box.
Figure 21.9.
Adding columns to an index with the Manage Indexes dialog box.
Figure 21.10.
The Index Build dialog box.
CAUTION: The table on which the index is being created is locked during index creation. Creating indexes on very large tables or creating clustered indexes (which may reorder the data pages) can take some time to complete. You cannot access the table until the index creation is complete. Try to create clustered indexes and very large nonclustered indexes during non-peak hours.
When an index is created, a row is placed in the sysindexes database system table.
TIP: Always build a clustered index before building a nonclustered index. When a clustered index is built, all nonclustered indexes currently on the table are rebuilt. You also can build an index on a temporary table.
The Transact SQL statement used to create an index is the CREATE INDEX command, which has the following syntax:
CREATE [UNIQUE] [CLUSTERED | NONCLUSTERED] INDEX index_name
ON [[database.]owner.]table_name (column_name [, column_name]...)
[WITH [PAD_INDEX,][[,]FILLFACTOR = x][[,] IGNORE_DUP_KEY][[,] {SORTED_DATA | SORTED_DATA_REORG}]
[[,] {IGNORE_DUP_ROW | ALLOW_DUP_ROW}]]
[ON segment_name]
NOTE: SQL Server 6.x provides another method to create indexes called constraints. A constraint is added to a table during the table creation and can be used to maintain referential integrity. The primary key constraint places a unique index, clustered or nonclustered, on the columns defined as the primary key. You cannot drop a constraint index with the Enterprise Manager (using the Manage Index dialog box). To remove a constraint, you must use the ALTER TABLE command. Here's the good news if you use SQL Server 6.5: You can use the DBCC DBREINDEX statement to dynamically rebuild indexes without having to drop and re-create the index or constraint.
The following sections examine in more detail some of the options and information displayed in the Manage Indexes dialog box and the CREATE INDEX command (see Figure 21.11). Start with the Index Attributes frame.
Figure 21.11.
The Manage Index dialog box.
The Index Attributes frame, shown in Figure 21.11, contains checkboxes that enable you to specify the type of index you want to create. These checkboxes relate to many of the options used in the CREATE INDEX command.
CAUTION: If you try to create a unique index on a column with duplicate data, the index creation will fail. You must remove the duplicate data entries to build the index.
NOTE: The Fill Factor value is not maintained by SQL Server after index creation; it is maintained only when the index is built.
The Table Size and Index Size frames supply useful information, such as the current size of the table and the amount of disk space used by the index. You can use the information in these two frames to help determine whether to use indexes on a table and if there may be performance problems with the indexes created.
The Table Size frame displays the number of rows and the amount of space occupied by the data. Remember: If the table contains a small number of rows, SQL Server does not use indexes and performs a table scan instead. There is no minimum or maximum number of rows before an index or table scan occurs; you have to test your queries and look at the query plan used by the optimizer.
For an index that already exists, the size of the index in disk space is displayed in the Actual Size field and in the Estimated Min./Avg./Max. Potential Size field.
The Estimated Min./Avg./Max. Rows Per Leaf:Node Page label is displayed
for nonclustered indexes; it provides the estimated minimum, average, and maximum
number of rows on the leaf pages and the nonleaf pages of the index B-Tree. For a
clustered index, the label reads Estimated Min./Avg./Max. Rows Per Index Page.
The leaf pages are excluded because they are the actual data pages. Estimated
Min./Avg./Max. Index B-Tree Levels displays the number of levels for the expected
index B-Tree.
TIP: The levels of the B-Tree can point to possible performance problems or poor index selection. A B-Tree with a large number of levels requires more time to find rows because each level adds more data pages that must be read to get to the leaf pages. You can reduce the number of B-Tree levels by reducing the width of the indexed columns, which increases the number of index keys per page.
Let's quickly review other index operations you can perform from the Manage Index dialog box shown in Figure 21.11.
DROP INDEX [owner.]table_name.index_name
[, [owner.]table_name.index_name...]
sp_rename objname, newname [, COLUMN | INDEX ]
Figure 21.12.
The Index Distribution Statistics dialog box.
Index Rating | # of Rows Returned | Comment |
Optimal | 1 row | Good candidate for primary key |
Very Good | > 1 row | < 0.5% of rows |
Good | > 0.5% of rows | < 1% of rows |
Fair | > 1.0% of rows | < 2.5% of rows (Check with query showplan) |
Poor | > 2.5% of rows | < 5% of rows (Do not index) |
Very Poor | > 5% of rows | No way! |
TIP: I'm really impressed with the information provided in the Index Distribution Statistics dialog box. It helps remove some of the guesswork about the data distribution of a table. For developers and DBAs, being able to see the distribution page information and the subjective index rating can help make a sound decision about whether a column is a good index candidate. Use the information in the Index Distribution Statistics dialog box, along with the Query Analyzer dialog box showplan option (described in the next chapter) to further tune your index selection.
Index selection is based on the design of the tables and the queries that are executed against the tables. Before you create indexes, make sure that the indexed columns are part of a query or are being placed on the table for other reasons, such as preventing duplicate data. The following sections suggest some indexing strategies.
The following list shows criteria you can use to help determine which columns will make good indexes:
The following list shows cases in which columns or indexes should not be used or should be used sparingly:
As you know, you can have only one clustered index per table. Following are some situations in which a clustered index works well:
Nonclustered indexes work well in the following situations:
You can determine whether a column is a good candidate for an index by doing some simple math and computing the selectivity of the column. First, you must determine the total number of rows in the table being indexed. You can obtain this information from the Table Size frame of the Manage Indexes dialog box or by using the following SQL command:
Select COUNT(*) FROM table_name
Then you must determine the number of unique values for the column you want to index. To determine this number, execute the following SQL command:
Select COUNT(DISTINCT column1_name) FROM table_name
To determine the number of expected rows returned by using the indexed column, calculate the following formula:
expected number of rows = (1/number of unique values) * Total number of rows in the table
If the expected number of rows is low compared to the total number of rows in the table, the column is a good candidate for an index. You can further validate this by computing a percentage, as follows:
Percentage of rows returned = (expected number of rows/total number of rows in the table) * 100
Compare this value to the values shown in Table 21.1.
Composite indexes are indexes created with two or more columns (the maximum number of columns for an index is 16 columns). SQL Server 6.x keeps distribution page information on all the columns that make up the composite index, but the histogram of data distribution used by the query optimizer is kept only on the first key, so the order of the keys does matter. Use the key with the most unique values as the first key (best selectability). Try not to get carried away by creating composite indexes with a large number of columns. (I try to keep them under four columns.) Too many columns affect performance and make the index key large, increasing the size of the index and requiring you to scan more data pages to read the index keys.
Index covering is a term used to explain a situation in which all the columns returned by a query and all the columns in the WHERE clause are the key columns in a single nonclustered index. SQL Server does not have to read the data pages to satisfy the query; instead, it returns the values on the leaf page of the index. In some cases, a covered nonclustered index can outperform a clustered index.
The downside to index covering is the added overhead to maintain the indexes.
Also, it is very difficult to create indexes to cover the many different queries
executed by your users. Avoid creating indexes to cover queries. You are better off
creating single-column or narrow composite indexes for the query optimizer to use.
"How to Start a Controversy," by Spenik and Sledge
I have been to some very good classes covering optimization and tuning using Microsoft SQL Server and Sybase SQL Server. Not to mention a lot of real-world experience in tuning applications. You know the story: optimizing a stored procedure that ran in 16 hours so that it ran in 20 minutes.... In the first edition of this book, I included some brief information about some rare cases in index covering you should experiment with.
The tech editors burned me up, complaining that the correct use of covered indexes required much more detail and information than I was providing in my brief description. So I decided to check the Microsoft documentation, which also did not favor the use of index covering. Out voted, two to one, I toned down the recommendation of experimenting with index covering in some critical cases. When the first edition of this book was published, everyone liked the book but pointed out that they do not agree with my stand on covered indexes. I even got e-mail from several respected SQL Server experts who commented on the index covering section.
Okay. In this edition, I'm adding my original recommendation that index covering in many cases can give you better performance than a clustered index when running a query. But in all honesty, do not go crazy and try to cover every query you execute. Trying to cover every query you execute is a ridiculous thought and will end up bringing your system performance to a crawl when performing updates, inserts, and deletes. Keep in mind that the optimizer uses only one index. In general, you should stick with single-column or narrow composite indexes for the query optimizer to use. But if you still need better performance and are not returning a lot of columns or have a large WHERE clause and the query is of a critical nature, experiment with index covering. I hope this time around I have covered (no pun intended) all the different viewpoints on this topic!
Following are some important points to remember about SQL Server indexes:
It is important that you understand the basic ideas behind SQL Server indexes.
In the next chapter, you build on the basic concepts of this chapter and learn about
the query optimizer. If you understand indexes and the query optimizer, you will
be able to provide valuable support to developers.
To order books from QUE, call us at 800-716-0044
or 317-361-5400.
For comments or technical support for our books and software, select Talk to Us.
© 1997, QUE Corporation, an imprint of Macmillan Publishing USA, a Simon and Schuster
Company.