Table of Contents

Chapter 2

Capacity and Growth Planning for a SQL Database

Certification Objectives
User Data
Calculating Size for Rows of Data
Calculating Size for Leaf and Non-leaf Index Pages
The Transaction Log
System Tables
From the Classroom
Free Disk Space
SQL Server Files
SQL Server Data Management
Placing Files and Filegroups for Optimal Performance
From the Classroom
Using NTFS File Compression
System Processor: CPU
Memory: RAM
Hardware Required
Memory Configuration
Disk Subsystem
Database Server Architecture
Application Design
Database Schema

Certification Objectives

One of the adjectives most frequently used to describe SQL Server 7.0 is scalable. Scalability allows companies to create applications that grow as the business or its data grows. At the same time, scalability makes it more difficult to identify the proper hardware, operating system, and configuration for your application. In this chapter we look at capacity planning issues ranging from estimating the size of your data to planning the hardware, memory, and network that support it.

Throughout this chapter we will provide you with the information necessary to determine where your needs fall in the range of performance available with SQL Server 7.0.

Capacity Planning for SQL Server

As we discuss scalability, we will focus primarily on storage capacity: the volume of data that must be maintained in your database during peak periods, which determines the amount of disk space that must be allocated, and where to locate that data for optimal performance. We will also discuss the hardware available for SQL Server and the impact it has on transactional and storage capacity.

When implementing a database on Microsoft SQL Server, it is important to estimate the database size and capacity accurately ahead of time, so that you avoid database overflow and database-down situations later. When estimating the size of your database, you must factor in several items.

The amount of data that will be put into the database can be calculated accurately based on your knowledge of the number of records you will store and the database schema. If you have a good understanding of the business activity that drives your database, you have the information you need to estimate database size. In prior versions of SQL Server, Microsoft provided the stored procedure sp_estspace, which estimates how much space is needed to store a given number of rows in a table. Because of the changes made to the underlying storage structures in SQL Server 7.0, sp_estspace has not yet been updated and is not available at this writing. Check the Microsoft website periodically, as the tool should be updated and posted as a free download.

To determine the total size of a user database, you must consider the following items: the user data (including its indexes), the transaction log, and the system tables.

User Data

The best way to calculate the amount of space that user data will use is to break it down by table. This simplifies the task of calculating the number of rows to expect, as the count varies from table to table. When calculating the amount of space to allocate to each table, you must consider both the data rows and the table's indexes.

Calculating Size for Rows of Data

The first step in calculating the space that a table will consume is to calculate the space that must be allocated for each row. The amount of overhead in tables containing variable-length fields is greater, so we will review two different sets of formulas. Let's take a look at the steps for calculating the space required.

  1. Calculate the row length in bytes The first step in calculating the space needed for a table is to calculate the number of bytes required by each row. To do this we must look at each column in the table. Using Table 2-1 and Table 2-2, we can calculate the space needed for each column. When using these tables, remember that any column that allows NULL values must be treated as variable length and have the associated overhead added. For columns that vary in length (such as varchar columns), estimate the average size (estimate high if in doubt). Once the bytes for each column are defined, add the associated overhead to determine the row length. Use one of the two following formulas.

     For tables with no variable-length columns:

     2 + (Sum of bytes in all fixed-length columns) = Data Row Size

     For tables with one or more variable-length columns:

     2 + (Sum of bytes in all columns) = Subtotal

     Subtotal + ((Subtotal / 256) + 1) + (Number of variable-length columns + 1) + 2 = Data Row Size

  2. Calculate the number of rows per page After determining the data row size, the next step is to determine the number of rows that fit on a single page. To do this, divide the number of bytes available on each data page by the number of bytes required for each row. Each 8K data page uses 132 bytes for overhead, so there are 8060 (8192 – 132) bytes available for data on each page. If the table has a clustered index, the fill factor setting also affects the amount of available space on each page. If you do not specify a fill factor, SQL Server leaves room for two rows; this is accounted for in the first formula below. If a fill factor is specified, multiply the 8060 bytes by the fill factor to determine the number of bytes available on each page, and do not subtract the two rows. Most likely this calculation will not produce a whole number. Best practice is to be conservative (the cost of excess space is far lower than the cost of running out of disk space), so round the division down when calculating the number of rows per page. Let's take a look at the formulas.

     With the default fill factor:

     (8060 / Data Row Size) - 2 = Number of Data Rows per Page

     With a specified fill factor:

     (8060 * Fill Factor) / Data Row Size = Number of Data Rows per Page

  3. Calculate the number of pages needed Now that you know how many data rows can be stored on a page, you can calculate the amount of space needed to store your projected number of records. To do this, divide the projected number of data rows by the number of rows per page calculated in step 2. For example, if we need to store 10,000 rows and can store 10 rows on each page, we need 1,000 8K pages for the data in this table. The formula is shown below; a worked example follows Table 2-2.

     Number of Rows in Table / Number of Data Rows per Page = Number of Data Pages

Data Type           Space Required
BINARY              As defined (0 to 8000 bytes)
BIT                 1 byte
CHAR                As defined (0 to 8000 bytes)
DATETIME            8 bytes
DECIMAL             2 – 17 bytes
DOUBLE PRECISION    8 bytes
FLOAT               8 bytes
IMAGE               16-byte pointer, plus a minimum of 2000 bytes for each initialized (non-null) column
INT                 4 bytes
MONEY               8 bytes
NUMERIC             2 – 17 bytes
REAL                4 bytes
SMALLDATETIME       4 bytes
SMALLINT            2 bytes
SMALLMONEY          4 bytes
TEXT                16-byte pointer, plus a minimum of 2000 bytes for each initialized (non-null) column
TIMESTAMP           8 bytes
TINYINT             1 byte
VARCHAR             Actual length of the data entered
VARBINARY           Length of the data entered plus 4 bytes

Table 2-1: Bytes for Each Data Type

To calculate the length of decimal and numeric data types, we must first know the precision defined for each column. Once that is known, use Table 2-2 to determine the number of bytes consumed. For example, a column defined as decimal(12,4) has a precision of 12 and therefore consumes 6 bytes.

Precision   Size
0 – 2       2 bytes
3 – 4       3 bytes
5 – 7       4 bytes
8 – 9       5 bytes
10 – 12     6 bytes
13 – 14     7 bytes
15 – 16     8 bytes
17 – 19     9 bytes
20 – 21     10 bytes
22 – 24     11 bytes
25 – 26     12 bytes
27 – 28     13 bytes
29 – 31     14 bytes
32 – 33     15 bytes
34 – 36     16 bytes
37 – 38     17 bytes

Table 2-2: Precision Impact on Length of Decimal or Numeric Datatypes
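To make the arithmetic in steps 1 through 3 concrete, here is a minimal Transact-SQL sketch for a hypothetical Orders table with an int, a char(10), a datetime, a money, and a varchar(100) column; the column list and the 40-byte average for the varchar column are assumptions chosen only for illustration.

    -- Row size, rows per page, and data pages for a hypothetical table
    -- with one variable-length column (estimated average of 40 bytes).
    DECLARE @FixedBytes int, @VarBytes int, @VarCols int
    DECLARE @Subtotal int, @RowSize int, @RowsPerPage int, @Pages int

    SELECT @FixedBytes = 4 + 10 + 8 + 8,    -- int + char(10) + datetime + money
           @VarBytes   = 40,                -- estimated average for varchar(100)
           @VarCols    = 1

    -- Formula for tables with one or more variable-length columns
    SELECT @Subtotal = 2 + @FixedBytes + @VarBytes
    SELECT @RowSize  = @Subtotal + ((@Subtotal / 256) + 1) + (@VarCols + 1) + 2

    -- Default fill factor: leave room for two rows per page
    SELECT @RowsPerPage = (8060 / @RowSize) - 2

    -- Round up so the page count is never underestimated
    SELECT @Pages = CEILING(500000.0 / @RowsPerPage)

    SELECT @RowSize AS DataRowSize, @RowsPerPage AS RowsPerPage, @Pages AS DataPages

For this hypothetical table the batch returns a 77-byte row, 102 rows per page, and 4,902 data pages for 500,000 rows.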

Calculating Size for Leaf and Non-leaf Index Pages

Now that we have calculated the space needed to store the data, we must calculate the space required for the related indexes so that we know the total space needed for our desired configuration. Because the leaf pages of a clustered index are the data pages themselves, we perform this first calculation only for nonclustered indexes. Let's take a closer look at the steps necessary to calculate the space we must allocate for the leaf pages of a nonclustered index.

  1. Calculate the Size of the Leaf Index Row When calculating the space needed for nonclustered index rows, the first step is to determine the size of the leaf index rows. To do this, determine whether the keys are based on fixed-length or variable-length columns. If all of the keys are fixed length, use the first formula; if any key is based on a variable-length column, use the second.

     For fixed-length keys only:

     7 + (Sum of length of fixed-length keys) = Size of leaf index row

     For indexes with variable-length keys:

     9 + (Sum of length of fixed-length keys) + (Sum of length of variable-length keys) + (Number of variable-length keys) + 1 = Subtotal

     Subtotal + (Subtotal / 256) + 1 = Size of leaf index row

  2. Calculate the Number of Leaf Rows per Page Once we know the length of the leaf index rows, we can determine the number of rows per page using the following formula:

     8060 / (Size of leaf index row) = Number of leaf rows per page

  3. Calculate the Number of Leaf Pages in the Index Now that we know the number of leaf rows that fit on each page, we can calculate the number of pages needed to hold the leaf level of the index for the number of rows in the table (a worked sketch follows the note below). Use the following formula:

     (Number of rows in table) / (Number of leaf rows per page) + 1 = Number of leaf pages

Note: We add 1 in the leaf page calculation to account for the distribution page that is built for each index. In prior versions of SQL Server, the distribution page was created (and updated) when you ran the UPDATE STATISTICS command. In version 7.0, the distribution page is created and updated automatically.
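Continuing the same hypothetical example, the following sketch applies the three leaf-level formulas to a nonclustered index on a single fixed-length 8-byte key (for instance, a datetime column); the key choice and the 500,000-row count are assumptions carried over from the earlier sketch.

    -- Leaf-level size for a hypothetical nonclustered index on one
    -- fixed-length 8-byte key, for a 500,000-row table.
    DECLARE @LeafRowSize int, @LeafRowsPerPage int, @LeafPages int

    SELECT @LeafRowSize     = 7 + 8                    -- fixed-length keys only
    SELECT @LeafRowsPerPage = 8060 / @LeafRowSize
    SELECT @LeafPages       = CEILING(500000.0 / @LeafRowsPerPage) + 1

    SELECT @LeafRowSize     AS LeafRowSize,
           @LeafRowsPerPage AS LeafRowsPerPage,
           @LeafPages       AS LeafPages

Here the leaf index row is 15 bytes, 537 leaf rows fit on a page, and the leaf level needs 933 pages (including the extra page added by the formula).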

We have now calculated the space needed to store the data and the leaf pages for all indexes. The next step is to calculate the space required for the non-leaf pages of all indexes. Let's take a closer look at the necessary steps.

  1. Calculate the Size of the Non-leaf Rows The formula for calculating the length of non-leaf rows is:

     (Size of leaf index row) + 4 = Size of non-leaf row

  2. Calculate the Number of Non-leaf Index Rows per Page Once we know the size of the non-leaf rows, we can use the following formula to determine how many fit on a page:

     (8060 / Size of non-leaf row) – 2 = Number of non-leaf index rows per page

  3. Calculate the Number of Non-leaf Index Pages at Level 1 Now that we know the number of non-leaf index rows per page, we can determine how many pages the first non-leaf level requires:

     (Number of leaf pages / Number of non-leaf index rows per page) = Number of index pages at Level 1

  4. Calculate the Number of Non-leaf Index Pages at Each Remaining Level As long as the result is greater than 1, repeat the following formula, using each quotient as the dividend for the next level. When the quotient equals 1, you have reached the root level of the index.

     (Number of index pages at last level / Number of non-leaf index rows per page) = Number of index pages at next level

  5. Calculate the Total Number of Non-leaf Index Pages We have now calculated the number of pages used at each level. To calculate the total number of non-leaf index pages, use the following formula (a sketch of the full level-by-level calculation follows):

     (Number of index pages at Level 1) + (Number of index pages at each remaining level) + ... = Total number of non-leaf index pages
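The level-by-level repetition in steps 4 and 5 lends itself to a small loop. The sketch below carries over the assumptions from the previous example (a 15-byte leaf index row and 933 leaf pages) and sums the non-leaf pages until it reaches the root.

    -- Non-leaf pages for the hypothetical index: 15-byte leaf rows,
    -- 933 leaf pages (values assumed from the earlier sketch).
    DECLARE @NonLeafRowSize int, @RowsPerPage int
    DECLARE @PagesAtLevel int, @TotalNonLeaf int

    SELECT @NonLeafRowSize = 15 + 4                      -- leaf row size + 4
    SELECT @RowsPerPage    = (8060 / @NonLeafRowSize) - 2
    SELECT @PagesAtLevel   = CEILING(933.0 / @RowsPerPage), @TotalNonLeaf = 0

    -- Keep dividing until a level fits on a single page (the root)
    WHILE @PagesAtLevel > 1
    BEGIN
        SELECT @TotalNonLeaf = @TotalNonLeaf + @PagesAtLevel
        SELECT @PagesAtLevel = CEILING(@PagesAtLevel * 1.0 / @RowsPerPage)
    END
    SELECT @TotalNonLeaf = @TotalNonLeaf + 1             -- add the root page

    SELECT @TotalNonLeaf AS TotalNonLeafPages

With these numbers the index needs 3 pages at Level 1 plus the root page, for a total of 4 non-leaf pages.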

The Transaction Log

Transaction logs are a critical part of successfully implementing a SQL Server user database. The transaction log is a serial record of all modifications that have occurred in the database: it records the start of each transaction, the changes to the data, and enough information to undo the modifications. The transaction log contributes to the high performance and recoverability of the database, but it can only perform its function if adequate space is allocated. There are no hard-and-fast rules for determining the proper size of a transaction log; it depends on three main factors, and the size of the log is directly proportional to all three.
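Although the sizing itself is done up front, it is worth checking in production how much of the allocated log is actually being used. One simple way to do this is the DBCC SQLPERF(LOGSPACE) command, which reports the log size and the percentage in use for every database on the server.

    -- Report transaction log size and percent used for each database.
    DBCC SQLPERF(LOGSPACE)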

System Tables

Every user database contains a default set of system tables that are created when the database is created. Initially these tables consume 0.5MB of disk space. As objects are added to the database, more rows are written to the system tables. Even so, their overall size rarely exceeds 2MB, which makes them insignificant in most calculations.

Exercise 2-1: Sizing a Database

You are the database administrator for Northwind Inc., and your boss has just informed you that the company has bought out its main competitor. Northwind will triple its staff and anticipates a 500% increase in sales. As a result, Northwind has decided to migrate its business systems from Microsoft Access to Microsoft SQL Server. Your boss is purchasing a new database server and needs you to tell her how much disk space your database will need. For this exercise we are going to estimate the size of the invoices table and its related indexes.

We will build the prototype database to match the Northwind database that ships with SQL Server 7.0. Business projections indicate that we may need to store 500,000 records in this database. How much disk space will we need? Here are some of the questions that you will need to answer.

  1. What is the length of each row? __________________________
  2. How many rows fit on an 8K page? _______________________
  3. How many pages are needed for 500,000 rows? ______________
  4. How many indexes do we have? __________________________
  5. How large are the non-leaf rows? _________________________
  6. How many non-leaf index rows fit on an 8K page? ____________
  7. How many pages at Level 1? _____________________________
  8. How many levels will the index have? _____________________
  9. How many pages are needed for the index? _________________
  10. How many total pages are needed? ________________________

Note: If this calculation proves too difficult, you can always add 500,000 rows to the table and execute sp_spaceused. Running this stored procedure with the default rows loaded returns the information seen in Figure 2-1.



Figure 2-1: Determining How Much Space Is Currently Being Used
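For reference, the call itself is a one-liner. The example below points sp_spaceused at the Northwind Orders table purely for illustration; substitute whichever table you loaded with the 500,000 test rows.

    -- Reports rows, reserved space, data, index size, and unused space
    -- for a single table.
    USE Northwind
    GO
    EXEC sp_spaceused 'Orders'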

From the Classroom

Free Disk Space

A new feature of SQL Server 7.0 is the ability to grow database files beyond their initial size. If you use this feature, SQL Server can expand its data files until your hard drives are full. Although the feature is very useful, if growth goes unchecked the drives will eventually fill and SQL Server will stop working.

To avoid this problem it is very important to monitor your free disk space on a regular basis. You can use Windows NT Performance Monitor or Event Viewer, or configure the Alerter Service to provide real-time monitoring.

The Alerter Service (see Illustration 1) is a simple system that will deliver important system messages to a designated computer or user. Please note that the Alerter Service is not a "store and forward" system like email; if you are not logged on you will not receive your notification.


Illustration 1: Configuring the Alerter Service

—David Smith, MCSE + Internet

Planning the Physical Placement Of Files

Microsoft SQL Server 7.0 maps a database over a set of operating system files. Data and log information are never mixed on the same file, and individual files are used by one database only. This eliminates the need for managing logical devices as required in earlier versions of SQL Server.

Exam Watch: In prior versions of SQL Server, it was critical to assign the transaction log to a separate operating system file in order to back it up separately. In version 7.0, SQL Server always places the log in its own operating system file(s).

SQL Server Files

In SQL Server 7.0, databases may be built on three types of files: primary data files (.mdf), secondary data files (.ndf), and transaction log files (.ldf).

SQL Server Data Management

Unlike in prior versions, SQL Server 7.0 data and log files can grow automatically beyond their originally specified size. The growth increment is defined when the file is created; each time the file fills, it grows by that increment. If there are multiple files in a filegroup, none of them grows automatically until all of the files are full; growth then occurs in a round-robin fashion.

Each file can also have a maximum size specified. If a maximum size is not specified, the file can continue to grow until it has used all available space on the disk. This feature is especially useful when SQL Server is embedded in an application for mobile users, where the user does not have access to administration tools; the files simply grow automatically as needed.
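As a sketch of how the growth settings are declared (the database name, file paths, and size values below are illustrative assumptions, not recommendations), both the growth increment and the optional maximum size are specified when each file is defined:

    CREATE DATABASE Sales
    ON
      ( NAME = Sales_dat,
        FILENAME = 'c:\mssql7\data\sales_dat.mdf',
        SIZE = 100MB,
        MAXSIZE = 500MB,        -- cap growth so the disk cannot fill completely
        FILEGROWTH = 25MB )     -- grow 25MB at a time when the file fills
    LOG ON
      ( NAME = Sales_log,
        FILENAME = 'd:\mssql7\log\sales_log.ldf',
        SIZE = 25MB,
        MAXSIZE = 100MB,
        FILEGROWTH = 5MB )

Omitting MAXSIZE allows a file to grow until the disk is full, which is exactly the situation described in the Free Disk Space sidebar, so setting a ceiling is usually the safer choice.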

Microsoft made additional modifications to how SQL Server data is stored on disk. Most notably, the page size has grown from 2K to 8K and the extent size from 16K to 64K, and the page and row formats have been reworked.

These modifications to the SQL Server file formats allow the server to scale from low-end to high-end systems, improving both performance and manageability. The new page and row formats support row-level locking and improve performance when large blocks of data are accessed, because each I/O operation retrieves more data.

Planning the Use of Filegroups

We have already explained that a database consists of one or more data files and one or more log files. Microsoft has implemented a concept known as filegroups to allow the data files to be grouped together. Tables and indexes can then be mapped to different filegroups to control data placement on physical disks.

Filegroups work well for advanced users who know where they want to place indexes and tables. If the administrator does not specify filegroups, all files are placed in the default filegroup, and SQL Server 7.0 allocates data within the database automatically.
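The following sketch shows the basic mechanics (the database, filegroup, file, and table names, as well as the drive letter, are hypothetical): create a filegroup, add a file on another physical disk to it, and then place a table on that filegroup.

    ALTER DATABASE Sales ADD FILEGROUP OrderData
    GO
    ALTER DATABASE Sales
    ADD FILE
      ( NAME = Orders_dat1,
        FILENAME = 'e:\mssql7\data\orders_dat1.ndf',
        SIZE = 200MB )
    TO FILEGROUP OrderData
    GO
    -- The table's pages will now be allocated from the files in OrderData.
    CREATE TABLE Orders
      ( OrderID   int      NOT NULL,
        OrderDate datetime NOT NULL )
    ON OrderData
    GO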

Using files and filegroups improves database performance by allowing a database to be created across multiple disks, multiple disk controllers, or RAID systems. For example, if your computer has four disks, you can create a database that comprises three data files and one log file, with one file on each disk. As data is accessed, four read/write heads can access the data in parallel, which speeds up database operations. Files and filegroups can offer significant improvements in the performance and recoverability of your SQL Server databases. The only stipulation is that they must comply with the following rules:

Now that we have seen the rules that govern how files and filegroups can be implemented, let's take a look at the best practices for managing files and filegroups:

Placing Files and Filegroups for Optimal Performance

Now that we understand that files and filegroups give us the flexibility to place database objects, it is important that we understand how file placement impacts database performance. Let’s take a look at the trends that will impact your decisions.

Given these facts, it should make sense that optimal performance is obtained by placing the transaction log on an isolated drive and distributing the remaining data objects across multiple drives. For fault tolerance, transaction logs function best using disk mirroring, while RAID 5 is best for the remainder of the database.

From the Classroom

Using NTFS File Compression

Since SQL Server databases are just files, they can be stored on compressed NTFS partitions. While this could save you some disk space, it would significantly reduce SQL Server performance. Using NTFS compression for SQL Server databases and log files is not recommended.

—David Smith, MCSE + Internet

Planning for Growth Over Time

As a database administrator, it is your responsibility to create an environment that is reliable and responsive. This is fairly easy to do in an environment with a clearly defined load, but it becomes much more challenging as time passes and user demand and data volume change. To be successful, you must plan for these changes when setting up and configuring your environment. When planning for data growth, you need to focus on two primary questions: how much data will be added to the database, and how long will that data be retained?

When projecting how much the database will grow over a specific period of time, it is best to obtain projections of the percentage growth the enterprise expects. All peak periods of growth need to be identified as well; for example, a database implemented for the retail industry must account for the busy Christmas season.

The second important factor is how long a given set of data will reside in the database before it is archived. For example, the business unit may have decided to keep the sales history for the past two years. It is important to get a clear definition of this period: if the database must always contain the two prior calendar years, then toward the end of the current year it will actually be storing nearly three years of sales history. These factors must be taken into account when estimating the capacity and configuration of the database. If you are creating a new system and not loading historical data, you will not face today many of the performance issues that will arrive down the road.

Planning the Physical Hardware System

Hardware planning as it relates to SQL Server 7.0 is primarily concerned with system processors, memory, disk subsystem, and the network. These four areas comprise the bulk of all relevant hardware platforms on which Windows NT and SQL Server operate.

System Processor: CPU

In determining the best CPU architecture for your particular needs, you must estimate the level of CPU-intensive work that will occur on the hardware platform. With SQL Server, CPU-intensive work may occur when a large cache is available and is being used optimally, or when a small cache is available and there is a great deal of disk I/O activity. To assess this properly, you need to answer the following questions:

 

The answers to these questions will provide the information you need to make your decision. You have two processor families to select from, Intel or Alpha, and this will be your first choice. Unless you have very high processor requirements or existing Alpha servers in your environment, Intel is typically the safer decision. From there, the rule of thumb is to purchase the most powerful CPU available for the server you select, and to purchase a server that will accept additional processors. The demand on most database servers increases over time, so it is more cost-effective to add processors to an existing server than to buy a new one.

Memory: RAM

Determining the optimal memory configuration for a SQL Server solution is crucial to achieving the desired performance. SQL Server uses memory for its procedure cache, data and index page caching, static server overhead, and configurable overhead. Making SQL Server perform properly requires the right hardware as well as the right configuration to use those resources. The number one limitation on SQL Server performance is disk I/O, so having sufficient memory to maximize the benefit of data caching and minimize memory swapping pays large dividends.

Hardware Required

SQL Server requires a minimum of 32MB of RAM. This configuration allows SQL Server to function, but it will not perform well under a production load; Microsoft recommends a minimum of 64MB of RAM for production servers. In addition to this baseline, you should factor in the following configuration settings, which may increase your memory requirements.

Memory Configuration

With the development of SQL Server 7.0, Microsoft was challenged to make improvements on both ends of the scale. To be installable on Windows 9x machines, SQL Server had to run in environments with minimal memory; to support data warehousing for large enterprises, it had to manage large volumes of memory effectively. Microsoft was successful: SQL Server can run with as little as 32MB of RAM and can effectively manage as much as 3GB of memory.

In addition to this range of memory, Microsoft has added dynamic memory management between Windows NT and SQL Server. When SQL Server is using memory dynamically, it queries the system periodically to determine the amount of free physical memory. SQL Server grows or shrinks the buffer cache to keep free physical memory at about 5MB (plus or minus 200K) to prevent Windows NT from paging. If less than 5MB is free, SQL Server releases memory to Windows NT, where it usually goes on the free list; if more than 5MB of physical memory is free, SQL Server recommits memory to the buffer cache. SQL Server adds memory to the buffer cache only when its workload requires it; a server at rest does not grow its buffer cache. This design eliminates the problems prior versions experienced when too much memory was allocated to SQL Server and Windows NT began paging excessively.
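Dynamic memory management is the default and usually needs no intervention. If you do need to constrain SQL Server, for example on a server shared with other applications, the min and max server memory options can be set with sp_configure; the 1024MB ceiling below is only an example value, and the exact option names can be confirmed in Books Online for your release.

    -- min/max server memory are advanced options, so expose them first.
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE
    -- Cap the buffer cache at roughly 1GB (example value only).
    EXEC sp_configure 'max server memory', 1024
    RECONFIGURE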

Disk Subsystem

Achieving optimal disk I/O is the most important aspect of designing an optimal SQL Server solution. The disk subsystem, as addressed here, consists of at least one disk controller and one or more hard disk units, along with the disk configuration and associated file systems. The goal when purchasing your disk subsystem is to select a combination of components and technologies that complement the performance characteristics of SQL Server. The disk subsystem components and features you should look for are as follows:

Given these guidelines, decisions about how many drives to buy, their size, their configuration, and the level of fault tolerance are all made by looking at the storage and transactional capacity requirements of your environment. A good method for determining the number of disk units required is to multiply the number of I/Os per application transaction by the number of application transactions per second; this gives the total number of application-generated I/Os per second. Dividing this number by the average sustainable I/Os per second of the physical disk units you will use gives the optimal number of disk units for the solution. Don't forget to add devices as necessary to account for your selected fault tolerance.
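As a quick worked example with assumed numbers (6 I/Os per transaction, 50 transactions per second, and disks that sustain roughly 100 I/Os per second each), the arithmetic looks like this:

    -- Disks needed = (I/Os per transaction * transactions per second)
    --                / sustainable I/Os per second per disk, rounded up.
    DECLARE @IOsPerTran int, @TranPerSec int, @IOsPerDisk int
    SELECT @IOsPerTran = 6, @TranPerSec = 50, @IOsPerDisk = 100

    SELECT CEILING(1.0 * @IOsPerTran * @TranPerSec / @IOsPerDisk) AS DisksNeeded

Here the workload generates 300 I/Os per second, so three disk units are needed before adding any drives for fault tolerance.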

Assessing the Communication Requirements

As you work to determine how much network bandwidth is necessary to provide your users with the desired response time, it is important to consider the network traffic in both directions: your application passes a query to the database, and the database returns data to your application, where it is displayed. A number of items can impact network bandwidth utilization. We are going to take a closer look at the following items, each of which has a major impact: database server architecture, application design, and database schema.

Database Server Architecture

One of the significant differences between Microsoft Access and Microsoft SQL Server lies in how they resolve queries, and in the impact this has on network bandwidth. When retrieving 1 row from a table of 1 million rows, querying a flat file database such as Access consumes significantly more network bandwidth than performing the same query against a SQL Server database. The reason for this is that a flat file database server passes the entire table to the client and allows the client to filter for the row desired. In contrast, the SQL Server database engine filters the data and passes the single row across the network. Given a row length of 100 bytes, this can be the difference between passing 100 bytes or 100 megabytes across the network. This difference becomes very significant when applications are run across low speed network connections.

Application Design

When designing your application, you can use stored procedures to significantly reduce the network requirements. A long, complicated query that is repeated frequently can be run as a stored procedure, which reduces the traffic from client to server because only the procedure name and its parameters cross the network. In addition, multi-step queries that perform additional filtering or processing based on the results of initial queries can run much more efficiently as stored procedures: it is not necessary to pass the results of the initial query back to the client just so that a second query can be sent to the server.
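A minimal sketch of the idea follows; the procedure name, the Orders table, and its columns are hypothetical and stand in for whatever query your application repeats.

    -- Filtering happens on the server; only the final rows cross the network.
    CREATE PROCEDURE usp_GetCustomerOrders
        @CustomerID int,
        @StartDate  datetime
    AS
    SELECT OrderID, OrderDate, TotalAmount
    FROM   Orders
    WHERE  CustomerID = @CustomerID
      AND  OrderDate >= @StartDate
    GO

    -- The client sends only the procedure name and two parameter values.
    EXEC usp_GetCustomerOrders @CustomerID = 42, @StartDate = '1999-01-01'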

Database Schema

For years, computer engineers have used a variety of techniques to minimize the size of data stored in databases; the use of codes to reduce the amount of data stored is not new. As the price of storage continues to drop, this practice is used less often to save disk space, but it can also reduce network utilization by reducing the amount of data passed from the database server to the client. While this may sound insignificant, saving 25–50 bytes per row in result sets of 1,000 rows can have a noticeable impact on your network.
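As a hypothetical illustration of the technique, a one-character status code backed by a small lookup table replaces a 30-character description that would otherwise be repeated in every row sent across the network:

    CREATE TABLE OrderStatus
      ( StatusCode        char(1)     NOT NULL PRIMARY KEY,
        StatusDescription varchar(30) NOT NULL )

    CREATE TABLE CustomerOrders
      ( OrderID    int     NOT NULL,
        StatusCode char(1) NOT NULL
                   REFERENCES OrderStatus (StatusCode) )  -- 1 byte per row instead of up to 30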

Certification Summary

In this chapter we showed you the proper method for estimating the space required to store your data. While this sounds like an easy task, we showed how to calculate the space for data rows, index rows, and overhead. We also discussed the space required for your transaction log, and how business trends and backup schedules combine to determine how much space is required at any point in time. We then discussed how data retention periods and business trends drive the projected growth of your database.

Once you have outlined your disk space requirements, it is important to determine how to distribute the data and transaction logs across the drives on your server to provide the best combination of performance and fault tolerance. Combining this optimal software configuration with the right hardware to minimize disk I/O and maximize processor performance will greatly increase your odds of success.

In assessing the communication requirements for your SQL Server implementation, consider the items that have a major impact on the network bandwidth needed to provide your users with the desired response time. These include database server architecture, application design, and database schema. You can employ proven techniques not only to reduce storage requirements, but also to reduce network utilization.

Two-Minute Drill