15.10 How do I…Measure and improve checkpoint performance?
Problem
I want to review my database’s checkpoint performance. If it checkpoints too frequently, then performance will degrade, but if it does not checkpoint enough then recovery, if necessary, could be particularly time consuming. How do I analyze checkpoint performance and improve it if possible?
Technique
When a database checkpoints, the Log Writer (LGWR) and Database Writer (DBWR) background processes write all the dirty buffers in the database buffer cache and the redo log buffer to disk, respectively and in this order. Checkpoints occur in a variety of situations, but regardless of the values of all other database parameters, Oracle initiates a checkpoint at every log switch. Recall that LGWR writes to redo log files in a circular fashion, proceeding automatically to the next file (this is the log switch) when the current file is full.
The database can encounter two problems during a checkpoint. The first occurs when Oracle initiates a second checkpoint before a prior checkpoint is complete. This results in some unnecessary I/O but is not particularly detrimental to performance. The second problem is a bit more complex and requires some understanding of the redo log structure. Oracle cannot let LGWR overwrite any redo entries that might be necessary for instance recovery. Oracle cannot be sure that redo entries are unnecessary until the DBWR process has written the corresponding data from the database buffer cache to disk. So, if, during a checkpoint, LGWR tries to overwrite a redo log file before the DBWR process has written the corresponding database buffer cache buffers to disk, then Oracle forces LGWR to wait. If this happens often, then performance can degrade.
The UTLESTAT report helps you determine if your database is initiating a second checkpoint before Oracle has completed a prior one (the first problem). If this is a consistent problem, then the solution is to increase the size of the redo log files. The alert log indicates if LGWR is waiting for checkpoint completion during log switches (the second problem). As for problem 1, the solution here is also to increase redo log file size, but another possibility is to add more redo log groups.
Steps
1. Review the section of the UTLESTAT report indicated in Listing 15.12.
Listing 15.12 An excerpt from the UTLESTAT report summarizing checkpoint performance
Statistic Total Per Transact Per Logon Per Second ------------ ------------ ------------ ------------ ------------ DBWR timeouts 8 8 4 .2 SQL*Net roundtrips to/from 28 28 14 .7 background checkpoints comp 2 4 2 .1 background checkpoints star 4 4 2 .1 background timeouts 38 38 19 .95 If the values in the total column for background checkpoints comp and background checkpoints star differ by more than one as they do in Listing 15.12, then Oracle does not have enough time between log switches to complete checkpoints. The only solution is to use larger redo log files.
2. Start a SQL*Plus session and connect as the WAITE user. Use the START command to load and execute the script CHP15_17.SQL. The script and its output appear in Figure 15.13.
This script is a subtle variation of the CHP1_5.SQL script from Chapter 1. The only difference is in line 3 where CHP15_17.SQL retrieves the size of each redo log member. This script reports that there are three redo log groups and that the size of each redo log member file is 200K.
3. The only way to change the size of redo log groups is to add new groups with redo log members of the desired size. Use the START command to load and execute the script CHP15_18.SQL. The script and its output appear in Figure 15.14.
CHP15_18.SQL adds a fourth multiplexed redo log group where each redo log member is of size 500K. Modify the script to create as many new log groups as you need.
Note - Script CHP15_18 is operating system-specific because it contains physical file names. Modify it to match your database’s naming conventions and your operating system’s physical file names before using it.
4. Use the START command to load and execute the script CHP15_19.SQL to drop the log file groups whose members were too small. The script and its output appear in Figure 15.15.
5. Oracle records important database events, like database startups and shutdowns, in the alert log. The initialization parameter BACKGROUND_DUMP_DEST points to the location of the alert log. The alert log is a text file and its name usually contains the word alert. Review the alert log for your database to determine if LGWR is waiting for checkpoints to complete. Any messages resembling the one in Listing 15.13 indicate that LGWR is waiting for checkpoint completion.
Listing 15.13 An excerpt from the alert log indicating that LGWR is waiting for a checkpoint to complete
Thread 1 cannot allocate new log, sequence 9905
Checkpoint not complete
6. If this message appears often in the alert log, then LGWR waits are probably adversely affecting database performance. The solution is to add more redo log groups or to increase the size of your redo log members. In either case, modify the scripts CHP15_18.SQL and CHP15_19.SQL to accomplish either of these tasks.
How It Works
Step 1 shows an excerpt from the UTLESTAT report indicating that Oracle does not have enough time between log switches to complete checkpoint processing. Step 2 highlights a script to report on the current redo log configuration. Steps 3 and 4 show how to add new redo log groups and drop existing groups respectively. Step 5 shows how the alert log can indicate contention between the checkpoint processing and the LGWR process.
Comments
If, in Step 4, you receive a message like this one:
ORA-01623: log 3 is current log for thread 1 - cannot drop
then the group you tried to drop is the current redo log group. Prior to re-attempting Step 4, force a log switch with this command:
alter system switch logfile;
The alert log is noteworthy. You should review it on a regular basis as part of proactive database monitoring. Oracle writes to this text file on a cumulative basis so it can grow very large. You may want to delete it or archive it to facilitate review.
Two initialization parameters affect checkpoint processing. The first is LOG_CHECKPOINT_TIMEOUT, which specifies the number of seconds between each checkpoint. Oracle ignores this parameter if its value is 0, the default. If this parameter has a non-zero value, then Oracle checkpoints at log switches and every X seconds, where X is the value of LOG_ CHECKPOINT_TIMEOUT.
The second initialization parameter is LOG_CHECKPOINT_INTERVAL, which specifies the number of operating system blocks that Oracle writes to the redo log between each checkpoint. If you only want Oracle to checkpoint at log switches, make sure that the value of this parameter exceeds the size of the redo log members. Determining the operating system block size may not be straightforward; under NT or Windows 95, use the CHKDSK command on the drive where the redo log files reside. CHKDSK will report that some number of bytes exist in each allocation unit; this is the operating system block size. The default value of LOG_CHECKPOINT_INTERVAL under Windows NT is 10,000 operating system blocks.
The LOG_CHECKPOINT_INTERVAL and LOG_CHECKPOINT_TIMEOUT initialization parameters are useful if your installation features large redo log files and you want checkpoints to occur more frequently than at every log switch.