Practical Disaster Recovery, Part 2

Andrew Zanevsky

Last month Andrew shared his experiences preparing for disaster recovery, human aspects of SQL Server crisis situations, and problems dealing with vendors' technical support. In this sequel, he compares different backup and recovery methods and offers his view on disaster situation compromises. Included on this month's Developer's Disk is Andrew's stored procedure sp_bcp, which you can use to generate command files to export and import data with bcp.

ONCE you implement a good database backup and support routine, complete with automated periodic SQL Server health checkups, it's natural to want to test your handiwork. What you need is a corrupt database to make sure that the scheduled DBCC routine finds and reports corruption as you expected. But it seems that databases corrupt when they feel like it and not when you order them to. Here's how you can "fake" a database corruption:
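
The gist is deliberate system-table vandalism in a throwaway database. Here's a minimal sketch of one way to do it; the database crash_test and table scratch are hypothetical names, the exact errors DBCC reports will vary, and you should run this only where you can afford to lose everything:

  USE crash_test
  GO
  EXEC sp_configure 'allow updates', 1
  RECONFIGURE WITH OVERRIDE
  GO
  -- Point the scratch table's first-page pointer at a bogus page number.
  UPDATE sysindexes
  SET first = 99999
  WHERE id = OBJECT_ID('scratch')
  AND indid IN (0, 1)
  GO
  EXEC sp_configure 'allow updates', 0
  RECONFIGURE WITH OVERRIDE
  GO
  DBCC CHECKTABLE(scratch)   -- should now report consistency errors
  GO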

I had reservations about including this information since it can obviously be abused. But I do it in the spirit of providing legitimate DBAs the means to simulate a true disaster and make it relatively easy for you to practice error detection and recovery. Nevertheless, anyone with sa authority could use the procedures to corrupt any database, not just a test database. Handle with care.

Backup and recovery techniques

There are several different techniques you can use to back up and recover your SQL Server databases.

Database dumps

Here are some observations on performing database dumps:

SQL Server's DUMP DATABASE is smart enough to allow transactions to run in the database while it's being backed up. If a transaction attempts to modify certain pages that haven't been backed up, SQL Server will back up these pages first before allowing the transaction to continue. Dump files include only used database pages, so don't be surprised if you get a dump significantly smaller than the declared database size.

Loading a database from a dump generally takes longer than making the backup did, since SQL Server still has to format the unused portion of the loaded database.
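
For reference, here's what a basic dump and reload looks like in 6.0, which lets you dump to a temporary disk device (the database name and path are hypothetical; under 4.2x you'd typically dump to a predefined dump device instead):

  DUMP DATABASE proddb TO DISK = 'C:\MSSQL\BACKUP\PRODDB.DMP'
  GO
  -- The target database must already exist and be at least as large:
  LOAD DATABASE proddb FROM DISK = 'C:\MSSQL\BACKUP\PRODDB.DMP'
  GO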

Sometimes you can fix corruption by dumping and reloading a database. This is the case when the corruption exists on unused pages, because those aren't usually backed up. However, if used pages of the database had been corrupted before the dump was made, then reloading restores the corruption as well. By the way, a page erroneously marked as used will be backed up even when no object claims it. That explains why some error messages don't go away when you dump and reload a database.

Copy data files

Simple file copying is an unconventional technique that can be very effective in certain scenarios. Data files that correspond to SQL Server devices may be copied like any other operating system file. The only problem is that the SQL Server service must be stopped during the operation. Imagine that you have two databases of the same size with identical device and segment structures on different servers, and you want to copy the contents of one into the other. The source server may be stopped for an extended period of time, but you want to minimize downtime on the target one. Provided that you have enough room on disk, you can proceed roughly as follows:

1. Stop the SQL Server service on the source server and copy all of the database's device files to the target machine under temporary names, leaving the target server running the whole time.

2. Once the copy completes, stop the SQL Server service on the target server.

3. Rename the target's existing device files aside, and rename the copied files into their place.

4. Restart the target SQL Server service.

In this scenario, the target server is down for only several minutes, and it comes back with a new database, no matter how large that database may be. I want to label this technique "Handle with care," since it's crucially important that you copy all data files that contain the database, including transaction log devices. SQL Server devices (files) must not be shared with other databases that aren't being copied. If you miss a device, you may end up rebuilding the database from scratch. Make sure that you understand the mapping of databases to devices to files, and the structure and contents of such system tables as sysdevices, syssegments, sysusages, and sysdatabases. It may be wise not to delete the data files you're replacing on the target server, but to rename them and keep them handy until you verify that the operation worked as expected. That leaves you the option of renaming them back.

Naturally, with the file-level backup approach, any corruption that existed in the source database is replicated, regardless of whether it sat on a used or an unused page.

This technique can also be used when you have corruption and want to restore the database from a backup, but also want to keep the corrupt database so you can come back later and try to recover some lost data. In that case, save the database's data files rather than attempting to dump it; a dump may fail on a corrupt database.

This method may also save you many hours of downtime in non-disaster situations. One such task is periodically reloading data into a large decision-support database through bcp. Instead of shutting down operations on the database for the duration of the load, consider creating another database of the same size, with data files sitting in the same directories as the primary database's files. Users can continue working with the primary database while you load flat files into its secondary twin. Once bcp is done, you shut down the server, rename the data files, and restart the service. In a matter of seconds, the freshly loaded database becomes the primary one, and the old one can be used for the next data load.

Data file backup generally proceeds faster than a SQL Server dump unless there's a lot of unused space in the database. In that case, a dump is faster since it only backs up used pages. Restoring a data file is always faster than loading from a dump.

Warning

Never restart the SQL Server service while a data file is being copied. If you do, all databases with segments on the source device will be marked as "suspect." To fix the problem, you'll need to reset the status column in sysdatabases and restart SQL Server.
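
Resetting the suspect flag means direct surgery on master..sysdatabases. A minimal sketch, assuming the suspect bit is 256 as in the 6.x status column and using a hypothetical database name:

  USE master
  GO
  EXEC sp_configure 'allow updates', 1
  RECONFIGURE WITH OVERRIDE
  GO
  UPDATE sysdatabases
  SET status = status & ~256   -- clear the suspect bit
  WHERE name = 'damaged_db'    -- hypothetical database name
  GO
  EXEC sp_configure 'allow updates', 0
  RECONFIGURE WITH OVERRIDE
  GO
  -- Stop and restart SQL Server so the database is reopened.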

bcp in and bcp out

I don't recommend the bcp in and bcp out approach unless there's no other way to rescue data you can't afford to lose. You probably already know that bcp is inherently slow compared to other backup and recovery techniques. Furthermore, bcp requires that you reconstruct all database DDL scripts, user accounts, and permissions information. (SQL Object Manager in SQL Server 4.2x and SQL Enterprise Manager in 6.0 are handy for reverse-engineering DDL.) Worse, at some point all the data leaves the database and sits in flat files that aren't protected by the RDBMS; I'm never 100 percent sure that SQL Server will load this data back once the database is rebuilt. In addition, several problems may occur during the bcp in step.

For example, the transaction log may overflow. To resolve this, turn the "select into/bulkcopy" option on, and don't create indexes on tables until after you've loaded them. It may also help to activate the "trunc. log on chkpt." option. If the transaction log still overflows, you'll need to increase its size. Even though bulk copy operations per se aren't logged, new extents are allocated as a table is populated, and this process requires some logging.
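
Both options are set with sp_dboption from the master database; a quick sketch with a hypothetical database name:

  -- Run from master; 'loaddb' is a hypothetical database name.
  EXEC sp_dboption 'loaddb', 'select into/bulkcopy', true
  EXEC sp_dboption 'loaddb', 'trunc. log on chkpt.', true
  GO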

Another potential "gotcha" is that your database may run out of space when you create clustered indexes. Creating a clustered index requires free space of roughly 1.2 times the size of the table. (Hint: create a clustered index on the largest table first, then on the other tables, and then create the non-clustered indexes, as sketched below.) You may still need to increase the size of your database. Creating all indexes on all tables may take so much time that you'll want to consider other means of recovery.
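
Here's what that ordering looks like in outline, with hypothetical table and column names:

  -- Clustered indexes first, largest table first, while free space is greatest.
  CREATE CLUSTERED INDEX ci_orders ON orders (order_id)
  GO
  CREATE CLUSTERED INDEX ci_customers ON customers (cust_id)
  GO
  -- Non-clustered indexes last.
  CREATE NONCLUSTERED INDEX nc_orders_cust ON orders (cust_id)
  GO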

It's imperative that you practice a complete bcp out and subsequent bcp in operation on a real production-size database at least once, both to estimate the time required and to find bottlenecks up front.

Enclosed on this month's SQL Server Professional Developer's Disk you'll find my stored procedure sp_bcp. It automates the generation of command files for data export and import with bcp. sp_bcp provides flexible flat-file naming options: filenames may be composed of database, owner, table name, date, and time in any combination. The stored procedure can also produce eight-character filenames unique to each table within a given SQL Server, which is particularly useful on operating systems limited to the 8.3 file naming convention.

Use Transfer Manager to copy the database

Using Transfer Manager to copy the database is just an automated variation of the previous technique. Although TM may offer an acceptable alternative for small databases, remember that you need another database into which to copy all the objects and data (though it doesn't have to be the same size). What you gain in ease of use, however, is offset by your lack of control in cases such as transaction log or database overflows.

The "warm" backup approach

The warm backup approach is the most reliable and most expensive solution. Every database dump is loaded onto a backup server, and subsequent transaction log dumps are also applied there. It's easy to swap servers at any time, even though you'll lose changes made since the last transaction log dump. If you can afford to have two servers instead of one, you'll greatly reduce disaster risks. To reduce sensitivity to natural disasters, fire, and other physical damage to your server, set up the backup server in another location; you'll incur increased WAN traffic for this additional protection.
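
The dump-and-apply cycle itself is ordinary DUMP/LOAD TRANSACTION work; a sketch with hypothetical names, assuming the standby was seeded with a full LOAD DATABASE and sits untouched between loads:

  -- On the production server:
  DUMP TRANSACTION proddb TO DISK = 'C:\MSSQL\BACKUP\PRODDB.TRN'
  GO
  -- On the standby server, after the file is copied across the WAN:
  LOAD TRANSACTION proddb FROM DISK = 'C:\MSSQL\BACKUP\PRODDB.TRN'
  GO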

The replication approach

Replication is similar to the previous technique, but more complicated to set up. There may be a delay in copying the data, and you may also lose some of the last transactions preceding the disaster.

Application-specific backup and recovery methods

Depending on your application and environment, you may have additional recovery options, such as reconstructing the data from other sources. This is often true for decision-support, read-only databases.

Compromises

Suppose that you have two recovery plans. Plan A will safely restore all data in 48 hours, while Plan B will take only an hour but has a five percent chance of losing some data. Which one should you choose? Ultimately, your users should decide; a good DBA proposes all realistic recovery alternatives. Some sites will be willing to take reasonable risks. Here are some practical compromises that speed up recovery at the expense of added risk.

Compromise 1

Once you've fixed corruption, consider not running a complete DBCC if it takes an enormous amount of time. Instead, take a database dump, load it on another server, and do the DBCC there. Users will continue working with the recovered database while you're DBCCing the copy. They may hit more corruption, in which case you'll have to stop production and follow safe procedures. But in most cases you won't discover new error messages, and you'll save your users many hours of otherwise lost productivity.
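
In outline, with hypothetical names (the scratch database must match the production database's size and device structure):

  -- On production, after repairs:
  DUMP DATABASE proddb TO DISK = 'C:\MSSQL\BACKUP\PRODDB.DMP'
  GO
  -- On a scratch server:
  LOAD DATABASE proddb_check FROM DISK = 'C:\MSSQL\BACKUP\PRODDB.DMP'
  GO
  DBCC NEWALLOC(proddb_check)
  GO
  DBCC CHECKDB(proddb_check)
  GO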

Warning

Remember, if corruption exists on unused pages that aren't backed up, DBCCing the copy won't discover the problem.

Compromise 2

If a single table is corrupt, attempt to restore that one table without bringing down the whole database. Try to copy the table's data out with bcp, and then select it into (or insert it into) another table. If you succeed, drop the corrupt table and rename the new one. If error messages remain after that, the corruption most likely exists on unused pages; run DBCC NEWALLOC to find it.
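
A sketch of the salvage, with hypothetical names; bcp the data out first as insurance, then:

  USE master
  EXEC sp_dboption 'proddb', 'select into/bulkcopy', true
  GO
  USE proddb
  GO
  SELECT * INTO salvage FROM bad_table  -- may fail if damaged pages are in use
  GO
  DROP TABLE bad_table
  GO
  EXEC sp_rename 'salvage', 'bad_table' -- give the copy the original name
  GO
  DBCC NEWALLOC(proddb)
  GO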

Compromise 3

Consider skipping DBCC CHECKDB, which usually adds significantly to the backup/restore process and won't notice problems on unused pages anyway. Performing a CHECKTABLE on the tables in question is often sufficient.

Compromise 4

Use the NOINDEX option to reduce runtime at the expense of not checking indexes (see the sketch below). This is a particularly useful strategy if you have strong reason to believe that corruption was isolated to data pages and the indexes are intact. Most of the time spent performing CHECKDB and CHECKTABLE typically goes to index checking. In some cases it may even take less time to drop all indexes, execute DBCC, and then re-create the indexes than to run DBCC on a fully indexed table.
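
For example (the table and database names are hypothetical):

  DBCC CHECKTABLE(orders, NOINDEX)  -- data pages only, for one table
  GO
  DBCC CHECKDB(proddb, NOINDEX)     -- whole database, indexes skipped
  GO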

Compromise 5

Don't fix errors 2540 and 2546 unless you have countless numbers of them and you can reclaim plenty of space by fixing them. Each message 2540 indicates a 2K loss. Let them stay in the database if correction requires precious production time.

Sometimes you might hear that not fixing 2540 may lead to other problems. Not exactly. It's no exaggeration to say that I've seen hundreds of thousands of 2540 messages. Of course, I haven't counted each one by hand; automated jobs comb DBCC outputs and count messages for me every week. In my experience, leaving them in never led to further damage. It's true, though, that numerous 2540 messages apparently indicate that something is wrong with your storage device, and more problems, including fatal ones, may occur if you continue to ignore the underlying cause.

Compromise 6

Try to find information about "unsupported" DBCC options. They aren't necessarily documented and publicized, but they may be helpful in understanding and fixing the problem or learning more about a particular case of corruption. You'll find some of them buried in troubleshooting procedures. Here are a few such options (a usage sketch follows the list):

· DBCC FIX_AL fixes "innocent" cases of error 2540 (lost pages). It normally takes the same amount of time as NEWALLOC.

· DBCC PGLINKAGE may help in analyzing broken page chains.

· DBCC PAGE shows the contents of a particular database page.

· DBCC TRACEON(3604)/TRACEOFF(3604) switches DBCC command output between the error log and your screen.
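
For instance, a typical session might look like the following sketch. The names are hypothetical, and keep in mind that parameter lists for undocumented commands aren't guaranteed and vary by release:

  DBCC TRACEON(3604)        -- direct DBCC output to your screen
  GO
  DBCC PAGE(proddb, 100, 1) -- database, page number, print option
  GO
  DBCC TRACEOFF(3604)
  GO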

Warning

Don't seek your vendor's support on unsupported DBCC commands except when a support engineer offers to run them for you to resolve a problem.

It's well explained in the documentation, though often forgotten, that DBCC should be executed in single-user database mode. When it runs concurrently with transactions modifying data in the same database, DBCC produces unreliable results, and you may see scores of non-existent errors. Message 2521 looks especially scary, but in all likelihood it indicates only that some pages were modified while DBCC NEWALLOC was in progress. When SQL Server says "Database is not in single user mode—may find spurious allocation problems due to transactions in progress," it means it.
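
Setting and clearing the mode takes only a moment with sp_dboption; a sketch with a hypothetical database name (the option can be set only when no one else is using the database):

  USE master
  EXEC sp_dboption 'proddb', 'single user', true
  GO
  -- ... run DBCC NEWALLOC, CHECKDB, and so on, in proddb ...
  EXEC sp_dboption 'proddb', 'single user', false
  GO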

I hope that many of today's corruption problems will be eliminated by forthcoming product updates and improved hardware reliability. In my experience, both the frequency and severity of SQL Server disasters have declined substantially in the last few years. I'll be delighted when this article becomes obsolete and we enter a corruption-free database future.

Andrew Zanevsky, an independent consultant, has worked as a DBA and as an applications developer for several Fortune 500 companies using both Microsoft and Sybase versions of SQL Server. He is president of the Great Lakes SQL Server Users Group (http://www.glssug.com/glssug) located in Chicago. 708-609-8783, CompuServe 71232,3446.

To find out more about SQL Server Professional and Pinnacle Publishing, visit their website at http://www.pinpub.com/sqlpro/.

This article is reproduced from the June 1996 issue of SQL Server Professional. Copyright 1996, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call (800)788-1900 or (206)251-1900.