I had a phone meeting with a potential new customer last Thursday, and the customer said he was watching the news about a lunatic who crashed a plane into the IRS building. The news hadn’t reached the place I was working at yet, so I didn’t know what was going on. But as we talked about it, it reminded me of 9-11 and the plane crashing into the Pentagon. A plane crashing into a building: yep, it definitely reminds you of the threats out there. Eventually we talked business and ended our call, and I went about my day.
I’m not sure if the crash took out IT resources (such as servers) for the IRS. I couldn’t help but think, “If the crash took out IRS servers, were they able to recover in a timely fashion?” Yeah, I know it’s geeky and nerdy to be thinking about these things, but I’m an IT pro. I can’t help but ponder these things! 🙂
Every IT resource, infrastructure and systems alike, should have a disaster recovery plan. Government agencies like to call this Continuity of Operations Planning, or COOP. COOP is pretty comprehensive and covers not just how to recover IT but also management succession, crisis procedures, etc. If you are part of IT and you get detailed to participate in COOP planning, most of your contributions to the COOP plan will revolve around recovering IT assets such as servers, applications, databases, and the like.
I can’t remember how many times I’ve seen network and DB administrators claim “yeah, we have backups,” and then when disaster came, the backups were no good and couldn’t be restored! What good is that??! What the hell is that?? See, it’s not enough that you are backing up data and applications; you must also rehearse recovery procedures using the captured backups so that you can confidently report to your management, “Yes, we have backups and can recover in the event of a disaster.”
In the SharePoint world, it’s not enough that you are backing up content and configuration databases—you should rehearse recovery procedures from time to time. How do you know you can recover a toasted SharePoint farm configuration if you’ve never rehearsed it and seen with your own two eyes that your backups are good? You must test your recovery procedures.
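To make “rehearse it” a bit more concrete, here is a minimal sketch in Python. It is not SharePoint-specific and it is not how any particular backup product works; it just assumes your backup is a plain tar archive of a folder, and that the folder has not changed since the backup was taken. The function names (`make_backup`, `rehearse_restore`) are made up for this illustration. The idea is the same one you would apply to a farm: restore into a scratch location, then check what came back against what you expected.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_backup(source_dir: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of source_dir (hypothetical helper)."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to source_dir.
        tar.add(source_dir, arcname=".")


def rehearse_restore(source_dir: Path, archive: Path) -> list[str]:
    """Restore the archive into a scratch directory and compare it to source_dir.

    Returns the relative paths that are missing or differ after the restore.
    An empty list means the rehearsal passed.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for original in sorted(source_dir.rglob("*")):
            if not original.is_file():
                continue
            rel = original.relative_to(source_dir)
            restored = Path(scratch) / rel
            if not restored.is_file() or sha256_of(restored) != sha256_of(original):
                problems.append(rel.as_posix())
    return problems
```

In real life the source keeps changing after the backup runs, so you would more likely compare the restored files against a checksum manifest saved at backup time. Either way, the scratch restore is the part that tells you the backup is actually usable.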
How often should you back up and test recovery procedures? Well, one could probably write a dedicated Web site just on the topic of disaster recovery (just google it and there are tons). So, without discussing this too much, here is the general guide I like to follow: the more critical your systems, the more often you should back up and test recovery. For less-critical apps, the frequency doesn’t have to be as high (the important thing is that you do it). Just as an example, for critical apps: daily full backups with incremental backups during the day, and test recovery procedures every 2 weeks. The “least frequent” backup schedule I’ve ever done for a “non-critical” app is bi-weekly full backups and test recoveries every 3 months.
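If you want to keep yourself honest about a schedule like that, you could write it down as data and check it. The sketch below is only an illustration: the tier names, the `Policy` class, and `overdue_tasks` are all made up, “bi-weekly” is read as every two weeks, “every 3 months” is approximated as 90 days, and the incremental backups during the day are not modeled at all.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Policy:
    full_backup_every: timedelta
    restore_test_every: timedelta


# Intervals follow the two example schedules in the text; tier names are hypothetical.
POLICIES = {
    "critical": Policy(timedelta(days=1), timedelta(weeks=2)),
    # Assumption: "bi-weekly" means every two weeks, "3 months" is about 90 days.
    "non_critical": Policy(timedelta(weeks=2), timedelta(days=90)),
}


def overdue_tasks(tier: str, last_full_backup: datetime,
                  last_restore_test: datetime, now: datetime) -> list[str]:
    """Return which tasks are past their interval for the given tier."""
    policy = POLICIES[tier]
    overdue = []
    if now - last_full_backup > policy.full_backup_every:
        overdue.append("full_backup")
    if now - last_restore_test > policy.restore_test_every:
        overdue.append("restore_test")
    return overdue
```

For example, a critical app whose last full backup was six hours ago but whose last restore test was 20 days ago would come back with only `restore_test` flagged, which is exactly the kind of thing that tends to slip.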
The point is, you must back up and be ready to recover when disaster strikes. Nine years ago, terrorists attacked the WTC and the Pentagon. Last week, some lunatic had a personal grudge against the government. Who knows, maybe this week some admin at your office spills a latte on your production server and toasts mission-critical apps. Whatever the disaster may be, be ready.