From angry customers to irate CIOs, service outages are an IT nightmare to be avoided at all costs. It's impossible to prevent them entirely, so post-mortem assessment is critical to understanding how to minimize their impact in the future.
While not strictly a service "outage," Google's search engine mishap last weekend got me thinking about how the company will dissect the incident and what methods it will put in place so similar situations won't happen again.
Charles Foy, a service level manager with Siemens Healthcare, wrote a paper called "Say Goodbye to Post Mortems, Say Hello to Effective Problem Management" that takes an in-depth look at how his company investigates service outages and learns from them.
Foy says that although there were already methods in place, the need for a new system became evident after Y2K. At first, Foy's team planned to simply house standard post-mortem documents in a centralized folder. They soon realized, however, that they could design an entirely new process and database that would eventually lower the amount of unscheduled downtime.
"The benefits of a database of post-mortems were numerous. When implemented, we would have a central repository with records of all outages, the customers affected, downtime incurred, hardware involved, root causes, and the preventive measures implemented," wrote Foy in his paper.
"Our new goal then was to define a process and database that reduces unscheduled outages, increases availability, and communicates the root cause and preventive measures implemented to internal and external audiences," he goes on to say.
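To make the idea of a post-mortem database a little more concrete, here is a minimal sketch of what one record might look like. The fields follow the list in Foy's quote; everything else (the `OutageRecord` class, the field names, and the two helper functions) is my own assumption for illustration and is not taken from the paper or from Siemens' actual system.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class OutageRecord:
    """Hypothetical post-mortem entry; fields mirror the list in Foy's quote."""
    summary: str
    customers_affected: list[str]
    downtime: timedelta
    hardware_involved: list[str]
    root_cause: str
    preventive_measures: list[str] = field(default_factory=list)


def total_downtime(records):
    """Add up the downtime across all recorded outages."""
    return sum((r.downtime for r in records), timedelta())


def downtime_by_root_cause(records):
    """Group total downtime by root cause to show where prevention pays off."""
    totals = {}
    for r in records:
        totals[r.root_cause] = totals.get(r.root_cause, timedelta()) + r.downtime
    return totals
```

Even a structure this simple shows why a central repository beats a folder of documents: once every outage is recorded the same way, questions like "which root cause costs us the most downtime?" become a one-line query.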
To get all the details on how Foy's team developed its new process for problem management, you'll need to download the 11-page paper [PDF]. It's well worth taking the time to read because, although some of the methods may be overkill for small organizations, it's easy to glean plenty of tips that can be applied to companies of any size.