Commentary: Operational excellence deters serious Web outages

Microsoft's problems with its Web sites are a measure of the depth of the emotional response this company evokes.

Jan. 2, 2002 4:43 p.m. PT


Microsoft's very public problems with its Web sites this week quickly became a measure of the depth of the emotional response this company evokes, as detractors and defenders gathered to debate the causes of the problem.

See news story:
Microsoft blames technicians for massive outage

Initially, some Microsoft haters said the problem showed the poor state of Microsoft's software or security.

Then, as people discovered that the Web sites themselves were running and could be reached directly through their IP addresses, the debate shifted to the possibility that a DNS (Domain Name System) server had been hacked and that the problem was not bad software at all.

Finally, word came from Microsoft that someone had misconfigured a router. The problem ultimately was not with the technology, but with IT operations. This event is the latest demonstration of how fragile and vital today's networks are.
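The observation that the sites answered by IP address but not by name is itself a useful diagnostic. A minimal sketch of that check follows, written in Python with only the standard library. The function names and the idea of keeping a "known good" IP address on file are assumptions for illustration, not a description of how anyone at Microsoft diagnosed the outage.

```python
import socket
from typing import Callable, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return the IP address the name resolves to, or None if lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def can_connect(ip: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to ip:port can be opened."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(
    hostname: str,
    known_ip: str,  # hypothetical: an address recorded while the site was healthy
    resolve_fn: Callable[[str], Optional[str]] = resolve,
    connect_fn: Callable[[str], bool] = can_connect,
) -> str:
    """Separate a name-resolution failure from a server failure."""
    resolved = resolve_fn(hostname)
    reachable = connect_fn(known_ip)
    if resolved is None and reachable:
        return "name resolution failing, server reachable by IP"
    if resolved is None:
        return "name resolution failing and server unreachable"
    if not reachable:
        return "name resolves but server unreachable"
    return "ok"
```

A result of "name resolution failing, server reachable by IP" points away from the Web servers and toward the naming infrastructure or the network path to it. It does not say whether the cause is an attack or an internal error.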

People who want to throw stones at Microsoft should realize that they also live in glass houses. They should go down to their glass house (data or operations center) and make very sure that their operations group is well funded and has implemented a strong operational excellence plan, because operational excellence is the only defense against this kind of embarrassingly public systems failure.

Refining critical operations
The key to minimizing these types of operational issues is to continually refine critical operations processes. Process refinement consists of several major activities that need to be done regularly for each process, including the following:

• Documenting workflows and tasks using flow charts and specific "how to" directions, particularly as processes change due to new workloads, technology change and process improvements.

• Identifying the key skills that operations personnel must have to be able to perform the process (specifying what needs to be on someone's resume to be competent at a process).

• Automating routine aspects of a process. The key is to focus on automating the most commonly occurring repetitive tasks.

• Defining best practices--the goals that operations teams are striving for in performing a particular process at a level of excellence.

• Identifying the metrics that must be captured to measure operations performance levels and that are used to make sure that progress is being made toward achieving best practices.

Organizations that excel at process refinement take this exercise to the next level and actually have a set of steps associated with each process. Key processes for dealing with the type of issue that Microsoft faced in the last few days include configuration management, change management, problem resolution, quality assurance, and business continuity/disaster recovery.

A full program for excellence
However, process refinement is only one part of an overall operations excellence program. A full operational excellence program requires significant commitment and consists of ongoing projects targeting the following:

• Rapid assimilation: Coordinating closely with business planners and IT developers to rapidly assimilate new applications, mergers and acquisitions into operations.

• Process refinement: Continuously improving operations processes.

• Creation of IT products and service catalogs: Operations should have a catalog of products and services they provide, associated pricing and service levels.

• Creation of centers of excellence: Aggregating mature processes to provide end-to-end services and making best use of staffing and resources.

• Organizational change: As processes are refined and a product/services catalog evolves, new skills are required and reporting structures need to evolve.

• Service level and metrics reporting: Great operational organizations evolve from gathering and reporting operations-specific metrics to reporting metrics linked to business-related measurements.

Early coordination and an appropriate handoff between applications teams and operations are critical. The schism that currently exists between operations and applications teams is extremely troublesome. It leads to poor understanding on the part of operations staff about the business and IT context for decisions about problem resolution, configuration management, change management, and so on.

Avoiding wholesale and partial failures
Today's networks are as fragile and vital as the old banking networks. When those networks failed, the staff would do a "checkpoint restart," backing up to the last point at which things worked well and rebuilding based on that configuration. If Microsoft had a strong, well-documented process similar to that, it might have been able to resolve the problem in an hour instead of a day.
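The "checkpoint restart" idea carries over to network configuration as keeping a history of configurations and knowing which one last worked. The sketch below shows one possible shape for that record in Python. The class and method names are hypothetical, and a real change-management system would also store who made each change and push the restored configuration back to the device.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    version: int
    config: str             # full text of the device configuration
    verified: bool = False  # set once the change is confirmed working


@dataclass
class ConfigHistory:
    snapshots: List[Snapshot] = field(default_factory=list)

    def record(self, config: str) -> Snapshot:
        """Save a new configuration before it is applied."""
        snap = Snapshot(version=len(self.snapshots) + 1, config=config)
        self.snapshots.append(snap)
        return snap

    def mark_verified(self, version: int) -> None:
        """Flag a version as known good after it has run without problems."""
        self.snapshots[version - 1].verified = True

    def last_known_good(self) -> Optional[Snapshot]:
        """Return the most recent verified snapshot, or None if there is none."""
        for snap in reversed(self.snapshots):
            if snap.verified:
                return snap
        return None
```

With a record like this, the first step in an outage is mechanical: look up the last known good version, restore it, and investigate the failed change afterward.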

Companies also need to guard against partial failures of Web sites that deny users access to some services or information without causing the entire site to crash. In many cases, Web configuration errors that result in problems falling short of a complete outage (for example, some content not accessible, database-driven pages not working) may go undetected by Web operations staff for a significant period. IT organizations should invest in quality-of-service technologies that can automatically monitor sites (by, say, periodically executing test scripts) to bring problems to the attention of operations staff in a timely manner.
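A periodic test script of the kind described can be quite small. The following Python sketch requests each page and checks for text that should appear only when the page rendered correctly, so a database-driven page that loads but shows an error is still caught. The `Check` structure, the function names, and any URLs passed to it are illustrative assumptions. Scheduling and alert delivery are left out.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.request import urlopen


@dataclass
class Check:
    name: str          # human-readable label used in the alert
    url: str           # page to request
    must_contain: str  # text that proves the page rendered correctly


def http_fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page body; raises on network errors and HTTP error statuses."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_checks(
    checks: List[Check],
    fetch: Callable[[str], str] = http_fetch,
) -> List[str]:
    """Run every check once and return a list of failure messages."""
    failures = []
    for check in checks:
        try:
            body = fetch(check.url)
        except Exception as exc:  # unreachable host, timeout, HTTP error
            failures.append(f"{check.name}: request failed ({exc})")
            continue
        if check.must_contain not in body:
            failures.append(
                f"{check.name}: page loaded but expected content is missing"
            )
    return failures
```

Run from a scheduler every few minutes, an empty result means all monitored pages responded with their expected content, and any non-empty result can be forwarded to operations staff. Checking content rather than only the response status is what distinguishes a partial failure from a healthy page.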

Strong operational excellence and security policies are even more vital today than before the Microsoft incident, because this experience has publicly spotlighted for hackers a major potential weak point in the Internet. While the Microsoft problem was caused by an internal error, a hacker successfully attacking a DNS server could create a similar incident. Because the configurations of these servers are replicated and interconnected, it could take hours to work the results of such an attack out of a system, even if it was countered quickly.

Events such as Microsoft's Web site outage should be a strong reminder to user organizations to drive toward operational excellence. Only then will they be able to avoid most problems and react quickly and decisively when problems do occur.

Meta Group analysts Val Sribar, Dale Kutnick, William Zachmann, Melinda Ballou, David Folger and David Cearley contributed to this article.

Visit Metagroup.com for more analysis of key IT and e-business issues.