
AOL mystery explained

America Online, struggling to recover from the worst commercial online outage in history, explained to its six million customers the combination of human and technical snarls that caused a 19-hour outage yesterday.

Jeff Pelline Staff Writer, CNET News.com
Jeff Pelline is editor of CNET News.com. Jeff promises to buy a Toyota Prius once hybrid cars are allowed in the carpool lane with solo drivers.
America Online, struggling to recover from the worst commercial online outage in history, explained to its 6 million customers the combination of human and technical problems that caused a 19-hour outage Wednesday.

According to AOL, the outage started at 1 a.m. Pacific time, when it took down the system to replace high-capacity switches within the local area network in its data centers in Virginia. When AOL tried to bring the system back up, it couldn't, said Matthew Korn, AOL's vice president of operations.

Teams of AOL technicians spent most of the day going over the changes they had made, trying to figure out where the mistake in installing the new switches had occurred. But they couldn't find anything wrong.

Unfortunately, the technicians had spent most of their time looking at the wrong problem. The glitch, it turned out, originated at Advanced Network and Services, a New York-based, wholly owned subsidiary of AOL.

ANS "erroneously reconfigured the routing information that was to be sent to AOL," America Online CEO and chairman Steve Case said in a written statement released this afternoon. But the company was caught in a Catch-22 situation: because AOL's system was down, the mistake couldn't be detected.

In addition, the vendor that provides AOL's routers found a glitch in the routers' operating system software, according to AOL. Korn refused to say who that vendor is, citing AOL policy.

"We have a practice of not naming vendors who participate in and provide systems and services to America Online," Korn said. "America Online strongly takes the position that it's our responsibility."

Cisco Systems and Bay Networks, among other companies, supply computer networking equipment to AOL and its ANS subsidiary. In April, Bay Networks and ANS announced what they called a "strategic relationship encompassing products, services and joint development in the ANS backbone network."

Cisco and a Bay Networks spokesman, reached this afternoon, denied that their companies' products were responsible for any of the problems. They cited AOL's policy of not naming third-party vendors, saying only that it was a "multi-vendor" issue.

Many networking companies are trying to persuade customers to buy a line of products from a single vendor instead of the combined, multi-vendor systems often found today. Experts say that such combined systems can be vulnerable to breakdowns because their parts aren't always compatible or because technicians are not sufficiently trained to make them work.

"We understand that our members and partners were very frustrated and inconvenienced throughout the day," Case said. "We mobilized every possible resource and worked around the clock to correct the problem as quickly as possible."

Case, while saying he couldn't promise that AOL will never have another outage, did say that "the interruption was caused by a coincidental series of sequential events that will most likely never occur again."

But Bob Metcalfe, an Internet critic, said the outage wasn't just a fluke or coincidence: It is an indication that online services, and the Internet as a whole, are too vulnerable to technological and human error.

"Every one of these [errors] is unlikely and will never happen again," he said. "And then another one happens a week later. It will get worse instead of getting better."