BlackBerry outage: The day after
Three-hour service outage has yet to be fully explained, but it highlights two old stories: the problems with a single point of failure, and the CrackBerry.
Update 2:15 p.m. PST: No sooner do I post this than RIM goes and issues an explanation for the outage. Read on for the details...
In the immortal words of Cinderella's Tom Keifer, you don't know what you got, till it's gone.
Monday's widespread BlackBerry outage--the second major one in the past 12 months--left Research In Motion customers stranded and cut off from the rest of the world, sort of like what happened to the '80s glam metal band after Long Cold Winter. The Internet's equivalent of a snow day left reams of e-mail messages undelivered for about three hours Monday, according to RIM, which either still hasn't figured out exactly what caused the problem, or isn't willing to disclose the cause just yet.
Representatives for AT&T and Verizon told several media outlets Monday that from what they understood, all wireless carriers in North America that work with RIM were affected. The last time an outage of this magnitude occurred, in April, RIM blamed a database problem that snowballed when the backup "failover" process didn't work as planned.
It's amazing how dependent people have become on their mobile devices. CrackBerry addiction is an old story, but it keeps surfacing every time people are forced to go more than 10 minutes without access to their e-mail. Local television stations in San Francisco all teased the BlackBerry outage on their 11 p.m. newscasts as a near-disaster, since we don't have weather events out here to keep people watching the local news.
While coverage of the outage just goes to show how mobile devices like the BlackBerry really are becoming the next wave of personal computing, it also points out that the entire system has a single point of failure: RIM itself.
All e-mail messages sent to or from a BlackBerry in North America must at some point in their journey travel through RIM's network operations center (NOC) in Canada. The company tried to use that to its advantage in its patent dispute with NTP, noting that since such a critical part of the service lies in Canada, RIM should be exempt from U.S. patent claims. That didn't take.
The Wall Street Journal reported Tuesday that expansion efforts at RIM's NOC may have been to blame for the outage. The problem isn't that the servers are in Canada; they could be anywhere. It's just that everything has to go through the one location. In theory, as long as you have enough redundant backup systems and plans, that shouldn't be a problem. But every now and then, it is.
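To put rough numbers on the "in theory" part, here is a minimal sketch of how redundancy is supposed to help and how a shaky failover step eats into the benefit. The availability figures and function names are hypothetical, chosen only for illustration; nothing here reflects RIM's actual architecture, and the math assumes sites fail independently, which real outages often violate.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Chance that at least one of `sites` independent sites is up."""
    return 1 - (1 - per_site) ** sites


def with_failover(primary: float, backup: float, failover_success: float) -> float:
    """Service is up if the primary is up, or if the primary is down
    but the failover works AND the backup is up."""
    return primary + (1 - primary) * failover_success * backup


# Hypothetical figures: each site is up 99 percent of the time.
print(f"one site:            {combined_availability(0.99, 1):.4%}")
print(f"two sites (ideal):   {combined_availability(0.99, 2):.4%}")
# Same two sites, but the switch to the backup only works half the time.
print(f"two sites, flaky failover: {with_failover(0.99, 0.99, 0.5):.4%}")
```

The point of the last line is that a backup is only as good as the process that switches over to it, which is roughly what RIM said went wrong in April.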
Frank Gilman, the chief technology officer for Los Angeles law firm Allen Matkins, was forced to deal with the outage Monday afternoon. "What surprised me was the apparent lack of a solid business continuity plan on RIM's part to ensure reasonable connectivity," he said via e-mail, of course. "A company that is marketing devices that increase the mobility of professionals should have systems and contingencies in effect to avoid an outage of that size and duration."
I'm sure that far more BlackBerry-related disasters are averted than ever come to light. But RIM has an advantage over other service providers in that few people sign service-level agreements (SLAs) with RIM for the BlackBerry service. SLAs are basically promises from hosted service providers to maintain a certain level of uptime, with figures as high as 99.999 percent or so often quoted.
Those promises are usually only worth the paper they're printed on, however, as the process of actually accounting for and proving damages as a result of an outage can be extremely difficult. Given the degree to which many large businesses--not to mention U.S. government staffers--rely on the BlackBerry service, perhaps RIM's larger customers will start thinking about negotiating such an agreement when it comes time to renew the service.
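For a sense of scale, here is a quick back-of-the-envelope sketch of what a 99.999 percent uptime promise would actually allow over a year, and where a single three-hour outage would leave a provider. The function names are made up for illustration, and the math assumes a 365-day year and counts only this one outage; it says nothing about what any real RIM contract specifies.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by a given uptime promise."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def achieved_uptime_percent(outage_minutes: float) -> float:
    """Yearly uptime percentage if total downtime was `outage_minutes`."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


# "Five nines" leaves a little over five minutes of downtime per year.
print(f"99.999% allows {allowed_downtime_minutes(99.999):.2f} minutes/year")

# A single three-hour (180-minute) outage, with no other downtime all year.
print(f"3-hour outage -> {achieved_uptime_percent(180):.3f}% uptime")
```

Under those assumptions, one three-hour outage by itself uses up decades' worth of a five-nines downtime budget, which is part of why such promises are hard to hold anyone to.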
As frustrating as the outage may have been, it's not like the U.S. economy ground to a halt Monday afternoon as millions of e-mails about sales presentations, and reminders to the people on the fourth floor to empty the refrigerator on alternate Fridays, went undelivered.
Still, RIM needs to come clean about what caused the problem if it wants to keep people hooked on its service. I've seen the thumb wheel and the damage done.
Update 2:15 p.m. PST: RIM sent out a statement after waiting for me to post this blog, just to make sure we could test our own update procedures.
The company is blaming "a problem with an internal data routing system within the BlackBerry service infrastructure that had been recently upgraded," according to the statement. RIM has been upgrading its capacity as demand for the BlackBerry continues to grow, and usually there isn't much of a problem during one of those upgrades. This time, something apparently went wrong.
"Once again, RIM apologizes to its customers for any inconvenience." The company said it would share further details once a more in-depth investigation is completed.