Facebook suffers 'worst outage' in 4 years; Qwest sees packet loss

For several hours, Facebook users have trouble accessing the service as a key ISP struggles with packet loss. Are the two related?

Tom Krazit Former Staff writer, CNET News

Tom Krazit writes about the ever-expanding world of Google, as the most prominent company on the Internet defends its search juggernaut while expanding into nearly anything it thinks possible. He has previously written about Apple, the traditional PC industry, and chip companies.

Sept. 23, 2010 1:15 p.m. PT

The Internet Health Report showed severe packet loss on links between Qwest and other service providers during the time Facebook users reported major problems, although it's not clear whether they are linked. (Keynote Systems)

Editors' note: By late afternoon, Facebook was saying that it had gotten to the bottom of what it called "the worst outage we've had in over four years." Below is our account of how things unfolded during the course of the day.

Facebook struggled to keep its service available today while a prominent Internet service provider dealt with a major outage of its own.

We're getting countless reports of Facebook problems from users both inside and outside of CNET, with most reporting that the site has been up and down all morning. Facebook acknowledged that it was experiencing "latency issues" with its API, but it was not immediately clear what was causing those issues.

However, during the same period, representatives for Qwest, a major Internet service provider, used their Twitter account to alert customers to "a known outage in WA," adding that "our techs are currently working to restore service as soon as possible." A quick check of Keynote Systems' Internet Health Report around midday Pacific Time showed major packet loss on several key routes between Qwest and other Tier 1 providers, such as SBC and Savvis.

It's not at all clear that the two issues are related: representatives from Facebook and Qwest have not returned multiple requests for clarification. We'll continue to update this post as we receive more information, and please let us know if you're having problems with Facebook and where you're located, as these issues can often be regional.

Updated 1:22 p.m. PDT: Facebook released a statement: "We are experiencing an issue with a third party networking provider that is causing problems for some people trying to connect to Facebook. We are in contact with this provider in order to explore what can be done to resolve the issue. In the meantime, we are working on deploying changes to bypass the affected connections."

We're trying to get them to confirm whether Qwest is that "third-party networking provider," so stay tuned.

Updated 3:03 p.m. PDT: Still no word from Facebook beyond its statement, but things appear to be getting back to normal. Many of those who could not reach the site earlier seem to be able to visit their News Feeds, but are having trouble with side features like Facebook Chat. Also, the Facebook API feeds based on the "Like" and "Share" buttons appear to have returned after disappearing from many sites, including ours.

A Qwest representative said the company was looking into the outage, and hoped to have more detailed information to share this afternoon. The Internet Health Report now shows Qwest's services as back to normal in terms of packet loss, but latency is still in the yellow "warning" zone.

At some point, Qwest's Twitter handler deleted the message informing users of "a known outage."

Updated 3:14 p.m. PDT: Facebook released an updated statement: "Today we experienced technical difficulties causing the site to be unavailable for a number of people. The issue has been resolved and everyone should now have access to Facebook. We apologize for any inconvenience."

Updated 3:50 p.m. PDT: Bob Gravely, a Qwest spokesman, said two incidents happened today on Qwest's network. Earlier this morning Pacific Time, the company suffered an equipment failure in its Tukwila, Wash., offices, which was the outage referred to in the now-deleted Twitter message. That outage was confined to the Seattle area, he said.

However, later in the morning Pacific Time a contractor working in Indiana cut through part of the national backbone fiber that Qwest operates, forcing the company to reroute traffic across its entire network in order to bypass the cut fiber, Gravely said. This resulted in the massive traffic backups that were noted on the Internet Health Report, but the situation has returned to normal as of this writing.

Internet companies that operate at Facebook's scale get their bandwidth from a variety of sources, but CNET has confirmed that Qwest is one of them. However, neither company will comment on whether or not there was a link between the two problems.

It's worth noting that Facebook had similar problems yesterday that were also attributed to a third-party networking provider.

Updated 6 p.m. PDT: Facebook released a statement explaining what sparked the outage, which the company called its worst in more than four years:

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.

The way to stop the feedback cycle was quite painful--we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
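For readers who want to see why that loop sustains itself, here is a rough sketch of the behavior the statement describes. It is not Facebook's code: the `simulate` function, the single `"config"` cache key, the traffic numbers, and the fixed per-tick database capacity are all assumptions made for illustration. The model also assumes the worst case within each tick, in which failed clients delete the cache key after the successful ones have repaired it.

```python
GOOD = "valid-config"  # hypothetical stand-in for a valid configuration value


def simulate(traffic, capacity, delete_on_error):
    """Return how many database queries were issued on each tick.

    traffic         -- number of active clients on each tick
    capacity        -- queries the database can answer per tick (assumed fixed)
    delete_on_error -- True models the behavior in the statement: a client
                       that gets a database error deletes the cache key
    """
    cache = {}  # starts without a valid value, so clients must ask the database
    queries_per_tick = []
    for clients in traffic:
        # All clients read the cache at the same moment at the start of the tick.
        misses = clients if cache.get("config") != GOOD else 0
        queries_per_tick.append(misses)

        succeeded = min(misses, capacity)
        failed = misses - succeeded

        if succeeded:
            # A client that got an answer repairs the cached value.
            cache["config"] = GOOD
        if failed and delete_on_error:
            # A client that got an error treats it as an invalid value and
            # deletes the key, undoing the repair for everyone else.
            cache.pop("config", None)
    return queries_per_tick


if __name__ == "__main__":
    # Errors delete the key: the query storm never subsides.
    print(simulate([1000] * 5, 100, delete_on_error=True))
    # Errors leave the cache alone: one bad tick, then recovery.
    print(simulate([1000] * 5, 100, delete_on_error=False))
    # Cutting traffic to zero, then readmitting only as many clients as the
    # database can serve, lets the cache be repaired before full load returns.
    print(simulate([1000, 1000, 0, 100, 1000, 1000], 100, delete_on_error=True))
```

In this toy model the only way out of the loop, short of changing the client behavior, is to drop traffic below what the database can serve, which mirrors the "turn off the site, then slowly let people back" recovery Facebook described.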

CNET's Caroline McCarthy contributed to this report.