A partial failure at Amazon Web Services' cloud-computing infrastructure brought down some Internet operations today, including the Web sites of Quora and Reddit.
The outage struck the Elastic Compute Cloud (EC2) service at Amazon's northern Virginia site, which handles AWS operations for the U.S. East Coast. The problems began at 1:41 a.m. PT, according to Amazon's AWS status dashboard, with delays and errors when connecting to servers over a network.
A long list of customers has come to rely on Amazon EC2, which provides servers on a pay-as-you-go basis that lets customers ramp up or down according to varying computing needs.
Amazon said on the dashboard it was making progress in resolving the problems but as of 9 a.m. PT was still having troubles.
Amazon offered this status update at 8:54 a.m. with a more detailed explanation but not a very optimistic tone. The problem started with a "networking event" that led to problems with how data is mirrored:
We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS [Elastic Block Store] volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
The problems affected AWS customers:
"We'll be back shortly, we hope. Sorry, it sucks for us too," read a note on the Web site of Quora, a site that lets people ask and answer questions. "We'd point fingers, but we wouldn't be where we are today without EC2."
And Reddit, a popular discussion site among the tech set, said, "Amazon is currently experiencing a degradation."
Today's outage also impaired Amazon's relational database service on the East Coast and its Elastic Beanstalk for automatically deploying, managing, and monitoring services. Most other services, such as the widely used Simple Storage Service (S3), appeared unaffected.
Cloud computing takes many forms, but AWS' nuts-and-bolts ingredients are among the biggest successes of the idea. AWS services can be grafted on to a company's internal operations to provide extra computing capacity or to handle one particular operation such as data storage, or they can be the foundation of an entire Internet operation.
When a cloud-computing provider has trouble, of course, it raises worries about the dangers of outsourcing operations to another company. But a full judgment about the merits of cloud computing must also factor in the reliability, expense, and adaptability of in-house operations.
Updated 9:18 a.m. PT with some explanation from Amazon about what went wrong.