Today, April 29, 2011, Amazon Web Services released a "summary" of the EC2 (Elastic Compute Cloud) and RDS (Relational Database Service) disruption in its U.S. East Region. It came roughly a week after what appears to be a classic example of a rolling disaster, triggered when someone incorrectly executed a communications network traffic shift as part of "normal AWS scaling activities." I read human error here--long known as the leading cause of large system failures.
The rolling disaster is a well-understood phenomenon in IT, yet it can be hard to foresee in a complex system. The way to discover and fix potential failure points is to test on a regular basis and then engineer around what the tests reveal. But periodic testing becomes difficult for a system of this magnitude.
What I find positive about the Amazon summary is a set of disaster recovery recommendations for users and an admission that AWS customer support during the outage was less than stellar. The disaster recovery recommendations should now be required reading for every AWS customer. In fact, I think that all cloud services users should read this statement with an eye to discovering potential holes in their own disaster recovery strategies.
AWS users now have hope that Amazon will take its own advice on customer support to heart as well. In the week between the outage and the summary statement, I noted three public statements from Amazon. The day after the outage, AWS offered training services for $2,000 on its blog page. On April 26, Amazon CTO Werner Vogels posted a letter to Amazon shareholders, congratulating them for investing in Amazon. The only press release out of Amazon since the outage has been an announcement of first quarter 2011 financial results. Sales were up 38 percent. Yes, there's definitely some customer support and PR work left to do.
However, for me the big unanswered questions are these: How much data was actually lost as a result of this outage? And what was the value to the EC2 customer of that data?
In the summary statement, AWS continually refers to data as "volumes"--a logical concept that doesn't really describe data--rather than in terms of files or a capacity metric such as gigabytes. I find this misleading. Volumes can contain large numbers of files and can be large in actual capacity. One section of the summary concludes with the following statement: "Ultimately, 0.07 percent of the volumes in the affected Availability Zone could not be restored for customers in a consistent state." In other words, some EC2 customers' data was lost for good. In terms of actual capacity, that could be a small amount or a large amount--we don't really know.
Saying that 0.07 percent of volumes stored at the failed region can't be recovered really doesn't reveal the true extent of the damage. If, for example, the vast majority of all volumes at this site were small in size with only a few large volumes, but it was the large volumes that were lost, AWS could have lost a significant amount of customer data while only losing 0.07 percent of the volumes. It could be that the smaller volumes were easier to recover than the larger ones.
Furthermore, some of those volumes--small or large--could have been critical to the functioning of other customer processes. If AWS lost a database index, for example, then the entire database could be irretrievable from the perspective of a customer that doesn't have a backup copy of the index.
Amazon's summary statement is a laudable exercise for customers moving forward with EC2. But for me, the true extent of the damage from last week's outage is yet to be known.