Outages, complexity, and the stronger cloud
Though outages like the one at Amazon Web Services make for great headlines, the truth is they're part of the circle of cloud life. How providers learn from them is critical to the cloud's future strength.
The outage of Amazon Web Services' EBS storage services in one of its service "regions" the week of April 21st has triggered so much analysis--emotional and otherwise--that I chose to listen rather than speak until now. Events like this are tremendously important, not because they validate or invalidate cloud services, but because they let us see how a complex system responds to negative events.
You see, for almost four years now, I've believed that cloud computing is evolving into a complex adaptive system. Individual services and infrastructure elements within a cloud provider's portfolio are acting as "agents" that have various feedback mechanisms with various other services and infrastructure. A change in any one element triggers automation that likely changes the condition of other elements in the system.
It's similar to the behavior of other systems, such as automated trading on the stock market (an example I've used before).
The adaptive part comes about when humans attempt to correct the negative behaviors of the system (like the cascading EBS "remirroring" in the AWS outage) and encourage positive behaviors (by reusing and scaling "best practices"). Now, expand that to an ecosystem of cloud providers with customers that add automation across the ecosystem (at the application level), and you have an increasingly complex environment with adaptive systems behavior.
The thing is, science shows us that in complex adaptive systems tiny changes to the system can result in extreme behaviors. Events like this will happen again. I don't panic when there is a cloud outage--I embrace it, because the other aspect of complex adaptive systems is that they adapt; they get better and better at handling various conditions over time.
I'll bet this week's VMware Cloud Foundry outage will be an excellent microcosm in which to see this behavior play out. That outage was also triggered by human error. The result will be corrections to the processes that were violated, but also evolution of the service itself to be more resilient to such errors. Cloud services can't afford only to attempt to ban mistaken behavior; they have to find what it takes to remain viable when faced with one of those mistakes.
Outages are inevitable. We all know that's true. We don't have to like it, but we have to live with it. Thinking that good design can eliminate failure outright is naive. Demanding that our providers adapt their services to eliminate specific failures, once those failures have shown themselves, is where the rubber meets the road.
How does one improve the resiliency of a complex adaptive system? By changing its behavior in the face of negative events. We can plan as humans for a fairly wide swath of those events, but we can't protect against everything. So, when faced with a major failure, we have to fix the automation that failed us. Change an algorithm. Insert new checks and balances. Remove the offending element for good.
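To make "insert new checks and balances" a little more concrete, here is a minimal sketch in Python of one common kind of check: capped exponential backoff with random jitter, so that many automated agents retrying a failed operation spread out over time instead of piling on all at once. This is an illustration only, not a description of how Amazon or anyone else fixed their systems; the function names `backoff_delays` and `retry_with_backoff` and all the parameter values are hypothetical.

```python
import random
import time


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait time (in seconds) per retry.

    The ceiling doubles with each retry but never exceeds `cap`; each delay
    is drawn uniformly between 0 and that ceiling so that independent
    clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def retry_with_backoff(operation, attempts=5, base=1.0, cap=60.0,
                       sleep=time.sleep, rng=None):
    """Call operation() until it succeeds or the attempts run out."""
    delays = backoff_delays(attempts - 1, base, cap, rng)
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure instead of looping
            sleep(delays[attempt])  # wait before trying again
```

The design choice worth noticing is the cap and the final `raise`: the automation is allowed to try to heal itself, but only within limits, after which the problem is handed back to whoever is watching the system.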
There is, however, no guarantee that one fix won't create another negative behavior elsewhere.
Which brings me to my final point: many AWS customers felt let down by Amazon as a result of this outage. I think they had the right to feel that way, but not to the extent that they could claim Amazon was negligent in how it either built EBS or handled the outage. Amazon was, however, guilty of not giving customers enough guidance on how to develop resilient applications in AWS.
But guess what? I bet Amazon will fix that, too.