Friday, November 16, 2012

Cloud Outage Reports

The detailed outage summaries published by cloud vendors are comprehensive, and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to build reliable services on top of AWS. I've included some Google and Azure outages here because they illustrate different failure modes that should be taken into account. Recent AWS and Azure outage reports have far more detail than Google outage reports.

I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a zone within a region. In some cases there are secondary impacts with a wider scope but shorter duration, such as regional control planes becoming unavailable for a short time during a zone outage.
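As a rough illustration of the naming convention, here is a minimal Python sketch that builds a list heading from its parts. The `OutageEntry` class and its field names are hypothetical, made up for this example; the sample values are taken from the August 2011 entry in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageEntry:
    """One entry in the outage list (hypothetical structure, for illustration only)."""
    date: str    # free-form date text, as written in the list
    vendor: str  # e.g. "AWS", "Azure", "Google"
    scope: str   # "Global", a region, or a zone within a region
    cause: str   # short description of the cause

    def heading(self) -> str:
        # {date} - {vendor} {primary scope} {cause}
        return f"{self.date} - {self.vendor} {self.scope} {self.cause}"


entry = OutageEntry("August 17th, 2011", "AWS", "EU-West Zone", "Power Outage")
print(entry.heading())
```

Keeping the scope as its own field makes it easy to record a secondary impact separately, for example a short regional control plane outage during a zone outage.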

This post was written while researching my AWS re:Invent talk.
Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha
Video: http://www.youtube.com/watch?v=dekV3Oq7pH8


January 10th, 2014 - Dropbox Global Outage

Dropbox Report


April 20th, 2013 - Google Global API Outage

Google Report


February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report


December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

http://aws.amazon.com/message/680587/

Netflix Techblog Report

http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html


October 26th, 2012 - Google AppEngine Network Router Overload

Google Outage Report


October 22nd, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

Netflix Techblog Report


June 29th, 2012 - AWS US-East Zone Power Outage During Storm

AWS Outage Report

Netflix Techblog Report


June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report


February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report


August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report


April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

Netflix Techblog Report

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html


February 24th, 2010 - Google App Engine Power Outage

Google Forum Report


July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report



6 comments:

  1. 2008 AWS S3 outage:

    http://status.aws.amazon.com/s3-20080720.html

    1. Thanks! Another interesting case.

    2. Added it to the list.

  2. Just reading the Google Apps outage account, and I can't call that comprehensive. Unlike Amazon's, you have no clue what went wrong, where, or why. Things failed and got restarted.

    1. It will be interesting to see if Google sees the value in being more open and detailed for these reports as they try to compete more directly with AWS.

  3. Regarding the April 2013 Google Global API outage -- it may be worth adding a link to their more detailed postmortem: http://googledevelopers.blogspot.ie/2013/05/google-api-infrastructure-outage_3.html
