Friday, November 16, 2012

Cloud Outage Reports

Cloud vendors' detailed outage summaries are comprehensive, and the response to each outage highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to build reliable services on top of AWS. I've included some Google and Azure outages here because they illustrate different failure modes that should be taken into account. Recent AWS and Azure outage reports have far more detail than Google outage reports.

I plan to collect reports here over time, and I welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a zone within a region. In some cases there are secondary impacts with a wider scope but shorter duration, such as regional control planes becoming unavailable for a short time during a zone outage.
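As a rough illustration of that naming convention, the sketch below assembles a heading from its parts. The `OutageEntry` class and its field names are hypothetical, introduced only for this example; the date prefix mirrors how the headings in the list below are written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageEntry:
    """Hypothetical record for one entry in the list (illustration only)."""
    date: str    # kept as free text, exactly as written in the heading
    vendor: str  # e.g. "AWS", "Azure", "Google"
    scope: str   # global, a region, or a zone within a region
    cause: str   # short description of what failed

    def heading(self) -> str:
        # {date} - {vendor} {primary scope} {cause}
        return f"{self.date} - {self.vendor} {self.scope} {self.cause}"


entry = OutageEntry("August 17th, 2011", "AWS", "EU-West Zone", "Power Outage")
print(entry.heading())
```

Not every heading in the list follows the pattern strictly, so treat this as a reading aid rather than a parser for the entries.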

This post was written while researching my AWS re:Invent talk.

November 18th, 2014 - Azure Global Storage Outage

Microsoft Reports

January 10th, 2014 - Dropbox Global Outage

Dropbox Report

April 20th, 2013 - Google Global API Outage

Google Report

February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report

December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

Netflix Techblog Report

October 26th, 2012 - Google App Engine Network Router Overload

Google Outage Report

October 22nd, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

Netflix Techblog Report

June 29th, 2012 - AWS US-East Zone Power Outage During Storm

AWS Outage Report

Netflix Techblog Report

June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report

February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report

August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report

April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

Netflix Techblog Report

February 24th, 2010 - Google App Engine Power Outage

Google Forum Report

July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report


  1. 2008 AWS S3 outage:

  2. I was just reading the Google Apps outage account, and I can't call it comprehensive. Unlike Amazon's reports, it gives you no clue what went wrong, where, or why. Things failed and got restarted.

    1. It will be interesting to see whether Google sees the value in being more open and detailed in these reports as they try to compete more directly with AWS.

  3. Regarding the April 2013 Google Global API outage -- it may be worth adding a link to their more detailed postmortem:

