I plan to collect reports here over time, and I welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a single zone within a region. In some cases there are secondary impacts with a wider scope but a shorter duration, such as a regional control plane becoming unavailable for a short time during a zone outage.
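To make the naming convention concrete, here is a minimal sketch of how an entry title could be assembled from its parts. The `OutageEntry` class and its field names are hypothetical, introduced only for illustration, and the leading date is an assumption based on how the entries in this list are written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageEntry:
    """Hypothetical record for one entry in the list (illustration only)."""
    date: str        # e.g. "June 29th, 2012"
    vendor: str      # e.g. "AWS"
    scope: str       # global, a region, or a zone in a region
    cause: str = ""  # may be empty when no cause is named in the title

    def title(self) -> str:
        # Join {vendor} {primary scope} {cause}, skipping any empty part.
        name = " ".join(p for p in (self.vendor, self.scope, self.cause) if p)
        return f"{self.date} - {name}"


entry = OutageEntry("June 29th, 2012", "AWS", "US-East Zone",
                    "Power Outage During Storm")
print(entry.title())
```

Keeping the scope as its own field makes it easy to group entries by blast radius (zone, region, global) later on.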
This post was written while researching my AWS Re:Invent talk.
Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha
Video: http://www.youtube.com/watch?v=dekV3Oq7pH8
November 18th, 2014 - Azure Global Storage Outage
Microsoft Reports
http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/
http://azure.microsoft.com/blog/2014/12/17/final-root-cause-analysis-and-improvement-areas-nov-18-azure-storage-service-interruption/
January 10th, 2014 - Dropbox Global Outage
Dropbox Report
April 20th, 2013 - Google Global API Outage
Google Report
February 22nd, 2013 - Azure Global Outage Cert Expiry
Azure Report
December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten
AWS Service Event Report
http://aws.amazon.com/message/680587/
Netflix Techblog Report
http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
October 26th, 2012 - Google AppEngine Network Router Overload
Google Outage Report
October 22nd, 2012 - AWS US-East Zone EBS Data Collector Bug
AWS Outage Report
Netflix Techblog Report
June 29th, 2012 - AWS US-East Zone Power Outage During Storm
AWS Outage Report
Netflix Techblog Report
June 13th, 2012 - AWS US-East SimpleDB Region Outage
AWS Outage Report
February 29th, 2012 - Microsoft Azure Global Leap-Year Outage
Azure Outage Report
August 17th, 2011 - AWS EU-West Zone Power Outage
AWS Outage Report
April 2011 - AWS US-East Zone EBS Outage
AWS Outage Report
Netflix Techblog Report
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
2008 AWS S3 outage:
http://status.aws.amazon.com/s3-20080720.html
Thanks! Another interesting case.
Added it to the list.
Just reading the Google Apps outage account, and I can't call that comprehensive. Unlike Amazon's reports, you have no clue what went wrong, where, or why. Things failed and got restarted.
It will be interesting to see if Google sees the value in being more open and detailed for these reports as they try to compete more directly with AWS.
Regarding the April 2013 Google Global API outage: it may be worth adding a link to their more detailed postmortem: http://googledevelopers.blogspot.ie/2013/05/google-api-infrastructure-outage_3.html