Adrian Cockcroft's Blog: azure

Showing posts with label azure. Show all posts

Thursday, April 03, 2014

Public Cloud Instance Pricing Wars - Detailed Context and Analysis

As part of my opening keynote at Cloud Connect in Las Vegas I summarized the latest moves in cloud, the slides are available via the new Powered by Battery site as "The Good the Bad and the Ugly: Critical Decisions for the Cloud Enabled Enterprise". This blog post is a detailed analysis of just part of what happened.

Summary points

AWS users should migrate from obsolete m1, m2, c1, c2 to the new m3, r3, c3 instances to get better performance at lower prices with the latest Intel CPUs.
Any cloud benchmark or cost comparison that uses the AWS m1 family as a basis should be called out as bogus benchmarketing.
AWS and Google instance prices are essentially the same for similar specs.
Microsoft doesn’t appear to have the latest Intel CPUs generally available and only matches prices for obsolete AWS instances.
IBM Softlayer pricing is still higher, especially on small instance types
Google's statement that prices should follow Moore’s law implies that we should expect prices to halve every 18-24 months
Pricing pages by AWS, Google Compute Engine, Microsoft Azure, IBM Softlayer
Adrian’s spreadsheet summary of instances from the above vendors at http://bit.ly/cloudinstances
Analysis of the prices by Rightscale

On Tuesday 25^th March 2014 Google announced some new features and steep price cuts, the next day Amazon Web Services also announced new features and matching price cuts. On Monday 31^st March Microsoft Azure also reduced prices. Many pundits repeated talking points from press releases in blog posts but unfortunately there was little attempt to understand what really happened, and explain the context and outcome. When I wrote up a summary for my opening keynote at Cloud Connect on 31^st March I looked at the actual end result and came up with a different perspective and a list of gaps.

I’m only going to discuss instance types and on-demand prices here. There was a lot more in the announcements that other people have done a good job of summarizing. The Rightscale blog linked above also gives an accurate and broader view on what was announced. I will discuss other pricing models beyond on-demand in future blog posts.

There are some things you need to know to get the right background context for the instance price cuts. The most important is to understand that AWS has two generations of instance types, and is in a transition from Intel CPU technology they introduced five or more years ago to a new generation introduced in the last year. The new generation CPUs are based on an architecture known as Sandybridge. The latest tweak is called Ivybridge and has incremental improvements that give more cores per chip and slightly higher performance. Since Google is a recent entrant to the public cloud market, all their instances types are based on Sandybridge. To correctly compare AWS prices and features with Google, there is a like-for-like comparison that can be made. AWS is encouraging the transition by pricing its newer faster instances at a lower cost than the older slower ones. In the recent announcement, AWS cut the prices by obsolete instance type families by a smaller percentage than the newer instance type families, so the gap has just widened.

Old AWS instance types have names starting with m1, m2 and c1, c2. They all have newer replacements known as m3, r3 and c3 except the smallest one – the m1.small. The newer instances have a similar amount of RAM and CPU threads, but the CPU performance is significantly higher. The new equivalents also replace small slow local disks with smaller but far faster and more reliable solid-state disks, and the underlying networks move from 1Gbit/s to 10Gbit/s. The newer instance families should also have lower failure rates.

Most people are much more familiar with the old generation instance types, and competitors write their press releases they are able to get away with claiming that they are both faster and cheaper than AWS, by comparing against the old generation products. This is an old “benchmarketing” trick – compare your new product against the competitions older and more recognizable product.

For the most commonly used instance types there is a close specification match between the AWS m3 and the Google n1-standard. They are also exactly the same price per hour. Since AWS released its changes after Google, this implies that AWS deliberately matched Google’s price. The big architectural difference between the vendors is that Google instances are diskless, all their storage is network attached, while AWS have various amounts of SSD included. The AWS hypervisor also makes slightly more memory available per instance, and ratings for the c3 imply that AWS is supplying a slightly higher CPU clock rate for that instance type. I think that this is because AWS has based its compute intensive c3 instance types on a higher clock rate Ivybridge CPU rather than the earlier Sandybridge specification. For the high memory capacity instance types it is a little different. The Google n1-himem instances have less memory available than the AWS r3 equivalents, and cost a bit less. This makes intuitive sense as this instance type is normally bought for its memory capacity.

Microsoft previously committed to match AWS prices, and in their announcement their comparisons matched the m1 range exactly at it’s new price, and they compared their memory oriented A5 instance as cheaper than an old m2.xlarge, but the A5 is an older slower CPU type, more expensive ($0.22 vs $0.18) and has less memory (14GB vs. 15GB) than the AWS r3.large. The common CPU options on Azure are aligned with the older AWS instance types. Azure does have Intel Sandybridge CPUs for compute use cases as the A8 and A9 models, but I couldn't find list pricing for them and they appear to be a low volume special option. The Azure pricing strategy ignores the current generation AWS product, so the price match guarantee doesn’t deliver. In addition the Google and AWS price changes were effective from April 1^st, but Azure takes effect May 1^st.

IBM Softlayer has a choose-what-you-want model rather than a specific set of instance types. The smaller instances are $0.10/hr where AWS and Google n1-standard-1 are $0.07/hr. As you pick a bigger instance type on Softlayer the cost doesn’t scale up linearly, while Google and AWS double the price each time the configuration doubles. The Softlayer equivalent of the n1-standard-16 is actually slightly lower cost than Google. Softlayer pricing on most instances is in the same ballpark as AWS and Azure were before the cuts, so I expect they will eventually have to cut prices to match the new level.

Gaps and Missing Features

The remaining anomaly in AWS pricing is the low-end m1.small. There is no newer technology equivalent at present, so I wouldn’t be surprised to see AWS do something interesting in this space soon. Generally AWS has a much wider range of instances than Google, but AWS is missing an m3.4xlarge to match Google's n1-standard-16, and the Google hicpu range has double the CPU to RAM ratio of the AWS c3 range so they aren’t directly comparable.

Google has no equivalent to the highest memory and CPU AWS instances, and has no local disk or SSD options. Instead they have better attached disk performance than AWS Elastic Block Store, but attached disk adds to the instance cost, and can never be as fast as local SSD inside the instance.

Microsoft Azure needs to refresh its instance type options, it has a much smaller range, older slower CPUs, and no SSD options. It doesn’t look particularly competitive.

Conclusion

If you buy hardware and capitalize it over three years, and later on there is a price cut; you don’t get to reduce your monthly costs. Towards the end your CPUs are getting old, leading to less competitive response times and higher failure rates. With public cloud vendors driving the costs down several times a year and upgrading their instances, your model of public vs. private costs needs to factor in something like Moore’s law for cost reductions and a technology refresh more often than every three years. Google actually said we should expect Moore’s law to apply in their announcement, which I interpret to mean that we can expect costs to halve about every 18-24 months. This isn’t a race to zero; it’s a proportional reduction every year. Over a three-year period the cost at the end is a third to a quarter of the cost at the start.

I still hear CIOs worry that cloud vendor lock-in would let them raise prices. This ruse is used to justify private cloud investments. Even without switching vendors, you will see repeated price reductions for the public cloud systems you are already using. This was the 42^nd price cut for AWS, the argument is ridiculous.

I’ve previously published presentation materials on costoptimization with AWS. I’m researching this area and over the coming months will publish a series of posts on all aspects of cloud optimization.

Friday, November 16, 2012

Cloud Outage Reports

The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. I've included some Google and Azure outages here because they illustrate different failure modes that should be taken into account. Recent AWS and Azure outage reports have far more detail than Google outage reports.

I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a zone in the region. In some cases there are secondary impacts with a wider scope but shorter duration such as regional control planes becoming unavailable for a short time during a zone outage.

This post was written while researching my AWS Re:Invent talk.
Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha
Video: http://www.youtube.com/watch?v=dekV3Oq7pH8

November 18th, 2014 - Azure Global Storage Outage

Microsoft Reports

http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/

http://azure.microsoft.com/blog/2014/12/17/final-root-cause-analysis-and-improvement-areas-nov-18-azure-storage-service-interruption/

January 10th, 2014 - Dropbox Global Outage

Dropbox Report

https://tech.dropbox.com/2014/01/outage-post-mortem/

April 20th, 2013 - Google Global API Outage

Google Report

http://googledevelopers.blogspot.com/2013/05/google-api-infrastructure-outage.html

February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report

http://blogs.msdn.com/b/windowsazure/archive/2013/03/01/details-of-the-february-22nd-2013-windows-azure-storage-disruption.aspx

December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

http://aws.amazon.com/message/680587/

Netflix Techblog Report

http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html

October 26th, 2012 - Google AppEngine Network Router Overload

Google Outage Report

http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html

October 22, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

http://aws.amazon.com/message/680342/

Netflix Techblog Report

http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html

June 29th 2012 - AWS US-East Zone Power Outage During Storm

AWS Outage Report

http://aws.amazon.com/message/67457/

Netflix Techblog Report

http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html

June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report

http://aws.amazon.com/message/65649/

February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report

http://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx

August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report

http://aws.amazon.com/message/2329B7/

April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

http://aws.amazon.com/message/65648/

Netflix Techblog Report

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

February 24th, 2010 - Google App Engine Power Outage

Google Forum Report

https://groups.google.com/forum/#!topic/google-appengine/p2QKJ0OSLc8

July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report

http://status.aws.amazon.com/s3-20080720.html

Saturday, June 16, 2012

Cloud Application Architectures: GOTO Aarhus, Denmark, October

Thanks to an invite via Michael Nygard, I ended up as a track chair for the GOTO Aarhus conference in Denmark in early October. The track subject is Cloud Application Architectures, and we have three speakers lined up. You can register with a discount using the code crof1000.

We are starting out with a talk from the Citrix Cloudstack folks about how to architect applications on open portable clouds. These implement a subset of functionality but have more implementation flexibility than public cloud services. [Randy Bias of Cloudscaling was going to give this talk but had to pull out of the trip due to other commitments].

To broaden our perspective somewhat, and get our hands dirty with real code, the next talk is a live demonstration by Ido Flatow, Building secured, scalable, low-latency web applications with the Windows Azure Platform.

In this session we will construct a secured, durable, scalable, low-latency web application with Windows Azure - Compute, Storage, CDN, ACS, Cache, SQL Azure, Full IIS, and more. This is a no-slides presentation!

Finally I will be giving my latest update on Globally Distributed Cloud Applications at Netflix.

Netflix grew rapidly and moved its streaming video service to the AWS cloud between 2009 and 2010. In 2011 the architecture was extended to use Apache Cassandra as a backend, and the service was internationalized to support Latin America. Early in 2012 Netflix launched in the UK and Ireland, using the the combination of AWS capacity in Ireland and Cassandra to create a truly global backend service. Since then the code that manages and operates the global Netflix platform is being released as a series of open source projects at netflix.github.com (Asgard, Priam etc.). The platform is structured as a large scale PaaS, strongly leveraging advanced features of AWS to deploy many thousands of instances. The platform has primary language support for Java/Tomcat with most management tools built using Groovy/Grails and operations tooling in Python. Continuous integration and deployment tooling leverages Jenkins, Ivy/Gradle, Artifactory. This talk will explain how to build your own custom PaaS on AWS using these components.

There are many other excellent speakers at this event, which is run by the same team as the global series of QCon conferences, unfortunately, the cloud track runs at at the same time as Michael Nygard and Jez Humble on Continuous Delivery and Continuous Integration, however I'm doing another talk in the NoSQL track, (along with Martin Fowler and Coda Hale). Running Netflix on Cassandra in the Cloud.

Netflix used to be a traditional Datacenter based architecture using a few large Oracle database backends. Now it is one of the largest cloud based architectures, with master copies of all data living in Cassandra. This talk will discuss how we made the transition, how we automated and open sourced Cassandra management for tens of clusters and hundreds of nodes using Priam and Astyanax, backups, archiving and performance and scalability benchmarks.

I'm looking forward to meeting old friends, getting to know some new people, and visiting Denmark for the first time. See you there!

Adrian Cockcroft's Blog

Archive

Thursday, April 03, 2014

Public Cloud Instance Pricing Wars - Detailed Context and Analysis

Friday, November 16, 2012

Cloud Outage Reports

November 18th, 2014 - Azure Global Storage Outage

Microsoft Reports

January 10th, 2014 - Dropbox Global Outage

Dropbox Report

April 20th, 2013 - Google Global API Outage

Google Report

February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report

December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

Netflix Techblog Report

October 26th, 2012 - Google AppEngine Network Router Overload

Google Outage Report

October 22, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

Netflix Techblog Report

June 29th 2012 - AWS US-East Zone Power Outage During Storm

AWS Outage Report

Netflix Techblog Report

June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report

February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report

August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report

April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

Netflix Techblog Report

February 24th, 2010 - Google App Engine Power Outage

Google Forum Report

July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report

Saturday, June 16, 2012

Cloud Application Architectures: GOTO Aarhus, Denmark, October