Tuesday, March 26, 2013

Comment on How Netflix Is Ruining Cloud Computing

I wrote a long comment response to how-netflix-is-ruining-cloud-computing on Information Week, but they don't seem in a hurry to post it. Luckily I saved a copy so here it is:

There should be a http://techblog.netflix.com post in the next day or so that will give more context to the Cloud Prize and clarify most of the points above. However I will address some of the specific issues here.

Cloud 1.0 vs. 2.0?
I would argue that the way most people are doing cloud today is to forklift part of their existing architecture into a cloud and run a hybrid setup. That's what I would call Cloud 1.0. What Netflix has done is show how to build much more agile green field native cloud applications, which might justify being called Cloud 2.0. The specific IaaS provider used underneath, and whether you do this with public or private clouds is irrelevant to the architectural constructs we've explained.

The outages that have been mentioned were regional; they didn't apply to Netflix operations in Europe, for example. Our current work is to build tooling for multi-regional support on AWS (East coast/West coast), including the DNS management that was mentioned. This removes the failure mode with the least effort and disruption to our existing operations.

Other cloud vendors have a feature set and scale comparable to AWS in 2008-2009. We're still waiting for them to catch up. There are many promises but nothing usable for Netflix itself. However there is demand to use NetflixOSS for other smaller and simpler applications, in both public and private clouds, and Eucalyptus have demonstrated Asgard, Edda and Chaos Monkey running, and will ship soon in Eucalyptus 3.3. There are signs of interest from people to add the missing features to OpenStack, CloudStack and Google Compute so that NetflixOSS can also run on them.

You've completely missed the point of Edda. It does three important things. 1) If you run at large scale, your automation will overload the cloud API endpoint; Edda buffers this information and provides a query capability for efficient lookups. 2) Edda stores a history of your config; it's a CMDB that can be used to query for what changed. 3) Edda cross-integrates multiple data sources (the cloud API, our own service registry Eureka, and AppDynamics call flow information) and can be extended to include other data sources.
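
The first two points can be sketched in a few lines of Python. This is a toy illustration, not Edda's actual code or API: the class `InventoryCache`, its method names, and the `fetch_instances` callable are all hypothetical. The idea is that a single crawler polls the cloud API, every other tool queries the cached copy, and each changed snapshot is kept so you can ask what changed.

```python
import copy
import time


class InventoryCache:
    """Toy Edda-style cache: one crawler polls the cloud API, everyone else reads the cache."""

    def __init__(self, fetch_instances):
        # fetch_instances is a hypothetical callable returning {instance_id: attributes}.
        self._fetch = fetch_instances
        self._history = []  # list of (timestamp, snapshot), oldest first

    def refresh(self, now=None):
        """Poll the upstream API once; record a snapshot only if something changed."""
        snapshot = copy.deepcopy(self._fetch())
        if not self._history or self._history[-1][1] != snapshot:
            stamp = time.time() if now is None else now
            self._history.append((stamp, snapshot))

    def query(self, **filters):
        """Answer lookups from the latest snapshot without touching the cloud API."""
        if not self._history:
            return []
        latest = self._history[-1][1]
        return sorted(
            instance_id
            for instance_id, attrs in latest.items()
            if all(attrs.get(key) == value for key, value in filters.items())
        )

    def changes_since(self, timestamp):
        """Return (added, removed, modified) instance ids relative to the state at `timestamp`."""
        older = [snap for stamp, snap in self._history if stamp <= timestamp]
        before = older[-1] if older else {}
        after = self._history[-1][1] if self._history else {}
        added = sorted(set(after) - set(before))
        removed = sorted(set(before) - set(after))
        modified = sorted(i for i in set(after) & set(before) if after[i] != before[i])
        return added, removed, modified
```

However many tools call `query`, the upstream API only sees the calls made by `refresh`, and `changes_since` plays the CMDB role of answering what changed.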

If you want to spin up 500 identical instances, having them each run Chef or Puppet after they boot creates a failure mode dependency on the Chef/Puppet service, wastes startup time, and if anything can go wrong with the install you end up with an inconsistent set of instances. By using AMInator to run Chef once at build time, there is less to go wrong at run time. It also makes red/black pushes and roll-backs trivial and reliable.
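
A back-of-the-envelope calculation shows why this matters at scale. The sketch below assumes each configuration run succeeds independently with the same probability; the 0.999 figure is purely illustrative, not a measured Chef or Puppet success rate.

```python
def fleet_success_probability(per_run_success: float, config_runs: int) -> float:
    """Probability that every configuration run succeeds, assuming independent runs."""
    if not 0.0 <= per_run_success <= 1.0:
        raise ValueError("per_run_success must be between 0 and 1")
    if config_runs < 0:
        raise ValueError("config_runs must be non-negative")
    return per_run_success ** config_runs


PER_RUN_SUCCESS = 0.999  # assumed, illustrative value
FLEET_SIZE = 500

# Boot-time configuration: one run per instance, so 500 chances to go wrong.
boot_time = fleet_success_probability(PER_RUN_SUCCESS, FLEET_SIZE)

# Baked image: one run at build time, then 500 identical copies of the result.
baked = fleet_success_probability(PER_RUN_SUCCESS, 1)

print(f"boot-time config, all {FLEET_SIZE} instances consistent: {boot_time:.3f}")
print(f"baked image, single build succeeds: {baked:.3f}")
```

Under these assumptions the boot-time approach yields a fully consistent fleet only about 61% of the time, while a failed bake is caught at build time before any instance launches.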

Cloud Prize
The prize includes a portability category. It's a broad category and might be won by someone who adds new language support to NetflixOSS (Erlang, Go, Ruby?) or someone who makes parts of NetflixOSS run on a broader range of IaaS options. The reality is that AWS is dominating cloud deployments today, so contributions that run on AWS will have the greatest utility for the largest number of people. The alternatives to AWS are being hyped by everyone else, and are showing some promise, but have some way to go.

We hope that NetflixOSS provides a useful driver for higher baseline functionality that more IaaS APIs can converge on, and move from 2008-era EC2 functionality to 2010-era EC2 functionality across more vendors. Meanwhile Netflix itself will be enjoying the benefits of 2013 AWS functionality like RedShift.


  1. Adrian--

    I read this as a "tree" response to a "forest" issue, but I'll respond with respect to both forest and trees.

    The forest is this: Netflix's cloud architecture--as seen through public talks and open source code--is fundamentally (a) so intertwined with AWS as to be essentially inseparable, and (b) significantly behind the best *general* open options for configuration management and orchestration. It also is far from "the Unix way" of having encapsulated/abstracted tools that can be interchanged with others to build a best-in-breed architecture.

    Your answer doesn't really do anything to address this "forest" argument: you defend the complete reliance that Netflix and (most of) its tools have on AWS based upon an analytical database that is really beside the point as far as cloud architectures go. (Don't get me wrong--I think RedShift is *awesome*, but its presence is completely irrelevant to a generalized reference cloud architecture, which is the power of NetflixOSS that's so concerning).

    Your defense of AMInator and Edda (I wish you'd defend Asgard also!) is ultimately a defense of why those solutions work for Netflix and its application and current architecture--but that's not the point. Obviously you're a smart and capable architect and you have reasons for using them at Netflix. The point is that--as they stand today--they're not promoting good *generalized* application architectures. You should be promoting Chef before you promote a tool that essentially encourages people to make horrible design decisions (in lieu of using Chef at all). You should be defending Netflix tools based upon standardized, reference deployments, not based on launching 500 VMs of the same exact machine which *is not exactly a common use case for the cloud*.

    Look--it's possible to write awesome and fabulous PHP code, but most PHP developers don't. One of the reasons why Netflix is now choosing Python is because the generalized Python developer writes consistent and good code. (We chose Python for the same reasons you did). But to someone who has no idea what a good cloud deployment looks like, the way AMInator sits out there--you're going to see a lot more people like the guy super-psyched to have built 25,000 AMIs over Twitter.

    The overall point of the piece is this: Netflix has a lot of power and clout in the cloud architecture world, and there are a whole lot of people looking to Netflix for guidance on how to deploy on the cloud. Netflix has made some choices (the "forest" above) that are flat-out bad choices if you take anything like a long-term approach to your cloud architecture. There is no historical precedent that you can cite as being a good example for being so intertwined with a single IT vendor. And it's way more important for people deploying on the cloud to know and understand configuration management than it is for them to have a tool that--for all intents and purposes--appears to have been built to bypass it.


  2. Adrian--

    Here's the tree response.

    Cloud 1.0 vs. 2.0:
    You say, "The specific IaaS provider used underneath, and whether you do this with public or private clouds is irrelevant to the architectural constructs we've explained." But of course, the Netflix contest has to do with your tools, not "architectural constructs" per se. And your tools are absolutely tied to a specific IaaS provider. And from an architectural perspective, while Netflix has done some awesome things, I think Zynga's architecture (including what they pay for what they get) is much more likely to be what best-practice enterprise cloud architectures look like in 5+ years. I don't know many cloud architects who are aware of the difference between Zynga and Netflix who would pick Netflix's implementation over Zynga's--again, largely because of the multi-cloud capabilities.

    My point in bringing up the outages was not to imply that they were international or fatal; it was to point out that Netflix's cloud architecture is not perfect (not that any cloud architecture is, but always good to point this out), and that one can tie at least one outage to Netflix's specific architectural decision to embrace a proprietary service (ELB) when other, non-proprietary, more resilient options were available. I'm happy that Netflix won't repeat an outage due to that specific proprietary service, but the overall philosophy of choosing AWS services over open options that are more flexible and more resilient remains.

    As long as you continue to force Netflix to use new and expanded Amazon-provided services over other options, you're creating a moving target that no other vendor will hit. The better path to take is: how should organizations design their architectures so that they can maintain portability and interoperability across multiple vendors, wherever possible? If, today, we wait for Google Compute Engine to be more widely available and tested, then shouldn't we be moving to abstraction layers for API communication, instead of doubling down and adopting more and more proprietary APIs? Perhaps AWS will release their API to the world and allow all businesses to use it openly, but they haven't yet, and so it's a very risky move to bet an architecture on AWS and any vendors (e.g., Eucalyptus) that AWS will bless. Perhaps what you're saying is that you'd be happy to see abstraction layers in the Netflix tools for working with other clouds, but *you're not actually saying that*. Please say that.

    Thanks for the details on Edda; my knowledge was from reading what was easily available on github: "Why did we create Edda? ... if we see a host with an EC2 hostname that is causing problems on one of our API servers then we need to find out what that host is and what team is responsible, Edda allows us to do this." Edda sounds like a great tool to take multi-cloud--would you consider suggesting that as a theoretically good project for the contest (no guarantees on prizes)?

    Your explanation of using Chef with AMInator makes a lot of sense in the "500 simultaneous instances" use case. Which is--you would admit--not a common circumstance amongst the people who use/will be using your cloud tools. And unfortunately, your first happy user of AMInator (on Twitter, at least) made over 25,000 Ubuntu AMIs with it--can you tell me why that would ever be a good architectural decision? AMInator strikes me as a tool like PHP or a GOTO statement--there are places where you should probably use them, but it's hard to argue that they should be part of any kind of "best practices" decision.


  3. [continued]

    Cloud Prize:
    The fact that only one out of ten prizes involves portability, and the fact that you take such an expansive view of portability to include adding language support to an existing tool (which has NOTHING to do with cloud portability!), shows that you really think that cloud portability is unimportant to Netflix. If Netflix wants to make that business decision, then fine. But I would argue that Netflix is a role model in the world, and has a lot of ears, and that it's just irresponsible for Netflix to lead the rest of the world on the same path.

    To the extent that Netflix is trying to exploit open source in the same way many companies do--to share code in exchange for getting additional development for free--I have no issues. Go for it. But I have a problem in the way that Netflix's tools and architectural decisions are taken as THE reference architecture. I write here, and I wrote the piece, not to try to convince you, Adrian, to change the way Netflix does things. I would like you to run the contest in such a way that it promotes portability and interoperability and make the judging panel less AWS-centric. But beyond that, I'm really writing these for those people out there considering whether Netflix's cloud architecture is something they should copy verbatim. (Don't!)

    Heidi Roizen, an entrepreneur-turned-VC, put it this way: "I don't ask 'what happens if?' ... I ask 'what happens when?'"--meaning that there are certain things that we know will happen, and if we aren't thinking about them and planning for them, we're not thinking strategically enough. (One of her examples: "what happens when we have self-driving cars?") It is a certainty that we will have viable IaaS competitors to AWS. But the attitude that is embedded in the Netflix cloud tools--and from what I see of the contest today--is one that essentially says, "we will look nowhere but AWS." And there is no thing that has been said in response to my piece that says otherwise. In fact, you don't even address my quoting of you in different places where you have excluded other options by fiat, regardless of price or functionality.

    And that is the problem.


  4. Thanks Joe,

    I replied to these points at the original article.

  5. Joe: you say:
    Netflix's cloud architecture--as seen through public talks and open source code--is fundamentally (a) so intertwined with AWS as to be essentially inseparable, and (b) significantly behind the best *general* open options for configuration management and orchestration.

    (a) is trivially true because all of the alternative architectures are (functionally) subsets of what AWS offers. We can ignore the trivial efforts to rework Amazon's query-API pattern into strict REST: this adds nothing except complexity for the developer. And the functional inadequacy of the competitors to AWS is so extreme (sorry Peder, Marten et al) that Netflix would have been forced to ignore (or, more likely, re-invent) large chunks of AWS functionality just to get their work done. That would be perverse.

    But that's a minor point. My biggest beef is with point (b). Amazon and Netflix are dramatically ahead of the curve, not behind it. The configuration management pattern you seem to prefer - just-in-time customization using Chef or Puppet - was pretty old school when Sun acquired CenterRun and built out N1 and Grid Engine. It's incredibly inefficient compared with early-bound EBS-backed AMIs.

    Arguably all interesting advances in computer science and software engineering occur when a resource that was previously scarce or expensive becomes cheap and plentiful. We've seen it with graphical user interfaces, interpreted languages, distributed storage, and SOA. Traditional late-bound configuration management treats machine images and VM instances as expensive; AWS and Netflix invite you to imagine the possibilities if they're effectively free. Welcome to the real Cloud 2.0...

    1. Reposted with a little more detail here: http://geoffarnold.com/?p=4349

  6. Geoff--

    Let me reiterate: my concern is not to say that Netflix should be doing anything differently with its own internal use of the cloud; in both the original piece and in my responses to comments, I have been clear that the issue is NOT "how should Netflix run its business".

    My concern is about Netflix's cloud tools and their embrace by novices to the cloud. Adrian's attitude (as pointed out by several quotes raised in my article) toward not looking at other vendors *period* and making it clear he thinks that no other vendors will ever get there, which is so baked into the code base today, is not good for the average business/enterprise from a risk-management, high availability, or cost perspective going forward. Again, the main question is not, "Does Netflix run its cloud properly?" or even "Does Netflix have anything worthwhile to tell people about the cloud?" The main question is: "Is Netflix using its position of power in a way that will lead new adopters of the cloud down the best path?"

    There's a secondary question here as well, which is, "When designing cloud architectures, for what should we optimize?" Again, the key here is to ask this question for the average enterprise, not for Netflix's specific use case. When we're talking about movement to the cloud, the biggest challenge for many enterprises is the loss of control for IT, and so high availability and cost are much more relevant and salient issues than "can we launch 5,000 VMs on demand in six different continents within 10 minutes."

    Both of these questions are asked in light of a Cloud Prize contest which is so attached at the hip to AWS and so fundamentally uninterested in mitigating various cloud risks by having the ability to use other vendors that I feel that I have stumbled into a religious debate, not a debate on the best business use of the cloud.

    Instead of just asserting, ad nauseam, that "no other cloud provider can possibly provide as much as AWS", can you actually back that up with some specific examples *that talk about standard enterprise cloud use cases*? Can we have a discussion about the negative risks that Netflix's cloud adoption leaves it open to?

    Or is this not a marketplace of ideas, and instead a religious debate?

  7. Straw-man alert: nobody is claiming that "no other cloud provider can possibly provide as much as AWS". The point is that nobody is presently doing so, and that until they do it would be absurd for Netflix to "dumb down" their architecture (and their contest) to avoid AWS-only features.

    Furthermore I'm not aware of anyone (at Netflix or elsewhere) saying that one size fits all. The Netflix technologies are a great fit for scale-out webscale businesses. The requirements for enterprise private clouds, with compliance and legacy interoperability constraints, are quite different. Frankly, this area is far less developed, and the companies involved are naturally less willing to go public with their immature solutions. That will change over time. Until it does, it seems perverse to castigate Netflix for focussing on their own needs.

  8. I'm not castigating Netflix for focusing on its own needs; I'm castigating Netflix for not taking its position of power and thinking about the impact of wholesale selling Netflix's implementation of how to work on the cloud as a reference architecture. (Isn't the contest called, "Fix the Cloud" as opposed to "Help Netflix Improve Its Cloud Tools"?)

    Netflix has the ability to dramatically help the overall cloud computing community by helping to promote and push for practices that will increase the likelihood that new cloud deployments are maximally successful for the long term. For example:

    1. Netflix could use its position of influence with AWS to push AWS to open the AWS API as a standard, as opposed to leaving it in this gray FUD area that keeps other IaaS providers from using it/supporting it for customer use. Why wouldn't Adrian push for this? It would immediately remove the largest obstacle to multi-cloud interactions with NetflixOSS.

    2. Netflix could make it clearer (GitHub pages, contest descriptions, etc) that Netflix's decision to integrate so tightly with AWS is predicated upon Netflix's specific use case and AWS's unique current position in filling that special use case, and that adopting a single-vendor approach like Netflix may not be in other companies' best interests.

    3. Netflix could build abstraction layers into its tools and encourage committers to demonstrate replacements. Start with SQS (abstract to interact with RabbitMQ). Then SimpleDB (Mongo?). Or make it a bigger part of the contest. The idea that work like this *might* get one-tenth of the prize money absolutely shows the limited interest Netflix has in promoting portability.
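
    As a rough illustration of the kind of abstraction layer point 3 describes (this is not NetflixOSS code; `MessageQueue`, `InMemoryQueue`, and `drain` are hypothetical names), the application can be written against a small interface, with one adapter per backend. The in-memory class below stands in for an SQS or RabbitMQ adapter, which would expose the same two methods.

```python
from collections import deque
from typing import Optional, Protocol


class MessageQueue(Protocol):
    """Hypothetical minimal interface the application codes against."""

    def send(self, body: str) -> None: ...

    def receive(self) -> Optional[str]: ...


class InMemoryQueue:
    """Stand-in backend; an SQS or RabbitMQ adapter would expose the same two methods."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def send(self, body: str) -> None:
        self._items.append(body)

    def receive(self) -> Optional[str]:
        # Return the oldest message, or None when the queue is empty.
        return self._items.popleft() if self._items else None


def drain(queue: MessageQueue) -> list:
    """Application logic depends only on the interface, not on a vendor SDK."""
    messages = []
    while (body := queue.receive()) is not None:
        messages.append(body)
    return messages
```

    Swapping vendors then means writing one new adapter rather than touching every caller.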

    Ultimately, I'm trying to lobby Netflix to acknowledge that with great power comes great responsibility. You have great power because you've done some amazing things on the cloud. I think you have a responsibility to those novices out there to give them good, generalized advice on how to deploy on the cloud. If you have a more libertarian view--that you're going to act in your best interest, and that it really doesn't matter what else happens to others--then you absolutely have the right. But I'm not going to let you off the hook.

  9. On point 1. you are making an unwarranted assumption. I personally asked Andy Jassy that question two years ago and got the same answer that he gives everyone, which is that you don't standardize a moving target. They are focused on evolving their APIs to meet their customers' needs. What they have done since then to address this issue is to create a mechanism where other vendors can license the API and its test suites, and Eucalyptus has done that. This meets the business need, and you should be asking OpenStack and CloudStack why they haven't licensed the API, not complaining about AWS and Netflix. So far you have ignored variations on this answer about three times, and are falsely claiming that other IaaS providers do not support the AWS API, so clearly you just don't want to hear it.

    2. We don't need to tell people how to evaluate technologies. Each of our projects clearly says what it does. About half the projects have nothing to do with AWS. You may as well complain that we've written everything in Java, and we're setting a bad example because we should have used another language.

    3. We do have abstractions. You are also ignoring all the Astyanax/Cassandra work we've done. There are remnants of SimpleDB in the code, but they are being replaced.

    We do think we are giving good generalized advice, but we aren't giving the advice you like. You're coming across as if the OpenStack/Chef approaches are the only valid way to do things. The world isn't that black and white.

  10. Thank you for your answers. Of note, we run on AWS and CloudStack with Chef.

    On #1--I have not seen any answer that explained (a) you see the issues with the AWS API, and (b) you would support and did advocate for opening the API. Those are both new pieces of information to me.

    That said, it's really hard to understand why Amazon can't just agree not to sue anyone who wants to implement the AWS API. That has absolutely nothing to do with whether it remains under development; it's a business decision. And while it's entirely within Amazon's right to hold copyright protection in the API, I would still argue that it's in your best interest--and Netflix's best interest--to lobby Amazon to make a "we will not sue" pledge.

    When I say that other IaaS providers don't support the AWS API, I'm talking about large public IaaS providers (e.g., Microsoft, Google, Rackspace, Savvis, SoftLayer). I think you may have a different definition of "IaaS provider". When I use the term, I mean a company that provides the servers, power, electricity, network, and the platform for launching/managing the VMs (e.g. CloudStack). If you're aware of a large public IaaS provider (my definition) that supports the AWS API, please tell me--I'd love to know. I'm not aware of one (and thus am not consciously "falsely claiming" anything here).

    I think our disagreement about #2 really has to do with how you view Netflix and how I view Netflix. I suspect you live and work with incredibly bright and talented people who deeply care about best practices and stay educated. The rest of the world is not so lucky. The vast majority of cloud application deployments that I've seen--and I've seen more than 100--are just ports/recreations of images to AWS US-East. There is no configuration management and no orchestration. You and your team can look at configuration management as "so 2008", but let me tell you--the vast majority of web applications out there will be lucky to be using configuration management by 2018.

    So Netflix's cloud architecture stands as a great way to solve the Netflix use case. But there are going to be a lot of developers out there who embrace the Netflix tools as their toolkit to the cloud, and by doing so, will be embracing (a) Netflix's business decision (embraced in its code) to sole-source with AWS, and (b) will end up dealing with a lot of architectural decisions made specifically for Netflix's needs. My question to you is: do you care if people who shouldn't be using the Netflix tools (for business and/or tech reasons) build their entire cloud architecture around them?
    Picking the Netflix tools is significantly different from selecting a programming language: (1) choosing a programming language doesn't lock you into a business relationship with a single large public-cloud vendor; (2) cloud architecture is new, and most people are doing it very badly, so they don't know how to make a proper evaluation; most organizations are much better equipped to evaluate a programming language choice.

    On #3--I think it's great that you're replacing the remnants of SimpleDB in the code (again, this is the first place that I've seen you or anyone on your team mention this--and the wiki pages and source code I've read through didn't address that). Are you doing the same with SQS? Have you drawn a line at the core pieces of AWS that any vendor needs to provide? (S3+EC2 [including EBS and security groups]?)

    I think the only fundamental disagreement we have is that I think that the right business decision is to get capable of going multi-cloud as soon as possible--and I don't mean buying Eucalyptus and standing up your own data center. I mean having a large vendor (e.g., Rackspace, Microsoft, Google, IBM) that you can shift loads to within a relatively short amount of time (weeks). From a risk and cost management standpoint, this is a no-brainer.

  11. Other large-ish vendors: Rackspace == OpenStack, they should implement the missing features and license the AWS API as discussed already. Microsoft have barely any support for Linux and no large memory instance types. Google are still in closed beta, and IBM etc. are just nowhere in this space. It's a no-brainer that we should wait until something credible appears. All these vendors are welcome to add the missing features and port our code, and some of them are already talking about doing that.

