Monday, March 19, 2012

Ops, DevOps and PaaS (NoOps) at Netflix

There has been a sometimes heated discussion on Twitter about the term NoOps recently, and I've been quoted extensively as saying that NoOps is the way developers work at Netflix. However, there are teams at Netflix that do traditional Operations, and teams that do DevOps as well. To try to clarify things I need to explain the history and current practices at Netflix in chunks of more than 140 characters at a time.

When I joined Netflix about five years ago, I managed a development team, building parts of the web site. We also had an operations team who ran the systems in the single datacenter that we deployed our code to. The systems were high-end IBM P-series virtualized machines with storage on a virtualized Storage Area Network. The idea was that this was reliable hardware with great operational flexibility, so that developers could assume low failure rates and concentrate on building features. In reality we had the usual complaints about how long it took to get new capacity, the lack of consistency across supposedly identical systems, and failures in Oracle, the SAN and the network that took the site down too often and for too long.

At that time we had just launched the streaming service, and it was still an experiment, with little content and no TV device support. As we grew streaming over the next few years, we saw that we needed higher availability and more capacity, so we added a second datacenter. This project took far longer than the initial estimates, and it was clear that deploying capacity at the scale and rates we were going to need as streaming took off was a skill set that we didn't have in-house. We tried bringing in new ops managers and new engineers, but they were always overwhelmed by the firefighting needed to keep the current systems running.

Netflix is a developer-oriented culture, from the top down. I sometimes have to remind people that our CEO Reed Hastings was the founder and initial developer of Purify, which anyone developing serious C++ code in the 1990s would have used to find memory leaks and optimize their code. Pure Software merged with Atria and Rational before being swallowed up by IBM. Reed left IBM and formed Netflix. Reed hired a team of very strong software engineers who are now the VPs who run developer engineering for our products. When we were deciding what to do next, Reed was directly involved in deciding that we should move to the cloud, and he even pushed us to build an aggressively cloud-optimized architecture based on NoSQL. Part of that decision was to outsource the problems of running large scale infrastructure and building new datacenters to AWS. AWS has far more resources to commit to getting cloud to work and scale, and to building huge datacenters. We could leverage this rather than try to duplicate it at a far smaller scale, with greater certainty of success. So the budget and responsibility for managing AWS and figuring out cloud was given directly to the developer organization, and the ITops organization was left to run its datacenters. In addition, the goal was to keep datacenter capacity flat while growing the business rapidly by leveraging additional capacity on AWS.

Over the next three years, most of the ITops staff left and were replaced by a smaller team. Netflix has never had a CIO, but we now have an excellent VP of ITops, Mike Kail (@mdkail), who now runs the datacenters. These still support the DVD shipping functions of Netflix USA, and he also runs corporate IT, which is increasingly moving to SaaS applications like Workday. Mike runs a fairly conventional ops team and is usually hiring, so there are sysadmin, database, storage and network admin positions. The datacenter footprint hasn't increased since 2009, although there have been technology updates, and the overall size is order-of-magnitude a thousand systems.

As the developer organization started to figure out cloud technologies and build a platform to support running Netflix on AWS, we transferred a few ITops staff into a developer team that formed the core of our DevOps function. They build the Linux-based base AMI (Amazon Machine Image), and after a long discussion we decided to leverage developer-oriented tools such as Perforce for version control, Ivy for dependencies, Jenkins to automate the build process and Artifactory as the binary repository, and to construct a "bakery" that produces complete AMIs that contain all the code for a service. Along with AWS Autoscale Groups this ensured that every instance of a service would be totally identical. Notice that we didn't use the typical DevOps tools Puppet or Chef to create builds at runtime. This is largely because the people making decisions are development managers, who have been burned repeatedly by configuration bugs in systems that were supposed to be identical.
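
To make the bake-once-then-autoscale idea concrete, here is a minimal sketch of that flow. It is not our bakery code (which isn't shown here); it just illustrates the shape of the process using the boto3 AWS client in Python, and the instance ID, AMI name, instance type and group name are made-up placeholders.

```python
# Hypothetical sketch of "bake a complete AMI, then autoscale identical copies".
# Not the actual Netflix bakery; all names and IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# 1. Bake: snapshot a fully installed build instance into an immutable AMI.
#    The build instance already has the base OS plus the service's artifacts
#    installed (e.g. pulled from the binary repository), so nothing is
#    configured at boot time.
bake = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # placeholder build instance
    Name="myservice-1.42-baked-ami",    # placeholder AMI name
)
ami_id = bake["ImageId"]
ec2.get_waiter("image_available").wait(ImageIds=[ami_id])

# 2. Register a launch configuration that points at the baked AMI.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="myservice-1.42",
    ImageId=ami_id,
    InstanceType="m1.large",
)

# 3. Create the autoscale group; every instance it launches is identical
#    because all the code was installed before the AMI was baked.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="myservice-v042",
    LaunchConfigurationName="myservice-1.42",
    MinSize=3,
    MaxSize=30,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```

Because the only per-instance work left at boot time is starting the service, there is very little that can diverge between supposedly identical systems.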

By 2012 the cloud capacity has grown to be order-of-magnitude 10,000 instances, ten times the capacity of the datacenter, running in nine AWS Availability Zones (effectively separate datacenters) on the US East and West coasts and in Europe. A handful of DevOps engineers working for Carl Quinn (@cquinn - well known from the Java Posse podcast) are coding and running the build tools and bakery, and updating the base AMI from time to time. Several hundred development engineers use these tools to build code, run it in a test account in AWS, then deploy it to production themselves. They never have to have a meeting with ITops, or file a ticket asking someone from ITops to make a change to a production system, or request extra capacity in advance. They use a web based portal to deploy hundreds of new instances running their new code alongside the old code, put one "canary" instance into traffic, and if it looks good the developer flips all the traffic to the new code. If there are any problems they flip the traffic back to the previous version (in seconds), and if it's all running fine, some time later the old instances are automatically removed. This is part of what we call NoOps.

The developers used to spend hours a week in meetings with Ops discussing what they needed, figuring out capacity forecasts and writing tickets to request changes for the datacenter. Now they spend seconds doing it themselves in the cloud. Code pushes to the datacenter are rigidly scheduled every two weeks, with emergency pushes in between to fix bugs. Pushes to the cloud are as frequent as each team of developers needs them to be; incremental agile updates several times a week are common, and some teams are working towards several updates a day. Other teams and more mature services update every few weeks or months. There is no central control; the teams are responsible for figuring out their own dependencies and managing the AWS security groups that restrict who can talk to whom.
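
The canary-then-flip push described above could be sketched roughly as below. This is illustrative only: it assumes a classic Elastic Load Balancer fronting the service and uses raw boto3 calls, whereas the real portal works at a higher level, and the load balancer name, instance IDs and canary threshold are invented for the example.

```python
# Toy sketch of a canary-then-flip push behind a classic ELB.
# Not the Netflix deployment portal; names and thresholds are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")
LOAD_BALANCER = "myservice-frontend"   # placeholder ELB name

def flip_traffic(old_instance_ids, new_instance_ids):
    """Send traffic to the new code by swapping which instances the ELB uses."""
    elb.register_instances_with_load_balancer(
        LoadBalancerName=LOAD_BALANCER,
        Instances=[{"InstanceId": i} for i in new_instance_ids],
    )
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=LOAD_BALANCER,
        Instances=[{"InstanceId": i} for i in old_instance_ids],
    )

def canary_looks_good(canary_error_rate, baseline_error_rate, tolerance=1.2):
    """Promote only if the canary's error rate is close to the old fleet's."""
    return canary_error_rate <= baseline_error_rate * tolerance

# The developer puts one canary instance into traffic, watches its metrics,
# and then either flips everything forward or rolls back in seconds.
if canary_looks_good(canary_error_rate=0.011, baseline_error_rate=0.010):
    flip_traffic(old_instance_ids=["i-old1", "i-old2"],
                 new_instance_ids=["i-new1", "i-new2"])
else:
    pass  # leave traffic on the old instances and tear down the new group
```

The key property is that both versions exist side by side, so the "rollback" is just pointing traffic back at instances that never went away.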

Automated deployment is part of the normal process of running in the cloud. The other big issue is what happens if something breaks. Netflix ITops always ran a Network Operations Center (NOC) which was staffed 24x7 with system administrators. They were familiar with the datacenter systems, but had no experience with cloud. If there was a problem, they would start and run a conference call, and get the right people on the call to diagnose and fix the issue. As the Netflix web site and streaming functionality moved to the cloud it became clear that we needed a cloud operations reliability engineering (CORE) team, and that it would be part of the development organization. The CORE team was lucky enough to get Jeremy Edberg (@jedberg - well known from running Reddit) as its initial lead engineer, and also picked up some of the 24x7 shift sysadmins from the original NOC. The CORE team is still staffing up, looking for the Site Reliability Engineer skill set, and is the second group of DevOps engineers within Netflix. There is a strong emphasis on building tools to make as much of their process go away as possible; for example, they have no run-books, they develop code instead.

To get themselves out of the loop, the CORE team has built an alert processing gateway. It collects alerts from several different systems, does filtering, has quenching and routing controls (that developers can configure), and automatically routes alerts either to the PagerDuty system (a SaaS application service that manages on-call calendars, escalation and alert life cycles) or to a developer team email address. Every developer is responsible for running what they wrote, and the team members take turns to be on call in the PagerDuty rota. Some teams never seem to get calls, and others are more often on the critical path. During a major production outage conference call, the CORE team never makes changes to production applications; they always call a developer to make the change. The alerts mostly refer to business transaction flows (rather than typical operations oriented Linux level issues) and contain deep links to dashboards and developer-oriented Application Performance Management tools like AppDynamics, which let developers quickly see where the problem is at the Java method level and what to fix.
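
The filter/quench/route logic might look something like the toy sketch below. The team-to-destination rules, alert fields and time window are all invented for illustration; this is not the actual gateway, and the hand-off to the paging service is left abstract.

```python
# Toy sketch of an alert routing gateway: filter, quench, then route each
# alert to a paging service or a team email address. Rules and fields are
# invented for illustration.
from datetime import datetime, timedelta

# Per-team routing rules that developers could edit themselves.
ROUTES = {
    "playback-api":    {"page": True,  "email": "playback-team@example.com"},
    "recommendations": {"page": False, "email": "recs-team@example.com"},
}

QUENCH_WINDOW = timedelta(minutes=15)
_last_sent = {}   # (team, alert_name) -> time the alert was last routed

def route_alert(alert):
    """alert is a dict like {'team': ..., 'name': ..., 'severity': ...}."""
    rule = ROUTES.get(alert["team"])
    if rule is None or alert["severity"] < 2:
        return None                         # filtered out entirely

    key = (alert["team"], alert["name"])
    now = datetime.utcnow()
    if key in _last_sent and now - _last_sent[key] < QUENCH_WINDOW:
        return None                         # quenched: sent recently already
    _last_sent[key] = now

    if rule["page"]:
        return ("pagerduty", alert)         # hand off to the paging service
    return ("email", rule["email"], alert)  # otherwise mail the owning team

# Example: a severity-3 alert from the playback team pages whoever is on call.
print(route_alert({"team": "playback-api", "name": "high_error_rate", "severity": 3}))
```

The point is that the routing rules belong to the developers who own the service, so the CORE team never has to sit in the middle of every page.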

The transition from datacenter to cloud also involved a transition away from Oracle, initially to SimpleDB (which AWS runs) and now to Apache Cassandra, which has its own dedicated team. We moved a few Oracle DBAs over from the ITops team and they have become experts in helping developers figure out how to translate their previous experience with relational schemas into Cassandra keyspaces and column families. We have a few key development engineers who are working on the Cassandra code itself (an open source Java distributed systems toolkit), adding features that we need, tuning performance and testing new versions. We have three key open source projects from this team available on github.com/Netflix. Astyanax is a client library for Java applications to talk to Cassandra, CassJmeter is a JMeter plugin for automated benchmarking and regression testing of Cassandra, and Priam provides automated operation of Cassandra including creating, growing and shrinking Cassandra clusters, and performing full and incremental backups and restores. Priam is also written in Java.

Finally, we have three DevOps engineers maintaining about 55 Cassandra clusters (including many that span the US and Europe), a total of 600 or so instances. They have developed automation for rolling upgrades to new versions, and for sequencing compaction and repair operations. We are still developing our Cassandra tools and skill sets, and are looking for a manager to lead this critical technology, as well as additional engineers. Individual Cassandra clusters are automatically created by Priam, and it's trivial for a developer to create their own cluster of any size without assistance (NoOps again). We have found that first attempts to produce schemas for Cassandra use cases tend to cause problems for engineers who are new to the technology, but with some familiarity and assistance from the Cloud Database Engineering team we are starting to develop better common patterns to work to, and are extending the Astyanax client to avoid common problems.
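
To make the relational-to-Cassandra translation concrete, here is a minimal, hypothetical sketch using the Python pycassa client (our own client, Astyanax, is Java and is not shown here). The keyspace, column family, column names and host names are all placeholders for illustration.

```python
# Hypothetical sketch: what was a relational "users" table becomes a column
# family keyed by user id, with one column per attribute. Uses pycassa for
# illustration only; names, hosts and replication settings are placeholders.
from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# One-time schema setup (on a Priam-managed cluster this would already exist).
sys_mgr = SystemManager('cass-node1:9160')
sys_mgr.create_keyspace('MemberData', SIMPLE_STRATEGY, {'replication_factor': '3'})
sys_mgr.create_column_family('MemberData', 'Users')
sys_mgr.close()

# Application reads and writes go through a connection pool.
pool = ConnectionPool('MemberData', ['cass-node1:9160', 'cass-node2:9160'])
users = ColumnFamily(pool, 'Users')

# Insert and read back a single row (the row key replaces the primary key).
users.insert('user:1234', {'name': 'Alice', 'plan': 'streaming', 'country': 'US'})
print(users.get('user:1234'))
```

The schema-design traps mentioned above mostly come from treating column families like relational tables and expecting joins and ad-hoc queries; the common patterns we are developing push the query shape into the row and column layout instead.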

In summary, Netflix still does Ops to run its datacenter DVD business. We have a small number of DevOps engineers embedded in the development organization who are building and extending automation for our PaaS, and we have hundreds of developers using NoOps to get their code and datastores deployed in our PaaS and to get notified directly when something goes wrong. We have built tooling that removes many of the operations tasks completely from the developer, and which makes the remaining tasks quick and self service. There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done, and less time spent actually doing ops tasks than developers would spend explaining what needed to be done to someone else. I think that's different to the way most DevOps places run, but it's similar to other PaaS environments, so it needs its own name, NoOps. [Update: the DevOps community argues that although it's different, it's really just a more advanced end state for DevOps, so let's just call it PaaS for now, and work on a better definition of DevOps.]

22 comments:

  1. Looks like your comments can't be more than 4,096 chars on this blog. :)

    So here's my comment:
    https://gist.github.com/2140086

    1. Thanks John. I agree with some of what you point out. Netflix in effect over-reacted to a dysfunctional ops organization. I think there are several other organizations who would recognize our situation, and would also find a need to over-react to make the solution stick.

      Your definition of DevOps seems far broader than the descriptions and definitions I can find by googling or looking on Wikipedia. I don't recognize what we do in those definitions - since they are so focused on the relationship between a Dev org and an Ops org, so someone should post an updated definition to Wikipedia or devops.com. Until then maybe I'll just call it NetflOps or botchagalops.

  2. I have a loaded NoOps question for you :) I am very interested in understanding how he-said-she-said issues get resolved in a decentralized environment. For example, I know Netflix uses horizontally scalable REST layers as integration points.
    Suppose one team/application is having an intermittent problem/bug with another team/application. Team 1 opens an issue. Team 2 reads, investigates, and closes the issue as not a problem. Team 1 double checks and reasserts that the issue is with Team 2.

    In a decentralized environment, how is this roadblock cleared?

    As an ops person I spend a good deal of time chasing down problems very external to me. I accept this as an ops person. Since developers are pressed into ops, how much time will a developer spend on another team's reported problems? Will Team 1 forgo their own scrum deadlines this week because Team 2 sucks up all their time reporting bogus problems?

  3. We aren't decentralized. So in the scenario you mention, everyone gets in a room and figures it out, or we just end up with a long email thread if it's less serious. APM tools help pinpoint what is going on at the request level, down to the Java code. Once we have root cause, someone files a Jira to fix the issue. There is a manager rotation for centrally prioritizing and coordinating response to major outages. (I'm on duty this week; it comes up every few months.) We have a few people who have "mad wireshark skills" to debug network layer problems, but that's infrequent and I'm hopeful that boundary.com will come up with better tools in this space.

    We don't follow a rigid methodology or fixed release deadlines, we ship code frequently enough that delays aren't usually a big issue, and we have a culture of responsible adults so we communicate and respect each others needs across teams. The infrequent large coordinating events like a new country launch are dealt with by picking one manager to own the overall big picture and lead the coordination.

    1. My views on culture may not be much help - read http://perfcap.blogspot.com/2011/12/how-netflix-gets-out-of-way-of.html to see my explanation of how Netflix does things differently.

      Culture is very hard to create or modify but easy to destroy. This is because everyone has to buy into it for it to be effective, and then every manager has to hire only people who are compatible with the culture, and also get rid of people who turn out not to fit in, even if they are doing good work.

      So the short answer is start a new company from scratch with the culture you want, and pay a lot of attention to who you hire. I don't think it is possible to do a culture shift if there are more than a roomful of people involved.

  4. Getting a "culture of responsible adults" together is partly down to "culture" itself - although it helps to have mature, sensible individuals, fostering that also means avoiding finger-pointing and blame.

    The more defensive people are made to feel, the more likely they are to start throwing tantrums when under pressure. A culture where you can put your hand up and say: "I got that wrong, how can we put it right?" gets better results in the long term than one where you might be fired or disciplined for a genuine mistake.

    I always wondered how evil geniuses like Ernst Blofeld recruit when getting it wrong means you might end up in the shark tank...

    1. Yes, incident reviews that don't involve blame and finger-pointing are also key. Making the same mistake several times, trying to hide your mistakes, or clear lapses of judgement can't be tolerated though.

  5. This comment has been removed by the author.

  6. Great article Adrian. I have a question. Is a consecutive IP space important? Since AWS EIP doesn't guarantee consecutive addresses, I've wondered if this mattered to app developers.

    Anything that could have been done by subnet is out the window. For example, if you wanted to do port sweeps of your network blocks for an audit, perform penetration testing, or parse logs by IP. I suppose this could be done programmatically, but I was curious about your experiences. Does it matter?

    1. If the network topology matters you can use VPC to manage it. Also if you are a big enough customer to have an enterprise level support contract with AWS and use a lot of EIPs it is possible to get them allocated in contiguous blocks.

  7. Anonymous:

    Some thoughts by me on the topic of NoOps and DevOps and the future of operations and the need for operations: http://imansson.wordpress.com/2012/03/21/35/

    1. I added a comment to your blog. Thanks!

  8. I'm curious about your statement "Notice that we didn't use the typical DevOps tools Puppet or Chef to create builds at runtime. This is largely because the people making decisions are development managers, who have been burned repeatedly by configuration bugs in systems that were supposed to be identical."

    If systems are built from the same Puppet manifests, what kind of configuration bugs can occur? Also, how is the alternative method you chose any less likely to cause the same problems?

    1. Puppet is overkill for what we end up doing. We have a base AMI that is always identical; we install an rpm on it or untar a file once, rebake the AMI, and we are done. That AMI is then turned into running systems using the AWS autoscaler. It's more reliable because there are fewer things to go wrong at boot time, no dependencies on repos or orchestration tools, and we know that the install has completely succeeded the same way on every instance before the instance booted.

  9. Adrian, last time we talked you mentioned that you were not yet 100% transitioned to AWS - it looks like this has now been achieved, with the exception of the core DVD business.
    As many companies are talking about hybrid cloud as a target state, and Netflix went through the transition from private/managed Ops to public-AWS/NoOps, can you talk about the interim state you were in and what development and operations use cases you optimized around during the transition - i.e., what were the hybrid models and principles you followed to move from Ops to NoOps? Did you attempt to keep these teams and tools separate, or did you try to create a transition strategy that allowed you to hedge the bets with AWS and the "all-in" strategy, to possibly come back if needed, etc.?
    What Ops governance and Dev tooling approach did you take in that state, specifically around cloud abstraction layers to ease access management, support, tooling and elasticity needs? Can you shed some light on the thinking and approach you took while you were in the middle of the transition?
    Also, can you comment on how much you govern and drive the development and deployment approach so that you can unify the continuous integration and continuous deployment tools and reduce the chaos in this space?
    Tim Fraser

    1. If you look at the presentations on slideshare.net/adrianco I have discussed in some detail what the transition strategy and tools looked like. We continued to run the old code in the datacenter as we gradually moved functionality away from it. The old DC practices were left intact as we moved developers to the cloud one group at a time.

  10. Hi Adrian,

    It seems to me that rather than NoOps, this is an outsourcing of ops.

    I assume that the issues you had with Ops came from the need to control cost, scale etc? If not, please would you clarify?

    If it was, how has moving to AWS solved your problems? Is it that AWS can scale more readily based on experience to date? Was cost control an issue before, and is it now, in terms of Ops costs?

    Given that NoOps = outsourcing, I can see that managing the relationship with the service provider becomes vital for you?

    Thanks in advance, interesting stuff!
    Jonathan Coles

  11. Dev and admin teams struggle these days with keeping up with agile development. DevOps helps by breaking down some of the walls. But one of the biggest challenges is getting the entire development team involved and not just 1 or 2 people who help do deployments. The entire team needs visibility to the production server environments to help support and troubleshoot applications.

  12. You have explained the Cassandra technology crystal clearly; I learned many new things from this post. Thank you.

  13. I have learned so much from your post Adrian. I would definitely bookmark your site to be updated with your upcoming articles. Great job! So much information.

    woertz

  14. Very informative, although there is some controversy on the use of terminology.

    Interestingly, I found that raising these controversial points made me think more deeply and made the article more interesting to read.

    Great job, really useful to me.

  15. This is a great article when it comes to a single consumer-facing service, where you are not tied to an intermediate customer per se. I am dealing with a very similar situation, but we have customers to deal with in the IoT space, and we cannot afford to deploy and un-deploy code easily. We have to work with the customer's dev team to make sure things work. At the same time we have the same classical ITops bottleneck issues. Any suggestions?

