Monday, March 19, 2012

Ops, DevOps and PaaS (NoOps) at Netflix

There has been a sometimes heated discussion on Twitter about the term NoOps recently, and I've been quoted extensively as saying that NoOps is the way developers work at Netflix. However, there are teams at Netflix that do traditional Operations, and teams that do DevOps as well. To clarify things I need to explain the history and current practices at Netflix in chunks of more than 140 characters at a time.

When I joined Netflix about five years ago, I managed a development team, building parts of the web site. We also had an operations team who ran the systems in the single datacenter that we deployed our code to. The systems were high-end IBM P-series virtualized machines with storage on a virtualized Storage Area Network. The idea was that this was reliable hardware with great operational flexibility, so that developers could assume low failure rates and concentrate on building features. In reality we had the usual complaints about how long it took to get new capacity, the lack of consistency across supposedly identical systems, and failures in Oracle, the SAN and the network that took the site down too often for too long.

At that time we had just launched the streaming service, and it was still an experiment, with little content and no TV device support. As we grew streaming over the next few years, we saw that we needed higher availability and more capacity, so we added a second datacenter. This project took far longer than initially estimated, and it was clear that deploying capacity at the scale and rate we were going to need as streaming took off was a skill set that we didn't have in-house. We tried bringing in new ops managers and new engineers, but they were always overwhelmed by the firefighting needed to keep the current systems running.

Netflix is a developer-oriented culture, from the top down. I sometimes have to remind people that our CEO Reed Hastings was the founder and initial developer of Purify, which anyone developing serious C++ code in the 1990s would have used to find memory leaks and optimize their code. Pure Software merged with Atria and then Rational, which was later swallowed up by IBM; Reed moved on and founded Netflix. Reed hired a team of very strong software engineers who are now the VPs who run developer engineering for our products. When we were deciding what to do next, Reed was directly involved in deciding that we should move to the cloud, and even pushed us to build an aggressively cloud-optimized architecture based on NoSQL. Part of that decision was to outsource the problems of running large scale infrastructure and building new datacenters to AWS. AWS has far more resources to commit to getting cloud to work and scale, and to building huge datacenters. We could leverage this rather than try to duplicate it at a far smaller scale, with greater certainty of success. So the budget and responsibility for managing AWS and figuring out cloud was given directly to the developer organization, and the ITops organization was left to run its datacenters. In addition, the goal was to keep datacenter capacity flat while growing the business rapidly by leveraging additional capacity on AWS.

Over the next three years, most of the ITops staff have left and been replaced by a smaller team. Netflix has never had a CIO, but we now have an excellent VP of ITops, Mike Kail (@mdkail), who now runs the datacenters. These still support the DVD shipping functions of Netflix USA, and he also runs corporate IT, which is increasingly moving to SaaS applications like Workday. Mike runs a fairly conventional ops team and is usually hiring, so there are sysadmin, database, storage and network admin positions. The datacenter footprint hasn't increased since 2009, although there have been technology updates, and the overall size is order-of-magnitude a thousand systems.

As the developer organization started to figure out cloud technologies and build a platform to support running Netflix on AWS, we transferred a few ITops staff into a developer team that formed the core of our DevOps function. They build the Linux-based base AMI (Amazon Machine Image), and after a long discussion we decided to leverage developer-oriented tools such as Perforce for version control, Ivy for dependencies, Jenkins to automate the build process and Artifactory as the binary repository, and to construct a "bakery" that produces complete AMIs containing all the code for a service. Along with AWS Auto Scaling groups this ensured that every instance of a service would be totally identical. Notice that we didn't use the typical DevOps tools Puppet or Chef to create builds at runtime. This is largely because the people making the decisions are development managers, who have been burned repeatedly by configuration bugs in systems that were supposed to be identical.
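To make the bakery idea concrete, here is a minimal sketch of the final bake step using the AWS SDK for Java: it snapshots a temporary instance, already launched from the base AMI and loaded with the service's build artifacts, into a new AMI. The instance id, image name, credentials and class name are placeholders for illustration, not our actual tooling.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.CreateImageRequest;
import com.amazonaws.services.ec2.model.CreateImageResult;

// Sketch of a bakery's final step: snapshot a fully configured instance
// into an AMI. All identifiers below are illustrative placeholders.
public class BakerySketch {
    public static void main(String[] args) {
        AmazonEC2 ec2 = new AmazonEC2Client(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // The instance was launched from the base AMI, then the service's
        // build artifacts (dependencies resolved by Ivy, built by Jenkins,
        // fetched from Artifactory) were installed onto it by the bakery.
        CreateImageRequest bake = new CreateImageRequest()
                .withInstanceId("i-12345678")
                .withName("myservice-1.0.42")
                .withDescription("base AMI plus all code for one service");

        CreateImageResult result = ec2.createImage(bake);

        // An Auto Scaling launch configuration then points at this AMI id,
        // so every instance in the group boots bit-for-bit identical.
        System.out.println("Baked AMI: " + result.getImageId());
    }
}
```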

By 2012 the cloud capacity had grown to be order-of-magnitude 10,000 instances, ten times the capacity of the datacenter, running in nine AWS Availability Zones (effectively separate datacenters) on the US East and West coasts and in Europe. A handful of DevOps engineers working for Carl Quinn (@cquinn - well known from the Java Posse podcast) are coding and running the build tools and bakery, and updating the base AMI from time to time. Several hundred development engineers use these tools to build code, run it in a test account in AWS, then deploy it to production themselves. They never have to have a meeting with ITops, or file a ticket asking someone from ITops to make a change to a production system, or request extra capacity in advance. They use a web-based portal to deploy hundreds of new instances running their new code alongside the old code, then put one "canary" instance into traffic. If it looks good, the developer flips all the traffic to the new code. If there are any problems they flip the traffic back to the previous version (in seconds), and if it's all running fine, some time later the old instances are automatically removed. This is part of what we call NoOps.

The developers used to spend hours a week in meetings with Ops discussing what they needed, figuring out capacity forecasts and writing tickets to request changes for the datacenter. Now they spend seconds doing it themselves in the cloud. Code pushes to the datacenter are rigidly scheduled every two weeks, with emergency pushes in between to fix bugs. Pushes to the cloud are as frequent as each team of developers needs them to be; incremental agile updates several times a week are common, and some teams are working towards several updates a day. Other teams and more mature services update every few weeks or months. There is no central control; the teams are responsible for figuring out their own dependencies and managing the AWS security groups that restrict who can talk to whom.
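As a rough illustration of the canary push described above: our portal automates the whole flow, but the simplified stand-in below shows the same two steps directly against the AWS Elastic Load Balancing API. The class name, method names on our side, and the health decision are illustrative assumptions, not our actual deployment code.

```java
import java.util.List;
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancing;
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClient;
import com.amazonaws.services.elasticloadbalancing.model.DeregisterInstancesFromLoadBalancerRequest;
import com.amazonaws.services.elasticloadbalancing.model.Instance;
import com.amazonaws.services.elasticloadbalancing.model.RegisterInstancesWithLoadBalancerRequest;

// Simplified sketch of a canary push and traffic flip; not our real portal.
public class CanaryFlipSketch {
    private final AmazonElasticLoadBalancing elb = new AmazonElasticLoadBalancingClient();

    /** Step 1: put a single instance of the new version into live traffic. */
    public void launchCanary(String elbName, String canaryInstanceId) {
        elb.registerInstancesWithLoadBalancer(
            new RegisterInstancesWithLoadBalancerRequest()
                .withLoadBalancerName(elbName)
                .withInstances(new Instance(canaryInstanceId)));
    }

    /** Step 2: if the canary looks good, flip all traffic; otherwise roll back. */
    public void flip(String elbName, List<Instance> oldInstances,
                     List<Instance> newInstances, boolean canaryHealthy) {
        if (canaryHealthy) {
            // Promote: route traffic to every new instance, then drain the old ones.
            elb.registerInstancesWithLoadBalancer(
                new RegisterInstancesWithLoadBalancerRequest()
                    .withLoadBalancerName(elbName).withInstances(newInstances));
            elb.deregisterInstancesFromLoadBalancer(
                new DeregisterInstancesFromLoadBalancerRequest()
                    .withLoadBalancerName(elbName).withInstances(oldInstances));
        } else {
            // Roll back in seconds: pull the new instances out of traffic;
            // the old instances never stopped serving.
            elb.deregisterInstancesFromLoadBalancer(
                new DeregisterInstancesFromLoadBalancerRequest()
                    .withLoadBalancerName(elbName).withInstances(newInstances));
        }
    }
}
```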

Automated deployment is part of the normal process of running in the cloud. The other big issue is what happens if something breaks. Netflix ITops always ran a Network Operations Center (NOC), which was staffed 24x7 with system administrators. They were familiar with the datacenter systems, but had no experience with cloud. If there was a problem, they would start and run a conference call, and get the right people on the call to diagnose and fix the issue. As the Netflix web site and streaming functionality moved to the cloud it became clear that we needed a cloud operations reliability engineering (CORE) team, and that it would be part of the development organization. The CORE team was lucky enough to get Jeremy Edberg (@jedberg - well known from running Reddit) as its initial lead engineer, and also picked up some of the 24x7 shift sysadmins from the original NOC. The CORE team is still staffing up, looking for people with a Site Reliability Engineer skill set, and is the second group of DevOps engineers within Netflix. There is a strong emphasis on building tools to make as much of their process as possible go away; for example, they have no run-books, they develop code instead.

To get themselves out of the loop, the CORE team has built an alert processing gateway. It collects alerts from several different systems, does filtering, has quenching and routing controls (that developers can configure), and automatically routes alerts either to the PagerDuty system (a SaaS application service that manages on-call calendars, escalation and alert life cycles) or to a developer team email address. Every developer is responsible for running what they wrote, and the team members take turns to be on call in the PagerDuty rota. Some teams never seem to get calls, and others are more often on the critical path. During a major production outage conference call, the CORE team never makes changes to production applications; they always call a developer to make the change. The alerts mostly refer to business transaction flows (rather than typical operations-oriented Linux level issues) and contain deep links to dashboards and developer-oriented Application Performance Management tools like AppDynamics, which let developers quickly see where the problem is at the Java method level and what to fix.
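The essence of such a gateway is just three steps: filter, quench, route. Here is a minimal sketch of that logic; the class names, the fixed quench window and the delivery interfaces are all hypothetical simplifications, not our actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of an alert routing gateway: filter, quench, then route.
// All names and policies here are hypothetical illustrations.
public class AlertGateway {

    /** Hypothetical delivery back-ends: a pager service and team email. */
    public interface Pager { void trigger(String serviceKey, String message); }
    public interface Mailer { void send(String address, String message); }

    /** A per-service routing rule that a developer team configures itself. */
    public static class Route {
        final boolean page;        // true -> page the on-call developer, false -> email the team
        final String destination;  // pager service key or team email address
        public Route(boolean page, String destination) {
            this.page = page;
            this.destination = destination;
        }
    }

    private final Map<String, Route> routes = new ConcurrentHashMap<String, Route>();
    private final Map<String, Long> lastSent = new ConcurrentHashMap<String, Long>();
    private static final long QUENCH_WINDOW_MS = 10 * 60 * 1000; // suppress repeats for 10 minutes

    private final Pager pager;
    private final Mailer mailer;

    public AlertGateway(Pager pager, Mailer mailer) {
        this.pager = pager;
        this.mailer = mailer;
    }

    /** Developers register and update their own routes; no ops ticket required. */
    public void configure(String service, Route route) {
        routes.put(service, route);
    }

    /** Called for every alert arriving from the upstream monitoring systems. */
    public void onAlert(String service, String message) {
        Route route = routes.get(service);
        if (route == null) {
            return; // filtering: no configured route, drop the alert
        }
        long now = System.currentTimeMillis();
        Long last = lastSent.get(service);
        if (last != null && now - last < QUENCH_WINDOW_MS) {
            return; // quenching: suppress duplicate alerts inside the window
        }
        lastSent.put(service, now);
        if (route.page) {
            pager.trigger(route.destination, message); // wakes up the on-call developer
        } else {
            mailer.send(route.destination, message);   // lower priority: team mailbox
        }
    }
}
```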

The transition from datacenter to cloud also drove a transition from Oracle, initially to SimpleDB (which AWS runs) and now to Apache Cassandra, which has its own dedicated team. We moved a few Oracle DBAs over from the ITops team and they have become experts in helping developers figure out how to translate their previous experience with relational schemas into Cassandra keyspaces and column families. We have a few key development engineers who are working on the Cassandra code itself (an open source distributed data store written in Java), adding features that we need, tuning performance and testing new versions. We have three key open source projects from this team available on github.com/Netflix. Astyanax is a client library for Java applications to talk to Cassandra, CassJmeter is a JMeter plugin for automated benchmarking and regression testing of Cassandra, and Priam provides automated operation of Cassandra, including creating, growing and shrinking Cassandra clusters, and performing full and incremental backups and restores. Priam is also written in Java.

Finally, we have three DevOps engineers maintaining about 55 Cassandra clusters (including many that span the US and Europe), a total of 600 or so instances. They have developed automation for rolling upgrades to new versions, and for sequencing compaction and repair operations. We are still developing our Cassandra tools and skill sets, and are looking for a manager to lead this critical technology, as well as additional engineers. Individual Cassandra clusters are automatically created by Priam, and it's trivial for a developer to create their own cluster of any size without assistance (NoOps again). We have found that first attempts to produce schemas for Cassandra use cases tend to cause problems for engineers who are new to the technology, but with some familiarity and assistance from the Cloud Database Engineering team, we are starting to develop better common patterns to work to, and are extending the Astyanax client to avoid common problems.
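For a flavor of what Astyanax looks like to a developer, here is a minimal write example along the lines of the project's getting-started documentation. The cluster, keyspace, column family and column names are placeholders, and the exact builder API may differ between Astyanax versions.

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

// Minimal Astyanax write: connect to a keyspace, then batch one column write.
public class AstyanaxSketch {
    public static void main(String[] args) throws Exception {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("TestCluster")                 // placeholder cluster name
            .forKeyspace("TestKeyspace")               // placeholder keyspace
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                .setPort(9160)
                .setMaxConnsPerHost(1)
                .setSeeds("127.0.0.1:9160"))           // placeholder seed node
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getEntity();

        // Column family with String row keys and String column names.
        ColumnFamily<String, String> cf = new ColumnFamily<String, String>(
            "Standard1", StringSerializer.get(), StringSerializer.get());

        // Batch a single column write; null means no TTL on the column.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(cf, "user-1234").putColumn("email", "user@example.com", null);
        m.execute();
    }
}
```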

In summary, Netflix still does Ops to run its datacenter DVD business. We have a small number of DevOps engineers embedded in the development organization who are building and extending automation for our PaaS, and we have hundreds of developers using NoOps to get their code and datastores deployed in our PaaS and to get notified directly when something goes wrong. We have built tooling that removes many of the operations tasks completely from the developer, and which makes the remaining tasks quick and self service. There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done, and less time spent actually doing ops tasks than developers would spend explaining what needed to be done to someone else. I think that's different to the way most DevOps places run, but it's similar to other PaaS environments, so it needs its own name, NoOps. [Update: the DevOps community argues that although it's different, it's really just a more advanced end state for DevOps, so let's just call it PaaS for now, and work on a better definition of DevOps.]

Friday, March 16, 2012

Cloud Architecture Tutorial

I presented a whole-day tutorial at QCon London on March 5th, and presented subsets at an AWS Meetup, a Big Data / Cassandra Meetup and a Java Meetup the same week. I updated the slides and split them into three sections, which are hosted on http://slideshare.net/adrianco along with all my other presentations. You can find many more related slide decks at http://slideshare.net/Netflix and other information at the Netflix Tech Blog.

The first section tells the story of why Netflix migrated to cloud, how we think about choosing AWS as our cloud supplier and what features of the Netflix site were moved to the cloud over the last three years.



The second section is a detailed explanation of the globally distributed, Java-based Platform as a Service (PaaS) we built, the open source components that we depend on, and the open source projects that we have started to share at http://netflix.github.com.



The final section talks about how we run these PaaS services in the cloud, and includes details of our million writes per second scalability benchmark.



If you would like to see these slides presented in person, I'm teaching a half-day cloud architecture tutorial at Gluecon in Broomfield, Colorado on May 23rd-24th. I hope to see you there...