Started in 2004. Covers anything I find interesting, like clouds, cars and strange complex music. Views expressed are my own and not those of my employers (currently AWS). See also @adrianco
Monday, December 27, 2010
Solar Power - More panels and Nissan Leaf
Wednesday, November 17, 2010
NoSQL Netflix Use Case Comparison for Translattice
These answers are for the database component of the Translattice Application Platform (TAP). Unlike other stores that have answered this question set, TAP contains a relational database that scales out over identical nodes. TAP further allows applications written to the J2EE platform to scale out across the same collection of nodes.
Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.
Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
In the Translattice Application Platform, data in relational tables is transparently sharded behind the scenes by the data store. These shards are stored redundantly across the nodes. Reads are satisfied with the most local copy of data available on the network, unless that resource is currently overloaded, in which case the system may fall back to reads from more distant locations.

When it comes to writes, applications have the choice on the durability and isolation levels for changes. Each transaction may be made in a fully synchronous, serializable isolation level, or may be made in a locked eventually consistent mode that provides ACID serializable semantics except that durability may be sacrificed if an availability zone fails. A further asynchronous mode allows potentially conflicting changes to be made and allows a user-provided reconciliation function to decide which change "wins". A final commit requires a majority of nodes storing a shard to be available; in the case of the fully synchronous mode this would delay or prevent the return of success if a critical subset of the cluster fails.

Policy mechanisms in the system allow administrators to specify how physical and cloud database instances correspond to administratively-relevant zones. An administrator can choose to require, for instance, that each piece of information is replicated to at least three database nodes across a total of two availability zones. An administrator may also use these mechanisms to require that particular tables or portions of tables must or must not be stored in a given zone (for instance, to meet compliance or security requirements). Within the constraints set by policy, the system tracks usage patterns and places information in the most efficient locations.
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Assuming that the SQL transaction in question is running in the fully synchronous or eventually consistent locked mode, writes will only be allowed in one of the two zones. Reads will continue in both zones, but will only be able to satisfy requests for which at least one replica of the requested data exists in the local zone (policy can be specified to ensure that this is always the case). In the eventually consistent mode, multiple partitioned portions of the system can accept writes and reconcile later. Essentially, any of the above desired modes can be used on a transaction-by-transaction basis depending on application and performance requirements.
Because fully relational primitives are provided, there can easily be one row in the database per favorite. Read-modify-write of the whole list is not required, and the only practical limits are application-defined.

Any SQL queries are supported against the store, and are transformed by the query planner into an efficient plan to execute the query across the distributed system. Of course, how efficient a query is to execute will depend on the structure of the data and the indexes that an administrator has created. We think this allows for considerable flexibility and business agility as the exact access methods that will be used on the data do not need to be fully determined in advance.
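To make the one-row-per-favorite relational model concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in; the table and column names are hypothetical, and TAP's actual SQL dialect, sharding and replication behavior are not represented.

```python
import sqlite3

# Stand-in schema: one row per (user, movie) favorite, as described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE favorites (
        user_id  TEXT NOT NULL,
        movie_id TEXT NOT NULL,
        PRIMARY KEY (user_id, movie_id)
    )
""")

# Appending a movie is a single INSERT; no read-modify-write of a
# serialized list is needed.
conn.execute(
    "INSERT OR IGNORE INTO favorites (user_id, movie_id) VALUES (?, ?)",
    ("user123", "movie456"),
)
conn.commit()

# Reading the whole list back is an ordinary query that the planner can
# satisfy however it likes.
rows = conn.execute(
    "SELECT movie_id FROM favorites WHERE user_id = ?", ("user123",)
).fetchall()
print([movie_id for (movie_id,) in rows])
```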
Network activity is protected by cryptographic hash authentication, which provides integrity verification as a side benefit. Distributed transactions also take place through a global consensus protocol that uses hashes to ensure that checkpoints are in a consistent state (this is also how the system maintains transactional integrity and consistency when changes cross many shards). Significant portions of the on-disk data are also presently protected by checksums and allow the database to "fail" a disk if corrupt data is read.
A good portion of this relational database's consistency model is implemented through a distributed multi-version concurrency control (MVCC) system. Tuples that are in-use are preserved as the database autovacuum process will not remove tuples until it is guaranteed that no one could be looking at them anymore. This allows a consistent version of the tables as of a point in time to be viewed from within a transaction; so BEGIN TRANSACTION; SELECT ... [or COPY FROM, to backup] ; COMMIT; works. We provide mechanisms to allow database dumps to occur via this type of mechanism.
In the future we are likely to use this mechanism to allow quick snapshots of the entire database and rollbacks to previous snapshot versions (as well as to allow the use of snapshots to stage development versions of application code without affecting production state).
Tuesday, November 09, 2010
NoSQL Netflix Use Case Comparison for Riak
Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.
Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
There are two possibilities with Riak. The first would be to spread a single Riak cluster across all three zones, for example one node in each of three zones. In this case, a single replica of each item would exist in each zone. Whether or not a response needed to wait on cross-zone traffic to complete would depend on the consistency level in the individual request. The second option would require Riak EnterpriseDS, and involves placing a complete cluster in each zone and configuring them to perform inter-cluster replication. This has multiple advantages. Every request would be satisfied entirely locally, and would be independent of latency or availability characteristics across zone boundaries. Another benefit is that (unlike either the first scenario or some other solutions that spread clusters and quorums over a long haul) read requests would not generate any cross-zone traffic at all. For an application with a high percentage of reads, this can make a large difference.
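As a rough illustration of the per-request consistency levels in the first option, here is a sketch using the Python Riak client; the bucket and key names are made up, and the exact constructor and keyword arguments vary between client versions, so treat this as indicative rather than definitive.

```python
import riak

# Connect to a (local) Riak node; connection defaults vary by client version.
client = riak.RiakClient()
bucket = client.bucket("favorites")

# Write the favorites list and wait for two replicas to acknowledge (w=2),
# which may involve another zone in the one-cluster-across-zones layout.
obj = bucket.new("user123", data=["movie456"])
obj.store(w=2)

# Read it back from a single replica (r=1), which can be satisfied by the
# nearest copy without waiting on cross-zone traffic.
fetched = bucket.get("user123", r=1)
print(fetched.data)
```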
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Write availability is a central design goal of Riak, so the fourth option will be the observed behavior. This is the case regardless of the strategy chosen for Question 1. In the first strategy, local nodes other than the canonical homes for given data will accept the writes instead, using the hinted-handoff technique. In the second strategy, the local cluster will accept the write, and those changes will be replayed across the replication link when the zones are reconnected. In all cases, vector clocks provide a clean way of resolving most inconsistency, and various reconciliation models are available to the user for those cases which cannot be syntactically resolved.
For more information on vector clocks in Riak, see:
http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/
and
http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/
Riak will use vector clocks to recognize causality in race conditions. In the case of two overlapping writes to the same value, Riak will retain both unless explicitly requested to simply overwrite with the last value received. If one client changes A to B and another changes A to C, then (unless told to overwrite) Riak will return both B and C to the client. When that client then modifies the object again, the single descendant "D" that they created will be the new value. For applications such as sets which are mostly added to and rarely deleted from, the application code to perform this reconciliation is trivial, and in some cases is simply a set union operation. In terms of vector clock ancestry, it looks a bit like this:
http://dl.dropbox.com/u/751099/ndiag1.png
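For the favorites use case, a reconciliation function along these lines (a plain Python sketch, with hypothetical names) is all the application code that is needed: it merges whatever sibling values Riak returns by set union, and the merged result is written back as the single descendant.

```python
def resolve_favorites(sibling_values):
    """Merge divergent favorites lists returned as Riak siblings.

    Each sibling value is assumed to be a list of movie IDs; for a
    mostly-add set, a union of the siblings is the correct merge.
    """
    merged = set()
    for movie_ids in sibling_values:
        merged.update(movie_ids)
    return sorted(merged)

# One writer turned A into B (added movie_b), another turned A into C
# (added movie_c); the merged descendant D contains both.
siblings = [["movie_a", "movie_b"], ["movie_a", "movie_c"]]
print(resolve_favorites(siblings))  # ['movie_a', 'movie_b', 'movie_c']
```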
Riak allows values to be of any arbitrary content type, but if the content is in JSON then a JavaScript map/reduce request can be used to query by attribute/value.
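As a sketch of that kind of query, here is a JavaScript map function submitted to Riak's HTTP map/reduce endpoint from Python; the bucket name, attribute, port and document shape are all assumptions for illustration.

```python
import json
import urllib.request

# Map function that keeps only JSON values whose "genre" attribute matches.
js_map = """
function (v) {
  var doc = JSON.parse(v.values[0].data);
  return doc.genre === "comedy" ? [[v.key, doc]] : [];
}
"""

body = json.dumps({
    "inputs": "favorites",  # run over the whole (hypothetical) bucket
    "query": [{"map": {"language": "javascript", "source": js_map}}],
})

req = urllib.request.Request(
    "http://localhost:8098/mapred",
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```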
Many layers of Riak perform consistency checking, including CRC checking in the persistence engine and object equality in the distributed state machines handling requests. In most cases where corruption can be detected in a given replica of some item, that replica will immediately but asynchronously be fixed via read-repair.
There are two approaches to back up Riak systems: per-node or whole-cluster. Backing up per-node is the easiest option for many people, and is quite simple. Because Bitcask (the default storage engine) performs writes in an append-only fashion and never re-opens any file for writing once closed, Riak nodes can easily be backed up via the filesystem backup method of your choice. Simply replacing the content of the data directory will reset a node's stored content to what it held at the time of the backup. Alternately, a command line backup command is available which will write out a backup of all data on the cluster. This is fairly network and disk intensive and requires somewhere to put a whole-cluster backup, but is very useful for prototyping situations which are not holding enormous amounts of data.
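A per-node backup can therefore be as simple as copying the Bitcask data directory; here is a minimal Python sketch, with the paths made up for illustration (rsync, tar or any other filesystem tool works just as well).

```python
import shutil
import time

# Hypothetical locations of a node's Bitcask data and the backup target.
DATA_DIR = "/var/lib/riak/bitcask"
backup_dir = "/backups/riak-bitcask-" + time.strftime("%Y%m%d-%H%M%S")

# Because Bitcask files are append-only and never rewritten once closed,
# a straight filesystem copy of the directory is a usable backup.
shutil.copytree(DATA_DIR, backup_dir)
print("copied %s to %s" % (DATA_DIR, backup_dir))
```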
Monday, November 01, 2010
Are we ready for spotcloud yet?
"the evolution of a marketplace goes from competing on the basis of technology, to competing on service, to competing as a utility, to competing for free. In each step of the evolution, competitors shake out over time and a dominant brand emerges.
To use this as a maturity model, take a market and figure out whether the primary competition is on the basis of technology, service, utility or search"
Sunday, October 31, 2010
NoSQL Netflix Use Case Comparison for MongoDB
Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.
Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
Let's assume for discussion purposes that we are using MongoDB across three availability zones in a region. We would have a replica set member in each of the three zones. One member will be elected primary at a given point in time.
All writes will be sent to the primary, and then propagate to secondaries from there. Thus, writes are often inter-zone. However, availability zones are fairly low latency (I assume the context here is EC2).
Reads can be either to the primary, if immediate/strong consistency semantics are desired, or to the local zone member, if eventually consistent read semantics are acceptable.
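A sketch of that read routing with PyMongo is shown below; the host names, replica set name and collection are made up, and the read-preference API shown is from more recent driver versions (drivers of that era used slave_okay instead).

```python
from pymongo import MongoClient, ReadPreference

# Connect to a (hypothetical) replica set spread across three zones.
client = MongoClient("mongodb://node-a,node-b,node-c/?replicaSet=favorites_rs")
db = client["netflix"]

# Strongly consistent read: route to the primary (the default behavior).
primary_favorites = db.get_collection(
    "favorites", read_preference=ReadPreference.PRIMARY)

# Eventually consistent read: allow the nearest member, which keeps the
# read inside the local availability zone when possible.
nearest_favorites = db.get_collection(
    "favorites", read_preference=ReadPreference.NEAREST)

print(nearest_favorites.find_one({"_id": "user123"}))
```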
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Let's assume again that we are using three zones - we could use two, but three is more interesting. To be primary in a replica set, the primary must be visible to a majority of the members of the set: in this case, two thirds of the members, or two thirds of the zones. If one zone is partitioned from the other two, a member on the two-zone side of the partition will become primary, if one is not already. It will be available for reads and writes.
The minority partition will not service writes. Eventually consistent reads are still possible in the minority partition.
Once the partition heals, the servers automatically reconcile.
http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions?
MongoDB supports atomic operations on single documents, both via its $ operators ($set, $inc) and via compare-and-swap operations. In MongoDB one could model the list as a document per favorite, or put all the favorites in a single BSON object. In both cases atomic operations free of race conditions are possible. http://www.mongodb.org/display/DOCS/Atomic+Operations
This is why MongoDB elects a primary node: to facilitate these atomic operations for use cases where these semantics are required.
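For the favorites list, the document-per-user model with an atomic operator avoids read-modify-write entirely; below is a PyMongo sketch using the $addToSet operator (a standard MongoDB operator, though not named in the answer above) plus a compare-and-swap style update. Database, collection and field names are made up.

```python
from pymongo import MongoClient, ReturnDocument

client = MongoClient()
favorites = client["netflix"]["favorites"]

# Append a movie atomically; $addToSet also ignores duplicates, so no
# prior read of the list is required.
favorites.update_one(
    {"_id": "user123"},
    {"$addToSet": {"movies": "movie456"}},
    upsert=True,
)

# Compare-and-swap style update: only replace the list if it still matches
# the value we read earlier, so a concurrent change makes this a no-op.
previous = ["movie456"]
doc = favorites.find_one_and_update(
    {"_id": "user123", "movies": previous},
    {"$set": {"movies": previous + ["movie789"]}},
    return_document=ReturnDocument.AFTER,
)
print(doc)
```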
If multiple attribute/values are supported for a key, can an additional value be written directly without reading first?
Yes.
What limits exist on the size of the value or number of attribute/values?
A single BSON document must be under the limit -- currently that limit is 8MB. If larger than this, one should consider modeling as multiple documents during schema design.
Are queries by attribute/value supported?
Yes. For performance, MongoDB supports secondary (composite) indices.
The general assumption is that the storage system is reliable. Thus, one would normally use a RAID with mirroring, or a service like EBS which has intrinsic mirroring.
However, the BSON format has a reasonable amount of structure to it. It is highly probable, although not certain, that a corrupt object would be detected and an error reported. This could then be corrected with a database repair operation.
Note: the above assumes an actual storage system fault. Another case of interest is simply a hard crash of the server. MongoDB 1.6 requires a --repair after this. MongoDB v1.8 (pending) is crash-safe in its storage engine via journaling.
The most used method is to have a replica which is used for backups only; perhaps an inexpensive server or VM. This node can be taken offline at any time and any backup strategy used. Once re-enabled, it will catch back up. http://www.mongodb.org/display/DOCS/Backups
With something like EBS, quick snapshotting is possible using the fsync-and-lock command.
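The fsync-and-lock pattern looks roughly like this in PyMongo; the snapshot step itself (an EBS or filesystem snapshot) happens outside MongoDB, and the fsyncUnlock command shown is the form used by more recent server versions, so treat this as a sketch.

```python
from pymongo import MongoClient

client = MongoClient()

# Flush writes to disk and block further writes for the duration of the
# snapshot.
client.admin.command("fsync", lock=True)
try:
    pass  # ... trigger the EBS / filesystem snapshot here ...
finally:
    # Release the write lock once the snapshot has completed.
    client.admin.command("fsyncUnlock")
```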
One can stop the server(s), restore the old data file images, and restart.
MongoDB supports a slaveDelay option which allows one to force a replica to stay a certain number of hours behind realtime. This is a good way to maintain a rolling backup in case of someone "fat-fingering" a database operation.
Friday, October 29, 2010
NoSQL Netflix Use Case Comparison for Cassandra
Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.
Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
The most natural way to model per-user favorites in Cassandra is to have one row per user, keyed by the userid, whose column names are movie IDs. The combination of allowing dynamic column creation within a row and allowing very large rows (up to 2 billion columns in 0.7) means that you can treat a row as a list or map, which is a natural fit here. Performance will be excellent since columns can be added or modified without needing to read the row first. (This is one reason why thinking of Cassandra as a key/value store, even before we added secondary indexes, was not really correct.)
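A sketch of that row-per-user, column-per-movie model with the era-appropriate pycassa client is shown below; the keyspace and column family names are made up, and exact client APIs vary by version.

```python
import pycassa

# Connect to a (hypothetical) keyspace and the Favorites column family.
pool = pycassa.ConnectionPool("Netflix", ["localhost:9160"])
favorites = pycassa.ColumnFamily(pool, "Favorites")

# Adding a favorite is a blind write of one column named after the movie;
# no read of the row is needed first.
favorites.insert("user123", {"movie456": ""})

# Reading the row back returns the columns (movie IDs) in sorted order.
print(list(favorites.get("user123").keys()))
```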
The best introduction to Cassandra data modeling is Max Grinev's series on basics, translating SQL concepts, and idempotence.
Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
Briefly, both reads and writes have a ConsistencyLevel parameter controlling how many replicas across how many zones must reply for the request to succeed. Routing is aware of current response times as well as network topology, so given an appropriate ConsistencyLevel, reads can be routed around temporarily slow nodes.
On writes, the coordinator node (the one the client sent the request to) will send the write to all replicas; as soon as enough success messages come back to satisfy the desired consistency level, the coordinator will report success to the client.
For more on consistency levels, see Ben Black's excellent presentation.
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Cassandra has no 'master copy' for any piece of data; all copies are equal. The other behaviors are supported by different ConsistencyLevel values for reads (R) and writes (W):
- R=QUORUM, W=QUORUM: one zone decides that it is still working for reads and writes, and the other zone decides it is offline
- R=ONE, W=ALL: both zones continue to satisfy reads, but refuse writes
- R=ONE, W=ONE: both zones continue to accept writes, and reconcile any inconsistencies when the partition heals
I would also note that reconciliation is timestamp-based at the column level, meaning that updates to different columns within a row will never conflict, but when writes have been allowed in two partitions to the same column, the highest timestamp will win. (This is another way Cassandra differs from key/value stores, which need more complex logic called vector clocks to be able to merge updates to different logical components of a value.)
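With pycassa, those R and W values are chosen per request, roughly as in this sketch; the keyspace and column family names are made up, and the ConsistencyLevel import path differs between client versions.

```python
import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool("Netflix", ["localhost:9160"])
favorites = pycassa.ColumnFamily(pool, "Favorites")

# W=QUORUM: the write only succeeds once a majority of replicas acknowledge.
favorites.insert("user123", {"movie456": ""},
                 write_consistency_level=ConsistencyLevel.QUORUM)

# R=ONE: the read is answered by a single (typically the closest) replica.
row = favorites.get("user123",
                    read_consistency_level=ConsistencyLevel.ONE)
print(list(row.keys()))
```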
Cassandra's ColumnFamily model generally obviates the need for a read before a write, e.g., as above using movie IDs as column names. (If you wanted to allow duplicates in the list for some reason, you would generally use a UUID as the column name on insert instead of the movie ID.)

The maximum value size is 2GB, although in practice we recommend using 8MB as a more practical maximum. Splitting a larger blob up across multiple columns is straightforward given the dynamic ColumnFamily design. The maximum row size is 2 billion columns. Queries by attribute value are supported with secondary indexes in 0.7.
Cassandra handles repairing corruption the same way it does other data inconsistencies, with read repair and anti-entropy repair.
Because Cassandra's data files are immutable once written, creating a point-in-time snapshot is as simple as hard-linking the current set of sstables on the filesystem. Performance impact is negligible since hard links are so lightweight. Rolling back simply consists of moving a set of snapshotted files into the live data directory. The snapshot is as consistent as your ConsistencyLevel makes it: any write visible to readers at a given ConsistencyLevel before the snapshot will be readable from the snapshot after restore. The only scalability problem with snapshot management is that past a few TB, it becomes impractical to try to manage snapshots centrally; most companies leave them distributed across the nodes that created them.
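Conceptually, a snapshot is nothing more than a directory of hard links to the current sstable files, as in this Python sketch; in practice Cassandra's own nodetool snapshot does this for you, and the paths here are made up.

```python
import os

# Hypothetical data directory for the Favorites column family and a
# snapshot directory alongside it.
DATA_DIR = "/var/lib/cassandra/data/Netflix/Favorites"
SNAP_DIR = os.path.join(DATA_DIR, "snapshots", "manual-snapshot")
os.makedirs(SNAP_DIR, exist_ok=True)

# Hard-link every sstable component file; because sstables are immutable,
# the links form a consistent, nearly free point-in-time copy.
for name in os.listdir(DATA_DIR):
    src = os.path.join(DATA_DIR, name)
    if os.path.isfile(src):
        os.link(src, os.path.join(SNAP_DIR, name))
```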
Wednesday, October 27, 2010
Comparing NoSQL Availability Models
Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.
Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Sunday, October 10, 2010
Netflix in the Cloud
The meetup is the "beta test" of the presentation. It's in Mountain View at "Hacker Dojo", and at the time of writing 437 people have signed up to attend. If everyone turns up it's going to be crazy and overflowing trying to park and get in, so get there early.... I will focus the meetup talk more on the operational aspects of the cloud architecture, and migration techniques.
At QConSF, the presentation is in the "Architectures you've always wondered about" track, and I will spend more time talking about the software architecture.
Why give these talks? We aren't trying to sell cloud to CIOs; it's all about hiring, and we are talking at engineer-focused events in the Bay Area. Netflix is growing fast, pathfinding new technologies in the public cloud to support a very agile business model, and is trying to attract the best and brightest people. We use LinkedIn a lot when we are looking to fill positions, so feel free to connect to me if you think this could be interesting, or follow @adrianco on Twitter.
Wednesday, October 06, 2010
Time for some climate action on 10/10/10
Even if all you do is buy and start reading a copy of James Hansen's book Storms of my Grandchildren (you can get it instantly from the Kindle store) it will help.
Here is a useful discussion of "The Value of Coherence in Science", which is how you can tell what makes sense from what must be wrong. I'm a Physics graduate, and I was taught that science works by making predictions and testing them, and by eliminating incoherent propositions, i.e. if an argument contradicts itself it must be false. The denialist counter-arguments fail this test. To see why the denialist arguments are getting any air time at all, read Merchants of Doubt and Climate Cover Up.
For an example of incoherent and shoddy denialist work John Mashey has systematically deconstructed the Wegman Report which was presented to Congress as an impartial study when it was nothing of the sort.
Sunday, October 03, 2010
More solar power
The goal for next year is to only use propane to run the backup generator (usually for a few days a year), and to make a dent in our petrol usage for the daily commute. We are on the Nissan Leaf list, but not in the first wave of owners. We should be able to get a test drive soon, I will report on any progress as it happens.
If you are interested in electric cars, look for Robert Llewellyn's Fully Charged video show on YouTube. He has been test driving everything and he's good fun to watch. You may recognize him as the actor who played Kryten in Red Dwarf.
Sunday, September 26, 2010
Solar Power - The Year in Review
We installed solar panels in August 2009, and they were turned on during September. We have now had a whole year of power, which is shown below. The time-of-use metering means that we get paid much more for the power we generate in the afternoon than the power we use overnight, so the "zero point" for billing is different than that for power consumption. PG&E recently sent us our annual bill, which was about $500. Added to the monthly bills for basic service, this comes to about $700 for the whole year. Our previous electricity bill was about $2000 a year. Since we changed our hot water, clothes dryer and range from propane to electric, we saved over $1000 in propane cost as well. That puts payback at around ten years at today's prices; given the likelihood of increased propane and electricity prices over time, actual payback would be earlier than that.
The panels have collected a lot of dust (the roof is too high to get at easily to clean them), so they aren't running at peak efficiency. The best day we saw was around 28 kWh, with a lot of days of 26-27 kWh through the summer. The totals for each month are shown in the screenshots below.
Since we have built a new garage, we now have a lot more roof area, so we are now looking to add another 4 kW of panels, and to swap out our propane furnace for a heat pump that will give us heating and cooling (yay - it's hot today...) sometime next spring. Then the only use for propane will be the emergency generator. It should also make the garage a bit cooler in the summer by shading half of the roof with panels.
Thursday, August 26, 2010
Netflix for iPhone in the cloud and HTML5
Returning to the theme of my last post, Netflix is hiring engineers to work on cloud tools, platforms and performance, and advanced user interfaces. I think we are breaking new ground in both areas and it's an exciting place to be. We have very high standards and are looking for the best people in the industry to come and help us...
Wednesday, August 18, 2010
Eventual Consistency of Cloud?
The first point of disagreement is the claim that public clouds aren't getting the input they need from customers to mature their services, because real enterprise customers aren't running in the public cloud. Well, here at Netflix, we are giving Amazon exactly that kind of input. We use every feature they have, we are driving them hard and Amazon is taking the input and improving their product rapidly in ways that benefit all the users of AWS.
My main disagreement is with the claim that lots of individual IT departments will converge on a single standard that will win out over public cloud standards. I find this highly implausible: there are a host of vendors feeding technology to enterprise clouds, and they will all do the usual vendor thing of looking for ways to lock in the customer; even if they build on the same standards, their implementations will be different. Here's an example: the "private" Enterprise Unix variants Solaris, AIX, HP-UX, IRIX, OSF/1 etc. are all based on the same Unix standards; however, the "public" alternatives are Linux and BSD, and Linux has won the mind share in this space. I see Linux as an analogy for the public cloud in the sense that there is a very low barrier to adoption. For Linux, you can download it to run on any old computer for nothing, learn to use it, then build very low cost solutions out of it. For AWS, for a few dollars on the existing Amazon account you use to buy books etc., you can explore all the features and learn to build very powerful systems in a few hours. This has produced a large population of very productive engineers who know how to use AWS, Linux and other open source tools to solve problems rapidly at low cost. In contrast, if every enterprise cloud moves ahead by solving their problems independently they will produce a variety of architectures, each optimized to their own problem, and with their own tooling, and a very small number of people who know how to run each variant. You will also find that every company uses the public cloud to get stuff done more quickly and cheaply than the IT department, so that will become the common standard.
Part of the thinking behind Netflix' move to the cloud is that large public cloud providers like Amazon will have far more engineers working on making their infrastructure robust, scalable, well-automated and secure than any individual enterprise could afford. By using AWS we are leveraging a huge investment made by Amazon, and paying a small amount for it on a month-by-month basis. We also get to allocate resources efficiently: for example, how much does it cost to provision a large cage of equipment in a new datacenter, and how long does it take from deciding to do it to having it running reliably in production? Let's say $10M and many months. Instead we could spend $10M on licensing more movies and TV shows to stream, and grow incrementally in the cloud. In a few months' time we have more customers than if we had spent the money up front on buying compute capacity, and we just keep provisioning new instances in the cloud, so we never end up with a datacenter full of inappropriately sized or obsolete equipment. At present, Netflix' growth is accelerating, so it is difficult to guess in advance how much capacity to invest in, but we have already flat-lined our datacenter capacity, and have all incremental capacity happening on AWS. For example, we can just fire up a few thousand instances for a week to encode all the new movies we just bought the rights to, then stop paying for them until another big deal closes. Likewise, on Oscars night there is a big spike in web traffic, and we can grow on the day and shrink afterwards as needed, without planning it and buying hardware a long time in advance.
While the other public and private cloud vendors are competing to come up with standards, we are finding that resumes from the kind of engineers we want to hire already reference their experience with AWS as a de-facto cloud standard. It's also easier to attract the best people if they will learn transferable skills and work on the very latest technologies.
That might sound like lock-in, but a well-designed architecture is layered, and the actual AWS dependencies are very localized in our code base. The bet on the end game is that in the coming years, other cloud vendors will produce large-scale AWS-compatible offerings (full featured, not just EC2 and S3), and a very large scale multi-vendor low cost public cloud market will be created. Then even a large and fast growing enterprise like Netflix will be an insignificant and ever smaller proportion of the cloud. By definition, you can't be an insignificant proportion of your own private cloud....
Monday, August 16, 2010
Reducing TCP retransmit timeout?
Here is a relevant paper, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication":
http://www.cs.cmu.edu/~vrv/papers/sigcomm147-vasudevan.pdf
Friday, August 06, 2010
Open letter to my Sun friends at Oracle
I have also been talking to a few friends who stayed at Sun and are now at Oracle, and there is a common thread that I decided to put out there in this blog post.
This week I presented at a local Computer Measurement Group meeting, talking about how easy it is to use the Amazon cloud to run Hadoop jobs to process terabytes of data for a few bucks [slideshare]. I followed a talk on optimizing your Mainframe software licensing costs by tweaking workload manager limits. There are still a lot of people working away on IBM Mainframes, but it's not where interesting new business models go to take over the world.
The way I see the Oracle/Sun merger is that Oracle wanted to compete more directly with IBM, and they will invest in the bits of Sun that help them do that. Oracle has a very strong focus on high margin sales, so they will most likely succeed in making good money with help from Solaris and SPARC to compete with AIX, z/OS and P-series, selling to late-adopter industries like Banking, Insurance etc. Just look where the Mainframes are still being used. Sun could never focus on just the profitable business on its own, because it had a long history of leading edge innovation that is disruptive and low margin. However, what was innovative once is now a legacy technology base of Solaris and SPARC, and it's not even a topic of discussion in the leading edge of disruptive innovators, who are running on x64 in the cloud on Linux and a free open source stack. There is no prospect of revenue for Oracle in this space, so they are right to ignore it.
That is what I meant when I tweeted that Illumos is as irrelevant as Solaris, and it is legacy computing. I don't mean Solaris will go away, I'm sure it will be the basis of a profitable business for a long time, but the interesting things are happening elsewhere, specifically in public cloud and "infrastructure as code".
You might point to Joyent, who use Solaris, and now have Bryan Cantrill on board, but they are a tiny bit-player in cloud computing and Amazon are running away with the cloud market, and creating a set of de-facto standard APIs that make it hard to differentiate and compete. You might point to enterprise or private clouds, but as @scottsanchez tweeted: "Define: Private Cloud ... 1/2 the features of a public cloud, for 4x the cost", that's not where the interesting things are happening.
So to my Sun friends at Oracle, if you want to work for a profitable company and build up your retirement fund Oracle is an excellent place to be. However, there are a lot of people who joined Sun when it was re-defining the computer industry, changing the rules, disrupting the competition. If you want some of that you need to re-tool your skill set a bit and look for stepping stones that can take you there.
When Sun shut down our HPC team in 2004 I deliberately left the Enterprise Computing market; I didn't want to work for a company that sold technology to other companies, I wanted to sell web services to end consumers, and I had contacts at eBay who took me on. In 2007 I joined Netflix, and it's the best place I've ever worked, but I needed that time at eBay to orient myself to a consumer-driven business model and re-tool my skill set; I couldn't have joined Netflix directly.
There are two slideshare presentations on the Netflix web site, one on the company culture, the other on the business model. It is expected that anyone who is looking for a job has read and inwardly digested them both (it's basically an interview fail if you haven't). These aren't aspirational puff pieces written by HR; along with everyone else in Netflix management (literally, at a series of large offsites), I was part of the discussion that helped our CEO Reed Hastings write and edit them both.
What can you do to "escape"? The tools are right there; you don't need to invest significant money, you just need to carve out some spare time to use them. Everything is either free open source, or available for a few cents or dollars on the Amazon cloud. The best two things you can have on your resume are hands-on experience with the Amazon Web Services tool set, and links to open source projects that you have contributed to. There isn't much demand for C or C++ programmers, but Objective-C is an obvious next step; it's quite fun to code in, and you can develop user interfaces for iPhone/iPad in a few lines of code that back-end into cloud services. Java (for app servers like Tomcat, and on Android phones), Ruby-on-Rails, and Python are the core languages that are being used to build innovative new businesses nowadays. If you are into data or algorithms, then you need to figure out how to use Hadoop, which as I describe in one of my slideshare decks is trivially available from Amazon. You can even get an HPC cluster on a 10Gbit Ethernet interconnect from Amazon now. There is a Hadoop-based open source algorithm project called Mahout that is always looking for contributors.
To find the jobs themselves, spend time on LinkedIn. I use it to link to anyone I think might be interesting to hire or work with. Your connections have value since it is always good to hire people that know other good people. Keep your own listing current and join groups that you find interesting, like Java Architecture or Cloud Computing, and Sun Alumni. At this point LinkedIn is the main tool used by recruiters and managers to find people.
Good luck, and keep in touch (you can find me on LinkedIn or twitter @adrianco :-)