Sunday, October 31, 2010

NoSQL Netflix Use Case Comparison for MongoDB

Roger Bodamer @rogerb from 10gen.com kindly provided a set of answers for MongoDB that I have interspersed with the questions below. The original set of questions is posted here. Each NoSQL contender will get their own blog post with answers; when there are enough to be interesting, I will write some summary comparisons. If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor; I will post their answers here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that it is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, which happens to be in a different Amazon Availability Zone and needs to respond with the modified list.

Favorites Storage
The favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user, value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally or spread over all zones? Is the initial write local or spread over the zones? Is the write replication zone aware, so that data is replicated to more than one zone?

Let's assume for discussion purposes that we are using MongoDB across three availability zones in a region. We would have a replica set member in each of the three zones. One member will be elected primary at a given point in time.


All writes will be sent to the primary and then propagate to the secondaries from there. Thus, writes are often inter-zone. However, latency between availability zones is fairly low (I assume the context here is EC2).


Reads can be either to the primary, if immediate/strong consistency semantics are desired, or to the local zone member, if eventually consistent read semantics are acceptable.
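
Below is a minimal sketch of how these two read modes map onto driver calls. It uses the modern PyMongo API (which postdates this post); the hostnames, replica set name, and database/collection names are illustrative assumptions, not details from the answer above.

```python
from pymongo import MongoClient, ReadPreference

# One replica set member per availability zone (hypothetical hosts).
client = MongoClient(
    "mongodb://mongo-1a:27017,mongo-1b:27017,mongo-1c:27017",
    replicaSet="favorites",
)
favorites = client.netflix.favorites

# Strongly consistent read: always routed to the primary,
# which is often in another zone.
doc = favorites.with_options(
    read_preference=ReadPreference.PRIMARY
).find_one({"_id": "user123"})

# Eventually consistent read: routed to the lowest-latency member,
# which in practice is usually the replica in the local zone.
doc = favorites.with_options(
    read_preference=ReadPreference.NEAREST
).find_one({"_id": "user123"})
```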


Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Let's assume again that we are using three zones - we could use two, but three is more interesting. To be primary in a replica set, a member must be visible to a majority of the members of the set: in this case, two thirds of the members, or two of the three zones. If one zone is partitioned from the other two, a member on the two-zone side of the partition will become primary, if one is not already. It will be available for reads and writes.

The minority partition will not service writes. Eventually consistent reads are still possible in the minority partition.

Once the partition heals, the servers automatically reconcile.

http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts

Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions?

MongoDB supports atomic operations on single documents, both via its $ operators ($set, $inc) and via compare-and-swap operations. In MongoDB one could model the list as a document per favorite, or put all the favorites in a single BSON object. In both cases atomic operations free of race conditions are possible. http://www.mongodb.org/display/DOCS/Atomic+Operations
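
As a hedged sketch of the single-document modeling option, here is how an append could avoid the read-modify-write cycle in PyMongo; the collection, field, and version-counter names are my own illustrative choices, not part of the answer above.

```python
from pymongo import MongoClient

favorites = MongoClient().netflix.favorites

# Atomic append: $addToSet adds the movie only if it is not already
# in the list, with no read required and hence no race condition.
favorites.update_one(
    {"_id": "user123"},
    {"$addToSet": {"movies": "movie456"}},
    upsert=True,
)

# Compare-and-swap style update: replace the whole list only if the
# document still matches the (hypothetical) version we read earlier.
favorites.update_one(
    {"_id": "user123", "version": 7},
    {"$set": {"movies": ["movie456", "movie789"]}, "$inc": {"version": 1}},
)
```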


This is why MongoDB elects a primary node: to facilitate these atomic operations for use cases where such semantics are required.


If multiple attribute/values are supported for a key, can an additional value be written directly without reading first?
Yes.
What limits exist on the size of the value or number of attribute/values?

A single BSON document must be under the limit -- currently that limit is 8MB. Anything larger than this should be modeled as multiple documents during schema design.

and are queries by attribute/value supported?

Yes. For performance, MongoDB supports secondary (composite) indices.
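
For illustration, a secondary index and an attribute query might look like the following in PyMongo (the field and collection names are assumptions):

```python
from pymongo import MongoClient, ASCENDING

favorites = MongoClient().netflix.favorites

# Secondary index on the movies array to support attribute queries.
favorites.create_index([("movies", ASCENDING)])

# Find all users who have favorited a given movie.
for doc in favorites.find({"movies": "movie456"}):
    print(doc["_id"])
```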


Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

The general assumption is that the storage system is reliable. Thus, one would normally use RAID with mirroring, or a service like EBS, which has intrinsic mirroring.


However, the BSON format has a reasonable amount of structure to it, so it is highly probable, although not certain, that a corrupt object would be detected and an error reported. This could then be corrected with a database repair operation.


Note: the above assumes an actual storage system fault. Another case of interest is simply a hard crash of the server. MongoDB 1.6 requires a --repair after this. MongoDB v1.8 (pending) is crash-safe in its storage engine via journaling.


Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup?

The most common method is to have a replica that is used for backups only; perhaps an inexpensive server or VM. This node can be taken offline at any time and any backup strategy used; once re-enabled, it will catch back up. http://www.mongodb.org/display/DOCS/Backups


With something like EBS, quick snapshotting is possible using the fsync-and-lock command.
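
A rough sketch of that fsync-and-lock sequence, using PyMongo's admin command interface (the fsyncUnlock command shown is the modern form; older servers used a different unlock mechanism, and the snapshot step itself is elided):

```python
from pymongo import MongoClient

client = MongoClient()

# Flush pending writes to disk and block further writes
# so the on-disk files are consistent during the snapshot.
client.admin.command("fsync", lock=True)
try:
    pass  # take the EBS snapshot here, e.g. via the AWS API
finally:
    # Release the lock so normal write traffic resumes.
    client.admin.command("fsyncUnlock")
```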


For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

One can stop the server(s), restore the old data file images, and restart.


MongoDB supports a slaveDelay option, which forces a replica to stay a set number of hours behind realtime. This is a good way to maintain a rolling backup in case someone "fat-fingers" a database operation.
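
A sketch of how such a delayed replica could be configured through a reconfig, using PyMongo against a modern server; the member index, replica set name, and four-hour delay are illustrative (recent MongoDB versions rename slaveDelay to secondaryDelaySecs):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-1a:27017", replicaSet="favorites")

config = client.admin.command("replSetGetConfig")["config"]
member = config["members"][2]    # the member chosen as the delayed replica
member["priority"] = 0           # never eligible to become primary
member["hidden"] = True          # invisible to normal client reads
member["slaveDelay"] = 4 * 3600  # stay four hours behind realtime
config["version"] += 1           # a reconfig requires a version bump

client.admin.command({"replSetReconfig": config})
```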

Friday, October 29, 2010

NoSQL Netflix Use Case Comparison for Cassandra

Jonathan Ellis @spyced of Riptano kindly provided a set of answers that I have interspersed with the questions below.

The original set of questions is posted here. Each NoSQL contender will get their own blog post with answers; when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor; I will post their answers here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that it is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, which happens to be in a different Amazon Availability Zone and needs to respond with the modified list.

Favorites Storage
The favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user, value=movielist record on writes, and returns the movielist on reads.
The most natural way to model per-user favorites in Cassandra is to have one row per user, keyed by the userid, whose column names are movie IDs. The combination of allowing dynamic column creation within a row and allowing very large rows (up to 2 billion columns in 0.7) means that you can treat a row as a list or map, which is a natural fit here. Performance will be excellent since columns can be added or modified without needing to read the row first. (This is one reason why thinking of Cassandra as a key/value store, even before we added secondary indexes, was not really correct.)
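
As a minimal sketch of this row-per-user model, here is what the favorites operations could look like with pycassa, the Python client of that era; the keyspace, column family, and host names are illustrative assumptions:

```python
import pycassa

pool = pycassa.ConnectionPool("Netflix", ["cassandra-host:9160"])
favorites = pycassa.ColumnFamily(pool, "Favorites")

# Adding a movie is a blind write of a new column: no read required.
favorites.insert("user123", {"movie456": ""})

# Reading the row back returns the favorites as column names
# (up to the default column_count page size).
movie_ids = list(favorites.get("user123").keys())
```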

The best introduction to Cassandra data modeling is Max Grinev's series on basics, translating SQL concepts, and idempotence.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally or spread over all zones? Is the initial write local or spread over the zones? Is the write replication zone aware, so that data is replicated to more than one zone?
Briefly, both reads and writes have a ConsistencyLevel parameter controlling how many replicas across how many zones must reply for the request to succeed. Routing is aware of current response times as well as network topology, so given an appropriate ConsistencyLevel, reads can be routed around temporarily slow nodes.

On writes, the coordinator node (the one the client sent the request to) will send the write to all replicas; as soon as enough success messages come back to satisfy the desired consistency level, the coordinator will report success to the client.
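
A hedged pycassa sketch of setting the ConsistencyLevel per request (names as in the earlier sketch; all of them are assumptions, not details from the answer):

```python
import pycassa
from pycassa import ConsistencyLevel

pool = pycassa.ConnectionPool("Netflix", ["cassandra-host:9160"])
favorites = pycassa.ColumnFamily(pool, "Favorites")

# The write succeeds once a quorum of replicas, spread across
# zones by the replication strategy, has acknowledged it.
favorites.insert("user123", {"movie456": ""},
                 write_consistency_level=ConsistencyLevel.QUORUM)

# The read is satisfied by a single, typically local, replica.
row = favorites.get("user123",
                    read_consistency_level=ConsistencyLevel.ONE)
```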

For more on consistency levels, see Ben Black's excellent presentation.

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Cassandra has no 'master copy' for any piece of data; all copies are equal. The other behaviors are supported by different ConsistencyLevel values for reads (R) and writes (W):

  • R=QUORUM, W=QUORUM: one zone decides that it is still working for reads and writes, and the other zone decides it is offline
  • R=ONE, W=ALL: both zones continue to satisfy reads, but refuse writes
  • R=ONE, W=ONE: both zones continue to accept writes, and reconcile any inconsistencies when the partition heals

I would also note that reconciliation is timestamp-based at the column level, meaning that updates to different columns within a row will never conflict, but when writes have been allowed in two partitions to the same column, the highest timestamp will win. (This is another way Cassandra differs from key/value stores, which need more complex logic called vector clocks to be able to merge updates to different logical components of a value.)
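
For concreteness, the three R/W pairings above could be expressed as pycassa column family defaults like this (an application would normally pick just one pairing; all names are illustrative):

```python
import pycassa
from pycassa import ConsistencyLevel

pool = pycassa.ConnectionPool("Netflix", ["cassandra-host:9160"])

# R=QUORUM, W=QUORUM: the majority side of a partition stays
# available; the minority side refuses both reads and writes.
cf_quorum = pycassa.ColumnFamily(
    pool, "Favorites",
    read_consistency_level=ConsistencyLevel.QUORUM,
    write_consistency_level=ConsistencyLevel.QUORUM)

# R=ONE, W=ALL: both sides keep satisfying reads, but writes
# are refused until every replica is reachable again.
cf_read_anywhere = pycassa.ColumnFamily(
    pool, "Favorites",
    read_consistency_level=ConsistencyLevel.ONE,
    write_consistency_level=ConsistencyLevel.ALL)

# R=ONE, W=ONE: both sides keep accepting writes; column
# timestamps reconcile the copies when the partition heals.
cf_available = pycassa.ColumnFamily(
    pool, "Favorites",
    read_consistency_level=ConsistencyLevel.ONE,
    write_consistency_level=ConsistencyLevel.ONE)
```
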
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Cassandra's ColumnFamily model generally obviates the need for a read before a write, e.g., as above using movie IDs as column names. (If you wanted to allow duplicates in the list for some reason, you would generally use a UUID as the column name on insert instead of the movie ID.)

The maximum value size is 2GB, although we recommend treating 8MB as a more practical maximum. Splitting a larger blob up across multiple columns is straightforward given the dynamic ColumnFamily design. The maximum row size is 2 billion columns. Queries by attribute value are supported with secondary indexes in 0.7.
Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?
Cassandra handles repairing corruption the same way it does other data inconsistencies, with read repair and anti-entropy repair.
Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?
Because Cassandra's data files are immutable once written, creating a point-in-time snapshot is as simple as hard-linking the current set of sstables on the filesystem. Performance impact is negligible since hard links are so lightweight. Rolling back simply consists of moving a set of snapshotted files into the live data directory. The snapshot is as consistent as your ConsistencyLevel makes it: any write visible to readers at a given ConsistencyLevel before the snapshot will be readable from the snapshot after restore. The only scalability problem with snapshot management is that past a few TB, it becomes impractical to try to manage snapshots centrally; most companies leave them distributed across the nodes that created them.
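
To make the hard-link idea concrete, here is an illustrative-only Python sketch; a real deployment would use Cassandra's own snapshot tooling, and the paths are made up:

```python
import glob
import os

data_dir = "/var/lib/cassandra/data/Netflix"   # live sstables (assumed path)
snap_dir = "/var/lib/cassandra/snapshots/2010-10-29"
os.makedirs(snap_dir)

for path in glob.glob(os.path.join(data_dir, "*")):
    if os.path.isfile(path):
        # A hard link shares the immutable on-disk bytes with the live
        # file, so the "copy" is nearly free in both time and space.
        os.link(path, os.path.join(snap_dir, os.path.basename(path)))
```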

Wednesday, October 27, 2010

Comparing NoSQL Availability Models

Let's risk feeding the CAP trolls and try to get some insight into the differences between the many NoSQL contenders. I have circulated an earlier version of this to a few people and got at least one good response. If you have answers, or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.


Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor; I will post their answers here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that it is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, which happens to be in a different Amazon Availability Zone and needs to respond with the modified list.

Favorites Storage
The favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user, value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally or spread over all zones? Is the initial write local or spread over the zones? Is the write replication zone aware, so that data is replicated to more than one zone?

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?

Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

Sunday, October 10, 2010

Netflix in the Cloud

I'm presenting this talk on Thursday at the Cloud Computing Meetup, and again on Nov 3rd at QConSF. So far I have posted a "teaser" summary on slideshare. After QCon I will post the full slide deck [update: combined deck from both talks posted here].

The meetup is the "beta test" of the presentation. It's in Mountain View at "Hacker Dojo", and at the time of writing 437 people have signed up to attend. If everyone turns up it's going to be crazy, with an overflow crowd trying to park and get in, so get there early.... I will focus the meetup talk more on the operational aspects of the cloud architecture, and on migration techniques.

At QConSF, the presentation is in the "Architectures you've always wondered about" track, and I will spend more time talking about the software architecture.

Why give these talks? We aren't trying to sell cloud to CIOs; it's all about hiring, which is why we are speaking at engineer-focused events in the Bay Area. Netflix is growing fast, pathfinding new technologies in the public cloud to support a very agile business model, and trying to attract the best and brightest people. We use LinkedIn a lot when we recruit for positions, so feel free to connect to me if you think this could be interesting, or follow @adrianco on Twitter.

Wednesday, October 06, 2010

Time for some climate action on 10/10/10

Organized by 350.org, there are events all over the world. The significance of 350 is that it is the safe level of CO2 in the atmosphere, in parts per million. We are already at 390ppm, which is why the climate is changing and extreme weather events are becoming common. The level is currently going up by 2ppm a year, and the changes needed to reverse this haven't started, so it's going to go much higher and have increasingly worse effects.

Even if all you do is buy and start reading a copy of James Hansen's book Storms of My Grandchildren (you can get it instantly from the Kindle store), it will help.

Here is a useful discussion of "The Value of Coherence in Science", which explains how you can tell what makes sense from what must be wrong. I'm a Physics graduate, and I was taught that science works by making predictions and testing them, and by eliminating incoherent propositions; i.e., if an argument contradicts itself, it must be false. The denialist counter-arguments fail this test. To see why the denialist arguments are getting any air time at all, read Merchants of Doubt and Climate Cover Up.

For an example of incoherent and shoddy denialist work, see John Mashey's systematic deconstruction of the Wegman Report, which was presented to Congress as an impartial study when it was nothing of the sort.

Sunday, October 03, 2010

More solar power

We just signed up for almost 8kW on the garage roof. It's justified by replacing our propane furnace with a heat pump that also gives us air conditioning, and by running an electric car (e.g. a Nissan Leaf). We can use our existing monitoring system, but unlike last time, when we bought the system outright, we plan to lease this one from Solar City. There is a four-month lead time, so we can pick the exact terms of the lease next February before it is installed. There are several options: zero down with a monthly charge that increases slightly each year, or various amounts up front with a fixed monthly payment. The monthly payments are all less than a typical electricity bill.

The goal for next year is to only use propane to run the backup generator (usually for a few days a year), and to make a dent in our petrol usage for the daily commute. We are on the Nissan Leaf list, but not in the first wave of owners. We should be able to get a test drive soon, I will report on any progress as it happens.

If you are interested in electric cars, look for Robert Llewellyn's Fully Charged video show on YouTube. He has been test driving everything, and he's good fun to watch. You may recognize him as the actor who played Kryten in Red Dwarf.