Sunday, October 31, 2010

NoSQL Netflix Use Case Comparison for MongoDB

Roger Bodamer @rogerb from 10gen.com kindly provided a set of answers for MongoDB that I have interspersed with the questions below. The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons. If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?

Let's assume for discussion purposes that we are using MongoDB across three availability zones in a region. We would have a replica set member in each of the three zones. One member will be elected primary at a given point in time.


All writes will be sent to the primary, and then propagate to secondaries from there. Thus, writes are often inter-zone. However availability zones are fairly low latency (I assume the context here is EC2).


Reads can be either to the primary, if immediate/strong consistency semantics are desired, or to the local zone member, if eventually consistent read semantics are acceptable.


Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Let's assume again we are using three zones - we could use two but three is more interesting. To be primary in a replica set, the primary must be visible to a majority of the members of the set: in this case, two thirds of the members, or two thirds of the zones. If one zone is partitioned from the other two, what will happen is: a member in the 2 zone side of the partition will become primary, if not already. It will be available for reads and writes.

The minority partition will not service writes. Eventually consistent reads are still possible in the minority partition.

Once the partition heals, the servers automatically reconcile.

http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions?

MongoDB supports atomic operations on single documents via both its $ operators ($set, $inc) and also by compare-and-swap operations. In MongoDB one could model the list as a document per favorite, or, put all the favorites in a single BSON object. In both cases atomic operations free of race conditions are possible. http://www.mongodb.org/display/DOCS/Atomic+Operations


This is why mongodb elects a node primary: to facilitate these atomic operations for use cases where these semantics are required.


If multiple attribute/values are supported for a key, can an additional value be written directly without reading first?
Yes.
What limits exist on the size of the value or number of attribute/values?

A single BSON document must be under the limit -- currently that limit is 8MB. If larger than this, one should consider modeling as multiple documents during schema design.

and are queries by attribute/value supported?

Yes. For performance, MongoDB supports secondary (composite) indices.


Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

The general assumption is that the storage system is reliable. Thus, one would normally use a RAID with mirroring, or a service like EBS which has intrinsic mirroring.


However, the BSON format has a reasonable amount of structure to it. It is highly probable, although not certain, that a corrupt object would be detected and an error reported. This could then be correct with a database repair operation.


Note: the above assumes an actual storage system fault. Another case of interest is simply a hard crash of the server. MongoDB 1.6 requires a --repair after this. MongoDB v1.8 (pending) is crash-safe in its storage engine via journaling.


Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup?

The most used method is to have a replica which is used for backups only; perhaps an inexpensive server or VM. This node can be taken offline at any time and any backup strategy used. Once re-enabled, it will catch back up. http://www.mongodb.org/display/DOCS/Backups


With something like EBS, quick snapshotting is possible using the fsync-and-lock command.


For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

One can stop the server(s), restore the old data file images, and restart.


MongoDB supports a slaveDelay option which allows one to force a replica to stay a certain number of hours behind realtime. This is a good way to maintain a rolling backup in case of someone "fat-fingering" a database operation.

1 comment:

  1. This shared a number of characteristics with a sharded MySQL setup including the use of out-of-service replicas for backup purposes, and the existence of a primary. A key difference is that the primary is elected and managed by the system rather than by a human operator. However, the ability to direct a query to the primary allows the implementation of a web service that can provide consistent read-after-write behavior. Such a web service can return a short-lived cookie; a client that requires read-after-write consistency returns the cookie on the subsequent request. The cookie can contain the id's of objects that have been recently modified, and when those modifications took place. The cookie lifetime is chosen to be relatively certain that replication has occurred before the cookie's expiration. When the service receives a request with the cookie, if the request is to read an object listed in the cookie, and the request is within a period too short to be reasonably certain of replication across all replicas, the service can direct the read to the primary to see the recently written content. It's not guaranteed, but its pretty good, and other measures can be taken to preclude duplication (unique indexes); I've done this before with MySQL-based web services to direct read-after-writes to the master.

    ReplyDelete