Friday, October 29, 2010

NoSQL Netflix Use Case Comparison for Cassandra

Jonathan Ellis @spyced of Riptano kindly provided a set of answers that I have interspersed with the questions below.

The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
The most natural way to model per-user favorites in Cassandra is to have one row per user, keyed by the userid, whose column names are movie IDs. The combination of allowing dynamic column creation within a row and allowing very large rows (up to 2 billion columns in 0.7) means that you can treat a row as a list or map, which is a natural fit here. Performance will be excellent since columns can be added or modified without needing to read the row first. (This is one reason why thinking of Cassandra as a key/value store, even before we added secondary indexes, was not really correct.)

The best introduction to Cassandra data modeling is Max Grinev's series on
basics, translating SQL concepts, and idempotence.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
Briefly, both reads and writes have a ConsistencyLevel parameter controlling how many replicas across how many zones must reply for the request to succeed. Routing is aware of current response times as well as network topology, so given an appropriate ConsistencyLevel, reads can be routed around temporarily slow nodes.

On writes, the coordinator node (the one the client sent the request to) will send the write to all replicas; as soon as enough success messages come back to satisfy the desired consistency level, the coordinator will report success to the client.

For more on consistency levels, see
Ben Black's excellent presentation.
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Cassandra has no 'master copy' for any piece of data; all copies are equal. The other behaviors are supported by different ConsistencyLevel values for reads (R) and writes (W):

R=QUORUM, W=QUORUM: One zone decides that it is still working for reads and writes, and the other zone decides it is offline
R=ONE, W=ALL: Both zones continue to satisfy reads, but refuse writes
R=ONE, W=ONE: Both zones continue to accept writes, and reconcile any inconsistencies when the partition heals

I would also note that reconciliation is timestamp-based at the column level, meaning that updates to different columns within a row will never conflict, but when writes have been allowed in two partitions to the same column, the highest timestamp will win. (This is another way Cassandra differs from key/value stores, which need more complex logic called vector clocks to be able to merge updates to different logical components of a value.)
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Cassandra's ColumnFamily model generally obviates the need for a read before a write, e.g., as above using movie IDs as column names. (If you wanted to allow duplicates in the list for some reason, you would generally use a UUID as the column name on insert instead of the movie ID.)

The maximum value size is 2GB although in practice we recommend using 8MB as a more practical maximum. Splitting a larger blob up across multiple columns is straightforward given the dynamic ColumnFamily design. The maximum row size is 2 billion columns. Queries by attribute value are supported with secondary indexes in 0.7.
Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?
Cassandra handles repairing corruption the same way it does other data inconsistencies, with read repair and anti-entropy repair.
Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?
Because Cassandra's data files are immutable once written, creating a point-in-time snapshot is as simple as hard-linking the current set of sstables on the filesystem. Performance impact is negligible since hard links are so lightweight. Rolling back simply consists of moving a set of snapshotted files into the live data directory. The snapshot is as consistent as your ConsistencyLevel makes it: any write visible to readers at a given ConsistencyLevel before the snapshot will be readable from the snapshot after restore. The only scalability problem with snapshot management is that past a few TB, it becomes impractical to try to manage snapshots centrally; most companies leave them distributed across the nodes that created them.

5 comments:

  1. great writeup!

    thx for sharing.

    ReplyDelete
  2. I'm confused about the suggestion of hard links as a strategy for implementing backup. Sure, this gets you a snapshot, but if your disk gets hosed, you're out of luck. I guess the assumption is that backup is accounted for by replication?

    ReplyDelete
  3. To test my understanding: in order to satisfy the read-after-write case, the ConsistencyLevel will have to be equal to the number of replicas, otherwise the read won't return the modified list. However, in the case of a partitioning event, doesn't that imply that the write must fail (be refused, or the co-ordinator times out, etc) because not all of the replicas can be reached?

    ReplyDelete
  4. Michael: Since they are immutable, I would think that you can simply make a hard-link to have a local "copy", and to make a backup elsewhere you would simply copy that file.

    (Disclaimer: I have never used Cassandra – but that's how I read the response.)

    ReplyDelete
  5. Very detailed and insightful

    ReplyDelete