Showing posts with label nosql. Show all posts
Showing posts with label nosql. Show all posts

Thursday, January 19, 2012

Thoughts on SimpleDB, DynamoDB and Cassandra

I've been getting a lot of questions about DynamoDB, and these are my personal thoughts on the product, and how it fits into cloud architectures.

I'm excited to see the release of DynamoDB, it's a very impressive product with great performance characteristics. It should be the first option for startups and new cloud projects on AWS. I also think it marks the turning point on solid state disks, they will be the default for new database products and benchmarks going forward.

There are a few use cases where SimpleDB may still be useful, but DynamoDB replaces it in almost all cases. I've talked about the history of Netflix use of SimpleDB before, but it's relevant to the discussion on DynamoDB, so here goes.

When Netflix was looking at moving to cloud about three years ago we had an internal debate about how to handle storage on AWS. There were strong proponents for using MySQL, SimpleDB was fairly new, and other alternatives were nascent NoSQL projects or expensive enterprise software. We started some pathfinder projects to explore the two alternatives and decided to port an existing MySQL app to AWS, while building a replication pipeline that copied data out of our Oracle datacenter systems into SimpleDB to be consumed in the cloud. The MySQL experience showed that we would have trouble scaling, and SimpleDB seemed reliable, so we went ahead and kept building more data sources on SimpleDB, with large blobs of data on S3.

Along the way we put memcached in front of SimpleDB and S3 to improve read latency. The durability of SimpleDB is its strongest point, we have had Oracle and SAN data corruption bugs in the Datacenter over the last few years, but never lost or corrupted any SimpleDB data. The limitations of SimpleDB are its biggest problem. We worked around limits on table size, row and attribute size, and per-request overhead caused by http requests needing to be authenticated for every call.

So the lesson here is that for a first step into NoSQL, we went with a hosted solution so that we didn't have to build a team of experts to run it, and we didn't have to decide in advance how much scale we needed. Starting again from scratch today, I would probably go with DynamoDB. It's a low "friction" and developer friendly solution.

Late in 2010 we were planning the next step, turning off Oracle and making the cloud the master copy of the data. One big problem is that our backups and archives were based on Oracle, and there was no way to take a snapshot or incremental backup of SimpleDB. The only way to get data out in bulk is to run "SELECT * FROM table" over HTTP and page the requests. This adds load, takes too long, and costs a lot because SimpleDB charges for the time taken in SELECT calls.

We looked at about twenty NoSQL options during 2010, trying to understand what the differences were, and eventually settled on Cassandra as our candidate for prototyping. In a week or so, we had ported a major data source from SimpleDB to Cassandra and were getting used to the new architecture, running benchmarks etc. We evaluated some other options, but decided to take Cassandra to the next stage and develop a production data source using it.

The things we liked about Cassandra were that it is written in Java (we have a building full of Java engineers), it is packed full of state of the art distributed systems algorithms, we liked the code quality, we could get commercial support from Datastax, it is scalable and as an extra bonus it had multi-region support. What we didn't like so much was that we had to staff a team to own running Cassandra for us, but we retrained some DBAs and hired some good engineers including Vijay Parthasarathy, who had worked on the multi-region Cassandra development at Cisco Webex and who recently became a core committer on the Apache Cassandra project. We also struggled with the Hector client library, and have written our own (which we plan to release soon). The blessing and a curse of Cassandra is that it is an amazingly fast moving project. New versions come fast and furiously, which makes it hard to pick a version to stabilize on, however the changes we make turn up in the mainstream releases after a few weeks. Saying "Cassandra doesn't do X" is more of a challenge than a statement. If we need "X" we work with the rest of the Apache project to add it.

Throughout 2010 the product teams at Netflix gradually moved their backend data sources to Cassandra, we worked out the automation we needed to do easy self service deployments and ended up with a large number of clusters. In preparation for the UK launch we also made changes to Cassandra to better support multi-region deployment on EC2, and we are currently running several Cassandra clusters that span the US and European AWS regions.

Now that DynamoDB has been released, the obvious question is whether Netflix has any plans to use it. The short answer is no, because it's a subset of the Cassandra functionality that we depend on. However that doesn't detract from the major step forward from SimpleDB in performance, scalability and latency. For new customers, or people who have outgrown the scalability of MySQL or MongoDB, DynamoDB is an excellent starting point for data sources on AWS. The advantages of zero administration combined with the performance and scalability of a solid state disk backend are compelling.

Personally my main disappointment with DynamoDB is that it doesn't have any snapshot or incremental backup capability. The AWS answer is that you can extract data into EMR then store it in S3. This is basically the same answer as SimpleDB, it's a full table scan data extraction (which takes too long and costs too much and isn't incremental). The mechanism we built for Cassandra leverages the way that Cassandra writes immutable files to get a callback and compress/copy them to S3 as they are written, it's extremely low overhead. If we corrupt our data with a code bug and need to roll back, or take a backup in production and restore in test, we have all the files archived in S3.

One argument against DynamoDB is that DynamoDB is on AWS only, so customers could get locked in, however it's easy to upgrade applications from SimpleDB, to DynamoDB and to Cassandra. They have similar schema models, consistency options and availability models. It's harder to go backwards, because Cassandra has more features and fewer restrictions. Porting between NoSQL data stores is trivial compared to porting between relational databases, due to the complexity of the SQL language dialects and features and the simplicity of the NoSQL offerings. Starting out on DynamoDB then switching to Cassandra when you need more direct control over the installation or Cassandra specific features like multi-region support is a very viable strategy.

As early adopters we have had to do a lot more pioneering engineering work than more recent cloud converts. Along the way we have leveraged AWS heavily to accelerate our own development, and built a lot of automation around Cassandra. While SimpleDB has been a minor player in the NoSQL world DynamoDB is going to have a much bigger impact. Cassandra has matured and got easier to use and deploy over the last year but it doesn't scale down as far. By that I mean a single developer in a startup can start coding against DynamoDB without needing any support and with low and incremental costs. The smallest Cassandra cluster we run is six m1.xlarge instances spread over three zones with triple replication.

I've been saying for a while that 2012 is the year that NoSQL goes mainstream, DynamoDB is another major step in validating that move. The canonical CEO to CIO conversation is moving from 2010: "what's our cloud strategy?", 2011: "what's our big data strategy?" to 2012: "what's our NoSQL strategy?".

Wednesday, March 02, 2011

Maslow's Hierarchy of NoSQL Reads (and Writes)

I tried out Prezi to create this presentation, it was more fun to create than powerpoint but has a few limitations. I would like better drawing tools. The talk itself puts some of the main NoSQL solutions into context.



Wednesday, November 17, 2010

NoSQL Netflix Use Case Comparison for Translattice

[There is some discussion of this posting with comments by Michael at Slashdot]
Michael Lyle @mplyle CTO of Translattice kindly provided a set of answers that I have interspersed with the questions below. Translattice isn't technically a NoSQL system, but it isn't a conventional database either. It's a distributed relational SQL database that supports eventual consistency, as Michael puts it:
These answers are for the Translattice Application Platform (TAP)'s database component. Unlike other stores that have answered this question set, TAP contains a relational database that scales out over identical nodes. TAP further allows applications written to the J2EE platform to scale out across the same collection of nodes.
The original set of questions are posted here. Each respondent will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
In the Translattice Application Platform, data in relational tables is transparently sharded behind the scenes by the data store. These shards are stored redundantly across the nodes. Reads are satisfied with the most local copy of data available on the network, unless that resource is currently overloaded in which case the system may fall back to reads from more distant locations.

When it comes to writes, applications have the choice on the durability and isolation levels for changes. Each transaction may be made in a fully synchronous, serializable isolation level, or may be made in a locked eventually consistent mode that provides ACID serializable semantics except that durability may be sacrificed if an availability zone fails. A further asynchronous mode allows potentially conflicting changes to be made and allows a user-provided reconciliation function to decide which change "wins". A final commit requires a majority of nodes storing a shard to be available; in the case of the fully synchronous mode this would delay or prevent the return of success if a critical subset of the cluster fails.

Policy mechanisms in the system allow administrators to specify how physical and cloud database instances correspond to administratively-relevant zones. An administrator can choose to require, for instance, that each piece of information is replicated to at least three database nodes across a total of two availability zones. An administrator may also use these mechanisms to require that particular tables or portions of tables must or must not be stored in a given zone (for instance, to meet compliance or security requirements). Within the constraints set by policy, the system tracks usage patterns and places information in the most efficient locations.

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Assuming that the SQL transaction in question is running in the fully synchronous or eventually consistent locked mode, writes will only be allowed in one of the two zones. Reads will continue in both zones, but will only be able to satisfy requests for which at least one replica of the requested data exists in the local zone (policy can be specified to ensure that this is always the case). In the eventually consistent mode, multiple partitioned portions of the system can accept writes and reconcile later. Essentially, any of the above desired modes can be used on a transaction-by-transaction basis depending on application and performance requirements.
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Because fully relational primitives are provided, there can easily be one row in the database per favorite. Read-modify-write of the whole list is not required, and the only practical limits are application-defined.

Any SQL queries are supported against the store, and are transformed by the query planner into an efficient plan to execute the query across the distributed system. Of course, how efficient a query is to execute will depend on the structure of the data and the indexes that an administrator has created. We think this allows for considerable flexibility and business agility as the exact access methods that will be used on the data do not need to be fully determined in advance.
Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?
Network activity is protected by cryptographic hash authentication, which provides integrity verification as a side benefit. Distributed transactions also take place through a global consensus protocol that uses hashes to ensure that checkpoints are in a consistent state (this is also how the system maintains transactional integrity and consistency when changes cross many shards). Significant portions of the on-disk data are also presently protected by checksums and allow the database to "fail" a disk if corrupt data is read.
Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?
A good portion of this relational database's consistency model is implemented through a distributed multi-version concurrency control (MVCC) system. Tuples that are in-use are preserved as the database autovacuum process will not remove tuples until it is guaranteed that no one could be looking at them anymore. This allows a consistent version of the tables as of a point in time to be viewed from within a transaction; so BEGIN TRANSACTION; SELECT ... [or COPY FROM, to backup] ; COMMIT; works. We provide mechanisms to allow database dumps to occur via this type of mechanism.
In the future we are likely to use this mechanism to allow quick snapshots of the entire database and rollbacks to previous snapshot versions (as well as to allow the use of snapshots to stage development versions of application code without affecting production state).

Tuesday, November 09, 2010

NoSQL Netflix Use Case Comparison for Riak

Justin Sheehy @justinsheehy of Basho kindly provided a set of answers that I have interspersed with the questions below.

The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?

There are two possibilities with Riak. The first would be to spread a single Riak cluster across all three zones, for example one node in each of three zones. In this case, a single replica of each item would exist in each zone. Whether or not a response needed to wait on cross-zone traffic to complete would depend on the consistency level in the individual request. The second option would require Riak EnterpriseDS, and involves placing a complete cluster in each zone and configuring them to perform inter-cluster replication. This has multiple advantages. Every request would be satisfied entirely locally, and would be independent of latency or availability characteristics across zone boundaries. Another benefit is that (unlike either the first scenario or some other solutions that spread clusters and quorums over a long haul) read requests would not generate any cross-zone traffic at all. For an application with a high percentage of reads, this can make a large difference.

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair

As write-availability is a central goal achieved in Riak, the fourth option will be the observed behavior. This is the case regardless of the strategy chosen for Question 1. In the first strategy, local nodes other than the canonical homes for given data will accept the writes instead, using the hinted-handoff technique. In the second strategy, the local cluster will accept the write, those changes will be replayed across the replication link when the zones are reconnected. In all cases, vector clocks provide a clean way of resolving most inconsistency, and various reconciliation models are available to the user for those cases which cannot be syntactically resolved.


For more information on vector clocks in Riak, see:


http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/

and

http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/


Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?

Riak will use vector clocks to recognize causality in race conditions. In the case of two overlapping writes to the same value, Riak will retain both unless explicitly requested to simply overwrite with the last value received. If one client changes A to B and another changes A to C, then (unless told to overwrite) Riak will return both B and C to the client. When that client then modifies the object again, the single descendant "D" that they created will be the new value. For applications such as sets which are mostly added to and rarely deleted from, the application code to perform this reconciliation is trivial and in some cases is simply a set union operation. This would look a bit like this in terms of vector clock ancestry:

http://dl.dropbox.com/u/751099/ndiag1.png




Riak allows values to be of any arbitrary content type, but if the content is in JSON then a JavaScript map/reduce request can be used to query by attribute/value.

Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

Many layers of Riak perform consistency checking, including CRC checking in the persistence engine and object equality in the distributed state machines handling requests. In most cases where corruption can be detected in a given replica of some item, that replica will immediately but asynchronously be fixed via read-repair.

Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

There are two approaches to back up Riak systems: per-node or whole-cluster. Backing up per-node is the easiest option for many people, and is quite simple. Due to bitcask (the default storage engine) performing writes in an append-only fashion and never re-opening any file for writing once closed, Riak nodes can easily be backed up via the filesystem backup method of your choice. Simply replacing the content of the data directory will reset a node's stored content to what it held at the time. Alternately, a command line backup command is available which will write out a backup of all data on the cluster. This is fairly network and disk intensive and requires somewhere to put a whole-cluster backup, but is very useful for prototyping situations which are not holding enormous amounts of data.


Sunday, October 31, 2010

NoSQL Netflix Use Case Comparison for MongoDB

Roger Bodamer @rogerb from 10gen.com kindly provided a set of answers for MongoDB that I have interspersed with the questions below. The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons. If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?

Let's assume for discussion purposes that we are using MongoDB across three availability zones in a region. We would have a replica set member in each of the three zones. One member will be elected primary at a given point in time.


All writes will be sent to the primary, and then propagate to secondaries from there. Thus, writes are often inter-zone. However availability zones are fairly low latency (I assume the context here is EC2).


Reads can be either to the primary, if immediate/strong consistency semantics are desired, or to the local zone member, if eventually consistent read semantics are acceptable.


Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Let's assume again we are using three zones - we could use two but three is more interesting. To be primary in a replica set, the primary must be visible to a majority of the members of the set: in this case, two thirds of the members, or two thirds of the zones. If one zone is partitioned from the other two, what will happen is: a member in the 2 zone side of the partition will become primary, if not already. It will be available for reads and writes.

The minority partition will not service writes. Eventually consistent reads are still possible in the minority partition.

Once the partition heals, the servers automatically reconcile.

http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions?

MongoDB supports atomic operations on single documents via both its $ operators ($set, $inc) and also by compare-and-swap operations. In MongoDB one could model the list as a document per favorite, or, put all the favorites in a single BSON object. In both cases atomic operations free of race conditions are possible. http://www.mongodb.org/display/DOCS/Atomic+Operations


This is why mongodb elects a node primary: to facilitate these atomic operations for use cases where these semantics are required.


If multiple attribute/values are supported for a key, can an additional value be written directly without reading first?
Yes.
What limits exist on the size of the value or number of attribute/values?

A single BSON document must be under the limit -- currently that limit is 8MB. If larger than this, one should consider modeling as multiple documents during schema design.

and are queries by attribute/value supported?

Yes. For performance, MongoDB supports secondary (composite) indices.


Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

The general assumption is that the storage system is reliable. Thus, one would normally use a RAID with mirroring, or a service like EBS which has intrinsic mirroring.


However, the BSON format has a reasonable amount of structure to it. It is highly probable, although not certain, that a corrupt object would be detected and an error reported. This could then be correct with a database repair operation.


Note: the above assumes an actual storage system fault. Another case of interest is simply a hard crash of the server. MongoDB 1.6 requires a --repair after this. MongoDB v1.8 (pending) is crash-safe in its storage engine via journaling.


Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup?

The most used method is to have a replica which is used for backups only; perhaps an inexpensive server or VM. This node can be taken offline at any time and any backup strategy used. Once re-enabled, it will catch back up. http://www.mongodb.org/display/DOCS/Backups


With something like EBS, quick snapshotting is possible using the fsync-and-lock command.


For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

One can stop the server(s), restore the old data file images, and restart.


MongoDB supports a slaveDelay option which allows one to force a replica to stay a certain number of hours behind realtime. This is a good way to maintain a rolling backup in case of someone "fat-fingering" a database operation.

Friday, October 29, 2010

NoSQL Netflix Use Case Comparison for Cassandra

Jonathan Ellis @spyced of Riptano kindly provided a set of answers that I have interspersed with the questions below.

The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
The most natural way to model per-user favorites in Cassandra is to have one row per user, keyed by the userid, whose column names are movie IDs. The combination of allowing dynamic column creation within a row and allowing very large rows (up to 2 billion columns in 0.7) means that you can treat a row as a list or map, which is a natural fit here. Performance will be excellent since columns can be added or modified without needing to read the row first. (This is one reason why thinking of Cassandra as a key/value store, even before we added secondary indexes, was not really correct.)

The best introduction to Cassandra data modeling is Max Grinev's series on
basics, translating SQL concepts, and idempotence.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
Briefly, both reads and writes have a ConsistencyLevel parameter controlling how many replicas across how many zones must reply for the request to succeed. Routing is aware of current response times as well as network topology, so given an appropriate ConsistencyLevel, reads can be routed around temporarily slow nodes.

On writes, the coordinator node (the one the client sent the request to) will send the write to all replicas; as soon as enough success messages come back to satisfy the desired consistency level, the coordinator will report success to the client.

For more on consistency levels, see
Ben Black's excellent presentation.
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Cassandra has no 'master copy' for any piece of data; all copies are equal. The other behaviors are supported by different ConsistencyLevel values for reads (R) and writes (W):

R=QUORUM, W=QUORUM: One zone decides that it is still working for reads and writes, and the other zone decides it is offline
R=ONE, W=ALL: Both zones continue to satisfy reads, but refuse writes
R=ONE, W=ONE: Both zones continue to accept writes, and reconcile any inconsistencies when the partition heals

I would also note that reconciliation is timestamp-based at the column level, meaning that updates to different columns within a row will never conflict, but when writes have been allowed in two partitions to the same column, the highest timestamp will win. (This is another way Cassandra differs from key/value stores, which need more complex logic called vector clocks to be able to merge updates to different logical components of a value.)
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Cassandra's ColumnFamily model generally obviates the need for a read before a write, e.g., as above using movie IDs as column names. (If you wanted to allow duplicates in the list for some reason, you would generally use a UUID as the column name on insert instead of the movie ID.)

The maximum value size is 2GB although in practice we recommend using 8MB as a more practical maximum. Splitting a larger blob up across multiple columns is straightforward given the dynamic ColumnFamily design. The maximum row size is 2 billion columns. Queries by attribute value are supported with secondary indexes in 0.7.
Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?
Cassandra handles repairing corruption the same way it does other data inconsistencies, with read repair and anti-entropy repair.
Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?
Because Cassandra's data files are immutable once written, creating a point-in-time snapshot is as simple as hard-linking the current set of sstables on the filesystem. Performance impact is negligible since hard links are so lightweight. Rolling back simply consists of moving a set of snapshotted files into the live data directory. The snapshot is as consistent as your ConsistencyLevel makes it: any write visible to readers at a given ConsistencyLevel before the snapshot will be readable from the snapshot after restore. The only scalability problem with snapshot management is that past a few TB, it becomes impractical to try to manage snapshots centrally; most companies leave them distributed across the nodes that created them.

Wednesday, October 27, 2010

Comparing NoSQL Availability Models

let's risk feeding the CAP trolls, and try to get some insight into the differences between the many NoSQL contenders. I have circulated an earlier version of this to a few people and got at least one good response. If you have answers, or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.


Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone? 

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?

Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?