These answers are for the Translattice Application Platform (TAP)'s database component. Unlike other stores that have answered this question set, TAP contains a relational database that scales out over identical nodes. TAP further allows applications written to the J2EE platform to scale out across the same collection of nodes.
Use Case
A TV-based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, which happens to be in a different Amazon Availability Zone and needs to respond with the modified list.
Favorites Storage
The favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user, value=movielist record on writes and returns the movielist on reads.
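For concreteness, here is the record shape that describes, rendered as a single-row-per-user key/value table. This is only an illustration in a PostgreSQL-style SQL dialect; the table and column names are not part of any particular store's API:

```sql
-- Illustrative key/value shape: one record per user, the whole movie list as the value.
CREATE TABLE favorites_kv (
    user_id   BIGINT PRIMARY KEY,  -- key = user
    movielist TEXT   NOT NULL      -- value = serialized movie list
);

-- Write: replace the user's entire list.
INSERT INTO favorites_kv (user_id, movielist)
VALUES (42, '[1234, 5678]')
ON CONFLICT (user_id) DO UPDATE SET movielist = EXCLUDED.movielist;

-- Read: return the whole list.
SELECT movielist FROM favorites_kv WHERE user_id = 42;
```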
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally or spread over all zones? Is the initial write local or spread over the zones? Is the write replication zone aware, so that data is replicated to more than one zone?
In the Translattice Application Platform, data in relational tables is transparently sharded behind the scenes by the data store, and these shards are stored redundantly across the nodes. Reads are satisfied from the most local copy of the data available on the network, unless that resource is currently overloaded, in which case the system may fall back to reads from more distant locations.

For writes, applications can choose the durability and isolation level of each change. A transaction may run in a fully synchronous, serializable isolation level, or in a locked, eventually consistent mode that provides ACID serializable semantics except that durability may be sacrificed if an availability zone fails. A further asynchronous mode allows potentially conflicting changes to be made, with a user-provided reconciliation function deciding which change "wins". A final commit requires a majority of the nodes storing a shard to be available; in the fully synchronous mode this delays or prevents the return of success if a critical subset of the cluster fails.

Policy mechanisms in the system allow administrators to specify how physical and cloud database instances correspond to administratively relevant zones. An administrator can require, for instance, that each piece of information be replicated to at least three database nodes across a total of two availability zones. An administrator may also use these mechanisms to require that particular tables, or portions of tables, must or must not be stored in a given zone (for instance, to meet compliance or security requirements). Within the constraints set by policy, the system tracks usage patterns and places information in the most efficient locations.
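As a concrete illustration of this per-transaction choice, consider the sketch below in a PostgreSQL-style SQL dialect. The `favorites(user_id, movie_id)` table is the hypothetical one-row-per-favorite layout sketched further below, and `translattice.commit_mode` is a made-up setting name standing in for whatever mechanism actually selects the mode; neither is TAP's documented syntax:

```sql
-- Fully synchronous, serializable transaction: success is not reported until a
-- majority of the nodes storing each affected shard have committed the change.
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO favorites (user_id, movie_id) VALUES (42, 1234);
COMMIT;

-- Locked, eventually consistent transaction: same ACID serializable semantics,
-- but durability may be sacrificed if an availability zone fails.
-- 'translattice.commit_mode' is a hypothetical setting name, not actual TAP syntax.
BEGIN;
SET LOCAL translattice.commit_mode = 'locked_eventual';
INSERT INTO favorites (user_id, movie_id) VALUES (42, 5678);
COMMIT;
```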
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Assuming that the SQL transaction in question is running in the fully synchronous or locked eventually consistent mode, writes will only be allowed in one of the two zones. Reads will continue in both zones, but will only be able to satisfy requests for which at least one replica of the requested data exists in the local zone (policy can be specified to ensure that this is always the case). In the asynchronous, eventually consistent mode, multiple partitioned portions of the system can accept writes and reconcile them on repair. Essentially, any of the desired modes above can be used on a transaction-by-transaction basis, depending on application and performance requirements.
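For the partition scenario above, a write made in the asynchronous, reconciling mode might look like the following sketch; as before, the dialect is PostgreSQL-style and the `translattice.commit_mode` setting name is hypothetical. Either zone could accept such a write during the outage, with the user-provided reconciliation function deciding the winner once connectivity is restored:

```sql
-- Asynchronous, eventually consistent write: both partitioned zones may accept
-- it; conflicting changes are reconciled on repair by a user-provided function.
-- 'translattice.commit_mode' is a hypothetical setting name, not actual TAP syntax.
BEGIN;
SET LOCAL translattice.commit_mode = 'async';
INSERT INTO favorites (user_id, movie_id) VALUES (42, 9012);
COMMIT;
```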
Because fully relational primitives are provided, there can easily be one row in the database per favorite. Read-modify-write of the whole list is not required, and the only practical limits are application-defined.

Any SQL query is supported against the store and is transformed by the query planner into an efficient plan for execution across the distributed system. Of course, how efficiently a query executes will depend on the structure of the data and on the indexes an administrator has created. We think this allows for considerable flexibility and business agility, as the exact access methods that will be used on the data do not need to be fully determined in advance.
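A minimal sketch of the one-row-per-favorite layout described above, in a PostgreSQL-style dialect (table, column, and index names are illustrative, not TAP-specific):

```sql
-- One row per favorite, rather than one list blob per user.
CREATE TABLE favorites (
    user_id  BIGINT      NOT NULL,
    movie_id BIGINT      NOT NULL,
    added_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, movie_id)
);

-- Adding a favorite touches a single row; no read-modify-write of the whole list.
INSERT INTO favorites (user_id, movie_id) VALUES (42, 1234);

-- Reading the list back is an ordinary query that the planner can distribute.
SELECT movie_id
FROM favorites
WHERE user_id = 42
ORDER BY added_at;

-- A secondary index an administrator might add later, e.g. to answer
-- "which users favorited this movie" efficiently.
CREATE INDEX favorites_by_movie ON favorites (movie_id);
```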
Network activity is protected by cryptographic hash authentication, which provides integrity verification as a side benefit. Distributed transactions also take place through a global consensus protocol that uses hashes to ensure that checkpoints are in a consistent state (this is also how the system maintains transactional integrity and consistency when changes cross many shards). Significant portions of the on-disk data are also presently protected by checksums, allowing the database to "fail" a disk if corrupt data is read.
A good portion of this relational database's consistency model is implemented through a distributed multi-version concurrency control (MVCC) system. Tuples that are in use are preserved: the autovacuum process will not remove tuples until it is guaranteed that no transaction can still see them. This allows a consistent version of the tables as of a point in time to be viewed from within a transaction, so BEGIN TRANSACTION; SELECT ... ; COMMIT; (or COPY, for backups) works. We provide mechanisms that allow database dumps to be taken in this way.
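A sketch of the kind of consistent dump this enables, again in a PostgreSQL-style dialect (the output path is illustrative):

```sql
-- Every statement inside the transaction sees the same MVCC snapshot,
-- even while concurrent writes continue elsewhere.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Consistent point-in-time reads...
SELECT count(*) FROM favorites;

-- ...or a consistent export for backup purposes (path is illustrative).
COPY favorites TO '/backups/favorites.csv' WITH (FORMAT csv);

COMMIT;
```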
In the future we are likely to use this mechanism to allow quick snapshots of the entire database and rollbacks to previous snapshot versions (as well as to allow the use of snapshots to stage development versions of application code without affecting production state).