Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.
Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
There are two possibilities with Riak. The first would be to spread a single Riak cluster across all three zones, for example one node in each of three zones. In this case, a single replica of each item would exist in each zone. Whether or not a response needed to wait on cross-zone traffic to complete would depend on the consistency level in the individual request. The second option would require Riak EnterpriseDS, and involves placing a complete cluster in each zone and configuring them to perform inter-cluster replication. This has multiple advantages. Every request would be satisfied entirely locally, and would be independent of latency or availability characteristics across zone boundaries. Another benefit is that (unlike either the first scenario or some other solutions that spread clusters and quorums over a long haul) read requests would not generate any cross-zone traffic at all. For an application with a high percentage of reads, this can make a large difference.
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
As write-availability is a central goal achieved in Riak, the fourth option will be the observed behavior. This is the case regardless of the strategy chosen for Question 1. In the first strategy, local nodes other than the canonical homes for given data will accept the writes instead, using the hinted-handoff technique. In the second strategy, the local cluster will accept the write, those changes will be replayed across the replication link when the zones are reconnected. In all cases, vector clocks provide a clean way of resolving most inconsistency, and various reconciliation models are available to the user for those cases which cannot be syntactically resolved.
For more information on vector clocks in Riak, see:
http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/
and
http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/
Riak will use vector clocks to recognize causality in race conditions. In the case of two overlapping writes to the same value, Riak will retain both unless explicitly requested to simply overwrite with the last value received. If one client changes A to B and another changes A to C, then (unless told to overwrite) Riak will return both B and C to the client. When that client then modifies the object again, the single descendant "D" that they created will be the new value. For applications such as sets which are mostly added to and rarely deleted from, the application code to perform this reconciliation is trivial and in some cases is simply a set union operation. This would look a bit like this in terms of vector clock ancestry:
http://dl.dropbox.com/u/751099/ndiag1.png
Riak allows values to be of any arbitrary content type, but if the content is in JSON then a JavaScript map/reduce request can be used to query by attribute/value.
Many layers of Riak perform consistency checking, including CRC checking in the persistence engine and object equality in the distributed state machines handling requests. In most cases where corruption can be detected in a given replica of some item, that replica will immediately but asynchronously be fixed via read-repair.
There are two approaches to back up Riak systems: per-node or whole-cluster. Backing up per-node is the easiest option for many people, and is quite simple. Due to bitcask (the default storage engine) performing writes in an append-only fashion and never re-opening any file for writing once closed, Riak nodes can easily be backed up via the filesystem backup method of your choice. Simply replacing the content of the data directory will reset a node's stored content to what it held at the time. Alternately, a command line backup command is available which will write out a backup of all data on the cluster. This is fairly network and disk intensive and requires somewhere to put a whole-cluster backup, but is very useful for prototyping situations which are not holding enormous amounts of data.
We should chat. @ruv
ReplyDelete