Use Case
A TV-based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, which happens to be in a different Amazon Availability Zone and needs to respond with the modified list.
Favorites Storage
The favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user, value=movielist record on writes, and returns the movielist on reads.
Question 1: Availability Zones
When an API server reads and writes the favorites store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones? Is the initial write local, or spread over the zones? Is the write replication zone aware, so that data is replicated to more than one zone?
Let's assume for discussion purposes that we are using MongoDB across three availability zones in a region. We would have a replica set member in each of the three zones. One member will be elected primary at a given point in time.
All writes will be sent to the primary, and then propagate to the secondaries from there. Thus, writes are often inter-zone. However, latency between availability zones is fairly low (I assume the context here is EC2).
Reads can go either to the primary, if immediate/strong consistency semantics are desired, or to the member in the local zone, if eventually consistent read semantics are acceptable.
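The routing decision above can be sketched as a tiny function. This is an illustration of the policy, not a MongoDB driver API; the zone names and the helper are hypothetical.

```python
def choose_read_target(need_strong: bool, primary_zone: str, local_zone: str) -> str:
    """Return the zone a read should go to.

    Strongly consistent reads go to the primary (possibly cross-zone);
    eventually consistent reads stay in the caller's local zone.
    """
    return primary_zone if need_strong else local_zone

# A client in us-east-1b reading back its just-written favorites list needs
# the primary; a plain browse request can read from its local zone member.
assert choose_read_target(True, "us-east-1a", "us-east-1b") == "us-east-1a"
assert choose_read_target(False, "us-east-1a", "us-east-1b") == "us-east-1b"
```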
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
- one zone decides that it is still working for reads and writes, but at half size, and the other zone decides it is offline
- both zones continue to satisfy reads, but refuse writes until repaired
- data that has a master copy in the good zone supports reads and writes; slave copies stop for both reads and writes
- both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Let's assume again that we are using three zones; we could use two, but three is more interesting. To be primary in a replica set, a member must be visible to a majority of the members of the set: in this case, two of the three members, or two of the three zones. If one zone is partitioned from the other two, a member on the two-zone side of the partition will become primary, if one is not already. It will be available for reads and writes.
The minority partition will not service writes. Eventually consistent reads are still possible in the minority partition.
Once the partition heals, the servers automatically reconcile.
http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions?
MongoDB supports atomic operations on single documents, both via its $ operators ($set, $inc, and so on) and via compare-and-swap operations. In MongoDB one could model the list as a document per favorite, or put all the favorites in a single BSON object. In both cases, atomic operations free of race conditions are possible. http://www.mongodb.org/display/DOCS/Atomic+Operations
This is why MongoDB elects a primary node: to facilitate these atomic operations for use cases where these semantics are required.
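Both approaches can be sketched without a server. The update document below has the shape a driver would send for an atomic append ($addToSet adds the movie only if absent, in one server-side step); the compare-and-swap variant is simulated against an in-memory dict, using a hypothetical "version" field for optimistic concurrency. Collection and field names are illustrative, not from the original post.

```python
# Approach 1: a single atomic operator. With a real driver this update document
# would be sent as: collection.update({"_id": user}, update_doc)
update_doc = {"$addToSet": {"movies": "Heat"}}

# Approach 2: compare-and-swap via a version field, simulated in memory.
# The swap only succeeds if nobody modified the record since we read it;
# on a real server the compare and the swap happen as one atomic operation.
store = {"user1": {"movies": ["Alien"], "version": 1}}

def cas_append(store, user, movie):
    while True:                                       # retry loop
        doc = dict(store[user])                       # read
        new = {"movies": doc["movies"] + [movie],     # modify
               "version": doc["version"] + 1}
        if store[user]["version"] == doc["version"]:  # compare
            store[user] = new                         # swap
            return new
        # else: someone raced us; loop and re-read

cas_append(store, "user1", "Heat")
assert store["user1"] == {"movies": ["Alien", "Heat"], "version": 2}
```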
Yes. A single BSON document must be under the size limit (currently 8MB). If the data could be larger than this, one should consider modeling it as multiple documents during schema design.
And are queries by attribute/value supported?
Yes. For performance, MongoDB supports secondary (composite) indices.
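To make the query-by-attribute idea concrete, here is a sketch using document-per-favorite records and a MongoDB-style query document, evaluated in plain Python. The field names and the tiny matcher are illustrative; on a real server the compound index mentioned above would let the query avoid a full scan.

```python
# Document-per-favorite modeling: one record per (user, movie) pair.
favorites = [
    {"user": "u1", "movie": "Alien", "rating": 5},
    {"user": "u1", "movie": "Heat",  "rating": 4},
    {"user": "u2", "movie": "Alien", "rating": 3},
]

# MongoDB-style query document: u1's favorites rated 4 or better.
query = {"user": "u1", "rating": {"$gte": 4}}

def matches(doc, query):
    """Minimal evaluator for equality and $gte conditions."""
    for field, cond in query.items():
        if isinstance(cond, dict):                    # operator form
            if "$gte" in cond and not doc.get(field, 0) >= cond["$gte"]:
                return False
        elif doc.get(field) != cond:                  # equality form
            return False
    return True

results = [d for d in favorites if matches(d, query)]
assert [d["movie"] for d in results] == ["Alien", "Heat"]
# A compound secondary index on (user, rating) would serve this query
# efficiently, e.g. ensureIndex({"user": 1, "rating": 1}) in the shell.
```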
The general assumption is that the storage system is reliable. Thus, one would normally use RAID with mirroring, or a service like EBS, which has intrinsic mirroring.
However, the BSON format has a reasonable amount of structure to it. It is highly probable, although not certain, that a corrupt object would be detected and an error reported. This could then be corrected with a database repair operation.
Note: the above assumes an actual storage system fault. Another case of interest is simply a hard crash of the server. MongoDB 1.6 requires a --repair after such a crash. MongoDB 1.8 (pending) is crash-safe in its storage engine via journaling.
The most common method is to have a replica that is used for backups only, perhaps an inexpensive server or VM. This node can be taken offline at any time and any backup strategy used. Once re-enabled, it will catch back up. http://www.mongodb.org/display/DOCS/Backups
With something like EBS, quick snapshotting is possible using the fsync-and-lock command.
One can stop the server(s), restore the old data file images, and restart.
MongoDB supports a slaveDelay option, which allows one to force a replica to stay a certain number of hours behind real time. This is a good way to maintain a rolling backup in case someone "fat-fingers" a database operation.
This shares a number of characteristics with a sharded MySQL setup, including the use of out-of-service replicas for backup purposes and the existence of a primary. A key difference is that the primary is elected and managed by the system rather than by a human operator.

However, the ability to direct a query to the primary allows the implementation of a web service that provides consistent read-after-write behavior. Such a web service can return a short-lived cookie; a client that requires read-after-write consistency returns the cookie on the subsequent request. The cookie can contain the ids of objects that have been recently modified, and when those modifications took place. The cookie lifetime is chosen so that replication is relatively certain to have completed before the cookie expires. When the service receives a request with the cookie, and the request is to read an object listed in the cookie within a period too short to be reasonably certain of replication across all replicas, the service can direct the read to the primary to see the recently written content.

It's not guaranteed, but it's pretty good, and other measures can be taken to preclude duplication (unique indexes). I've done this before with MySQL-based web services to direct read-after-write queries to the master.
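The cookie-based routing described above can be sketched as follows. The token format, the id naming, and the replication window length are all illustrative assumptions, not part of any MongoDB or MySQL API.

```python
import time

# Assumed window within which replication to all secondaries may not yet
# have completed; reads of recently written objects inside this window go
# to the primary.
REPL_WINDOW_SECONDS = 5.0

def make_cookie(object_ids, now=None):
    """Short-lived token returned to the client after a write."""
    return {"ids": set(object_ids), "written_at": now if now is not None else time.time()}

def route_read(object_id, cookie, now=None):
    """Return "primary" or "replica" for this read."""
    now = now if now is not None else time.time()
    if (cookie
            and object_id in cookie["ids"]
            and now - cookie["written_at"] < REPL_WINDOW_SECONDS):
        return "primary"
    return "replica"

c = make_cookie(["movielist:user1"], now=100.0)
assert route_read("movielist:user1", c, now=102.0) == "primary"  # within window
assert route_read("movielist:user1", c, now=110.0) == "replica"  # window expired
assert route_read("movielist:user2", c, now=102.0) == "replica"  # not in cookie
assert route_read("movielist:user1", None, now=102.0) == "replica"  # no cookie sent
```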