MongoDB versus ClustrixDB
In Part 1 of my comparison, I ran some performance benchmarks to establish that relational systems can scale performance. In this post I would like to focus more on the High Availability and Fault Tolerance aspects of the two systems. The post will go over the approach of each system and what it means for fault tolerance and availability. I will also conduct a test of the claims: I’m going to fail a node by pulling power to see what happens.
A Primer on Clustrix Data Distribution
Clustrix has a fine grained approach to data distribution. The following graphic demonstrates the basic concepts and terminology used by our system. Notice that unlike MongoDB (and many other systems for that matter), Clustrix applies a per-index distribution strategy.
There are many interesting implications for query evaluation and execution in our model, and the topic deserves its own set of posts. For the curious, you can get a brief introduction to our evaluation model from our white paper on the subject. For this post, I’m going to stick to how our model applies to fault tolerance and availability.
You can find documentation for MongoDB’s distribution approach on their website. In brief, MongoDB chooses a single distribution key for the collection. Indexes are co-located with the primary key shard.
Both Clustrix and MongoDB rely on replicas for fault tolerance. A loss of a node results in a loss of some copy of the data which we can find elsewhere in the system. The MongoDB team put together a good set of documentation describing their replication model. Perhaps one of the most salient differences between the two approaches is the granularity of data distribution. The unit of recovery on Clustrix is the replica (a small portion of an index), while the unit of recovery in MongoDB is a full instance of a Replication Set.
For Clustrix, this means that the reprotection operation happens in a many-to-many fashion. Several nodes copy small portions of data from each of their disks to several other nodes onto many disks. The advantages of this approach are:
- No single disk in the system becomes overloaded with writes or reads
- No single node hotspot for driving the reprotect work
- Incremental progress toward full protection
- Independent replica factors for each index (e.g. primary key 3x, indexes 2x)
- Automatic reprotection which doesn’t require operator intervention
- All replicas are always consistent
One of the interesting aspects of the system is the complete automation of every recovery task. It’s built in. I don’t have to do anything to make that happen. So if I have a 10 node system, and a node fails, in an hour or so I will have a completely protected 9 node system without any operator intervention at all. When the 10th node comes back, the system will simply perceive a distribution imbalance and start moving data back onto that node.
While the Replica Sets feature in MongoDB is nicer than replication in say MySQL, it’s still highly manual. So in contrast with the above list for Clustrix, for MongoDB we have:
- Manual intervention to recover from failure
- The data is moved in a one-to-one fashion
- All data within a Replica Set has the same protection factor
- Failures can lead to inconsistency
Both systems rely on having multiple copies of data for Availability. I’ve seen a lot of interesting discussion recently about the CAP theorem and what it means for real-world distributed database systems. It’s another deep topic which really deserves its own set of posts, so I’ll simply link to a couple posts on the subject which I find interesting and illuminating:
- Stonebraker of VoltDB on partition tolerance
- James Hamilton of Amazon/AWS on consistency
At Clustrix, we think that Consistency, Availability, and Performance are much more important than Partition tolerance. Within a cluster, Clustrix keeps availability in the face of node loss while keeping strong consistency guarantees. But we do require that more than half of the nodes in the cluster group membership are online before accepting any user requests. So a cluster provides fully ACID compliant transactional semantics while keeping a high level of performance, but you need majority of the nodes online.
However, Clustrix also offers a lower level of consistency in the way of asynchronous replication between clusters. So if you want to setup a disaster recovery target in another physical location over high-latency link, we’re able to accommodate that mode. It simply means that your backup cluster may be out of date by some number of transactions.
MongoDB has relaxed consistency all around. The Replication Set itself uses an asynchronous online casino canada replication model. The MongoDB guys are upfront about the kinds of anomalies they expose. The end user gets the equivalent of read uncommitted isolation. Mongo’s claim is that they do this because they (1) can achieve higher performance, and (2) “merging back old operations later, after another node has accepted writes, is a hard problem.” Yes. Distributed protocols are a hard problem, but it doesn’t mean you should punt on them.
There’s also a more nuanced discussion to availability. One of the principal design features of Clustrix has been to aim for lock-free operation whenever possible. We have Multi-Version Concurrency Control (MVCC) deeply ingrained in the system. It allows a transaction to see a consistent snapshot of the database without interfering with writes. So a read in our system will not block a write.
Building on top of MVCC, Clustrix has implemented a transactionally safe, lockless, and fully consistent method for moving data in the cluster without blocking any writes to that data. All of this happens completely automatically. No administrator intervention required. So when the Rebalancer decides to move a replica from Node 1 to Node 3, the replica can continue to take writes. We have a mechanism to sync changes to the source replica with the target replica without limiting the replica availability.
Compare that to what many other systems do: read lock the source to get a consistent view for a replica copy. You end up locking out writers for the duration of the data copy. So while your data is available for reads, it is not available for writes.
After a node failure (or to be more precise, a replica failure within a set), MongoDB advocates the following approach:
- Quiesce the master (read lock)
- Flush dirty buffers to disk (fsync)
- Take an LVM snapshot of the resulting files
- Unlock the master
- Move the data files over to the slave
- Let the slave catch up from the snapshot
So a couple of points (a) the MongoDB is not available for writes during steps (1) and (2), and (b) it’s a highly manual process. It reminds me very much of the MySQL best practices for setting up a slave.
I’ve seen a lot of heated debates about consistency, availability, performance, and fault tolerance. These issues are deeply interconnected and it’s difficult to write about any of them in isolation. Clustrix maintains a high level of performance without sacrificing consistency and very high degree of availability. I know that it’s possible to build such a system because we actually built it. And you shouldn’t sacrifice these features in your application because you believe it’s the only way to achieve good performance.