How Clustrix Maintains ACID
A user on the Database Administrators StackExchange recently asked the question, “Why can’t RDBMSs cluster the way NoSQL does?” The question received several good answers, one of which essentially amounted to ‘ACID is hard in a shared-nothing system,’ and then proceeded to enumerate what was hard about it. While there’s no doubt that maintaining ACID properties in a clustered database has its challenges, it’s far from impossible. This blog entry is culled from a response I gave in that StackExchange thread and describes how Clustrix maintains ACID in a clustered environment.
Clustrix offers a clustered, shared-nothing scalable database software delivered either as an appliance or an AWS AMI. The nodes of a cluster are homogeneous peers, connected by a relatively high-speed network (Infiniband for the appliance and Ethernet for AWS). Relational data is hash distributed across the nodes on a per-index basis in chunks we call ‘slices.’ Each slice will have two or more replicas spread throughout the cluster for durability in the event of a node or disk failure. Clients connect to any node in the cluster to issue queries using the MySQL wire protocol.
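To make the slice layout concrete, here is a minimal sketch of per-index hash distribution with replicas on distinct nodes. All names (`slice_for_key`, `replica_nodes`, the node list, the round-robin placement) are illustrative assumptions, not Clustrix internals:

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]
SLICES_PER_INDEX = 8
REPLICAS = 2  # each slice keeps two or more copies for durability

def slice_for_key(index_name: str, key: str) -> int:
    """Hash an index key to one of that index's slices."""
    digest = hashlib.sha256(f"{index_name}:{key}".encode()).hexdigest()
    return int(digest, 16) % SLICES_PER_INDEX

def replica_nodes(slice_id: int) -> list[str]:
    """Place a slice's replicas on distinct nodes (round-robin here)."""
    return [NODES[(slice_id + i) % len(NODES)] for i in range(REPLICAS)]
```

Because distribution is per index rather than per table, each secondary index can be sliced and placed independently of the base table.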
So how do we approach maintaining ACID properties, and what tradeoffs do we make to maintain them? The original discussion attempted to break apart the individual ACID properties because NoSQL solutions usually involve relaxing one or more of them. It’s a bit unnatural to think about ACID as individual properties, because they dovetail together to provide what most of us think of as a relational database, but since that’s how the original discussion was framed, this blog entry will follow suit.
To ensure atomicity, Clustrix uses a combination of two-phase locking and MVCC. This increases the amount of messaging between the nodes participating in a transaction and adds load on those nodes to process the commit protocol. That overhead is part of the cost of a distributed system, and it would limit scalability if every node participated in every transaction. Fortunately for Clustrix, that’s not the case.
Clustrix uses the Paxos consensus protocol, which requires the participants of a transaction to include the originating node, three logging nodes, and the nodes where the data is stored (a single node may serve multiple functions in a given transaction). This means the communication overhead for full table scans is quite high, whereas the simple point selects and updates that make up the majority of application traffic have a constant overhead. OLAP transactions compute partial aggregates on the node where the data is located and therefore don’t suffer from high overhead. Range operations without aggregates are expensive, and applications that make heavy use of them are probably not well suited to Clustrix.
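The participant set described above can be sketched as a simple union of roles. The function and node names are ours, purely for illustration; the point is that a point select keeps a constant-sized set while a scan grows with the number of nodes holding data:

```python
def participants(origin: str, loggers: list[str], data_nodes: list[str]) -> set[str]:
    """A single node may serve multiple functions, so take the union of roles."""
    return {origin} | set(loggers) | set(data_nodes)

# A point select touches one slice: the participant count stays constant.
point_select = participants("node1", ["node1", "node2", "node3"], ["node4"])

# A full table scan touches every node holding a slice of the table.
full_scan = participants("node1", ["node1", "node2", "node3"],
                         ["node1", "node2", "node3", "node4", "node5"])
```

Note how `node1` serving as both originator and logger collapses into a single participant, which is exactly why small transactions stay cheap.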
When we talk about consistency, we’re usually referring to ensuring that relational constraints such as foreign keys are enforced properly. Clustrix implements foreign keys as triggers, which are evaluated at commit time. Big range UPDATE and DELETE operations can hurt our performance due to locking (and the aforementioned communication overhead), but these operations are not common in most applications.
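A hedged sketch of the commit-time trigger idea: during the transaction the referential check is only recorded, and it fires when `commit()` is called. The class and method names are illustrative, not Clustrix’s implementation:

```python
class Transaction:
    def __init__(self, parent_keys: set):
        self.parent_keys = parent_keys      # committed parent-table keys
        self.pending_fk_checks: list = []

    def insert_child(self, fk_value):
        # Defer the foreign-key check rather than evaluating it inline.
        self.pending_fk_checks.append(fk_value)

    def commit(self) -> str:
        # The trigger-style checks fire at commit time.
        for fk in self.pending_fk_checks:
            if fk not in self.parent_keys:
                raise ValueError(f"foreign key violation: {fk!r}")
        return "committed"
```

Deferring the check batches the work, but it also means a large range DELETE on the parent table forces many such checks at commit, matching the performance caveat above.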
The other part of consistency in a distributed system is maintaining a quorum via the Paxos Consensus Protocol, which ensures that only clusters with the majority of the known nodes are able to take writes. In the event of multiple node failures, it’s possible for a cluster to have quorum and still have missing data if all replicas for a slice are unreachable; transactions that access these slices will fail so long as this condition exists.
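The two availability checks described above — cluster-wide quorum and per-slice replica reachability — can be sketched as follows. Function names are ours, not Clustrix internals:

```python
def has_quorum(reachable: set, known: set) -> bool:
    """Writes are allowed only with a strict majority of the known nodes."""
    return len(reachable & known) > len(known) / 2

def slice_available(replicas: list, reachable: set) -> bool:
    """A slice is usable only if at least one of its replicas is reachable."""
    return any(node in reachable for node in replicas)

known = {"n1", "n2", "n3", "n4", "n5"}
up = {"n1", "n2", "n3"}        # a majority, so the cluster keeps quorum
orphaned = ["n4", "n5"]        # but a slice with both replicas here is lost
```

This shows the failure mode in the text: the cluster as a whole still has quorum, yet transactions touching the orphaned slice must fail until a replica comes back.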
Clustrix provides MVCC isolation at the container level. Our atomicity guarantees that all applicable replicas receive a particular write before we report the transaction committed. This, for the most part, reduces the isolation problem to what you’d have in the non-clustered case.
The durability guarantee provided by Clustrix includes normal write-ahead logging (WAL) as well as replication of relational data within the cluster. The appliance version of the Clustrix SQL relational database stores the WAL on an NVRAM card, so it can sync on every commit. Many single-instance databases instead improve WAL performance by syncing at intervals rather than at each commit, but that approach is problematic in a distributed system because it makes ‘replay to where?’ a complicated question. We sidestep the issue in the appliance by providing a hard durability guarantee.
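A toy illustration of the WAL discipline: every mutation is appended to the log before in-memory state changes, so recovery can rebuild state by replay. A real system would fsync (or, as on the appliance, write to NVRAM) before acknowledging the commit; this sketch omits the durability syscall:

```python
class WAL:
    def __init__(self):
        self.log = []     # stand-in for the durable log device
        self.table = {}   # volatile state

    def put(self, key, value):
        self.log.append(("set", key, value))  # log first...
        self.table[key] = value               # ...then apply

    def recover(self) -> dict:
        """Rebuild state purely from the log, as after a crash."""
        state = {}
        for op, key, value in self.log:
            if op == "set":
                state[key] = value
        return state
```

The ‘replay to where?’ problem arises because, in a cluster, each node’s log must be replayed to a point consistent with every other node’s log, not just to its own last record.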
So What Are the Trade-Offs?
What does Clustrix give up to be a distributed system? The answer to this question depends pretty heavily on the application and its implementation details. For example, a primarily OLTP workload may have additional single query latency even as the cluster’s throughput scales nicely. That extra latency could go unnoticed or be a real deal breaker, depending on the application.
The simplest non-trivial queries on Clustrix, point selects, execute in three program fragments requiring two communication hops. On the appliance, this communication is via Infiniband (7µs latency), so the majority of the induced latency comes from marshaling the messages to send. In AWS, the inter-node latency is much higher – in the 500–600µs range – and therefore plays a much bigger role (roughly 1ms for communication latency alone). OLAP queries are much harder to estimate because Clustrix’s parallelism reduces their latency while the communication overhead from join operations increases it. In general, our customers find that their OLAP queries run with significantly less latency on Clustrix, but it’s hard to quantify that difference.
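The point-select arithmetic above can be reduced to a back-of-the-envelope model. The per-hop figures come from the text; the function itself is just illustrative:

```python
def comm_latency_us(hop_latency_us: float, hops: int = 2) -> float:
    """Network portion of a point select: two hops at the given per-hop cost.
    Marshaling cost is excluded, since it dominates only on the appliance."""
    return hops * hop_latency_us

appliance_us = comm_latency_us(7)    # Infiniband: ~14us of network time
aws_us = comm_latency_us(500)        # AWS: ~1000us, i.e. roughly 1ms
```

The model makes the deal-breaker question concrete: an application that issues many point selects serially feels the full per-hop cost on every query, while one that issues them concurrently sees throughput scale despite it.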