ClustrixDB is a MySQL-compatible ACID RDBMS that linearly scales out both writes and reads. Clustrix has created many architectural innovations over the years that have made this possible. This week we will focus on one of these innovations: the ClustrixDB Rebalancer, which automatically distributes and balances data across nodes while ensuring high availability. Before we go into the details, let’s put this into context.
With last week’s blog, we started exploring how ClustrixDB accomplishes horizontal scaling of both writes and reads without sharding. Why is this important? Because other MySQL scaling strategies have challenges with write scale. Specifically, most — like read slaves and multi-master — provide read scale much better than write scale. Similarly, while sharding is a common solution for write scale, it comes with significant application, business, and DevOps challenges.
ClustrixDB has three core functionalities that help it scale out linearly:
1. Automatic Data Distribution
As tables and indexes are created, they are automatically “sliced” and distributed across all the nodes in the cluster. This slicing uses 64-bit consistent hashing based on the data keys, so each slice’s location and contents are deterministically predictable.
2. Automatic Fan-out of SQL Queries
Each SQL query is pre-parsed, and compiled query fragments are distributed directly to the specific cluster node(s) containing the requisite data, i.e., “bringing the query to the data.”
3. Automatic Data Rebalancing
Storage full and/or node hotspots are automatically handled by reslicing and redistributing data across the cluster.
...and the Clustrix Rebalancer does two of them: automatic data distribution and automatic data rebalancing.
The ClustrixDB Rebalancer: Three Core Functions
The multi-patented ClustrixDB Rebalancer is an automated online system that maintains the healthy distribution of data in the cluster without downtime. What does that mean? In a shared-nothing clustered database, each node needs to have its own (partial) set of the data. And in a highly available (HA) system, there must be at least two separate copies of the data across the entire cluster, allowing instance failure(s) without data loss. And finally, the database must ensure that the data distribution is fine-grained enough to maximize parallelization, while avoiding excessive data rebalancing — which is a tricky task.
ClustrixDB Rebalancer: Automatic Data Distribution
A core functionality of the ClustrixDB Rebalancer is ensuring table data and indices are correctly sliced and distributed across all the nodes in the cluster. This is especially important when tables and indices are being created.
Ensuring Best Data Distribution
The easiest data distribution would be table-level, i.e., distributing different tables to each node in the cluster. This would make application queries easier to distribute, because queries to a single table would all go to the same node rather than across several nodes. However, this wouldn’t guarantee enough distribution — it would be very easy to create contention and/or hotspots. If any table suddenly had a big increase in write queries, all the increased queries would go to the same node.
Another type of data distribution is by subset ranges of the data values in the table, usually the PK. The classic case is distributing the USER table by USERNAME, so that users A-F are on node1, G-L on node2, M-S on node3, and T-Z on node4. This can also work well, unless there’s a surge of new users or traffic to users with names similar to each other, i.e., falling onto the same node, and thus creating contention and/or hotspots.
Another type of data distribution is by CREATION_DATE, so that all newest users would end up on the same logical data partition or slice. This can work very well for workloads needing rolling aggregates or roll-ups, because all the newest data would be on a single node. But if that newest data has a surge in traffic, the same kind of contention would occur.
A really fine-grained data distribution would be by row, because that would avoid contention based on ranges or dates. However, that type of data distribution has the potential of increasing complexity, both from the query side (“Which node do I access?”), and from the data-rebalancing side (“What data should go where?”). So we found a way to compensate for that.
ClustrixDB Rebalancer distributes data by row, using a combination of a distribution key that leverages a configurable amount of (composite) key fields, and is consistently hashed in a 64-bit space. This allows configurability of how “dense” each slice will be depending on the cardinality of the table. In addition, the consistent hashing allows a simple data map to predict exactly what data will reside in every slice, simplifying things for the query planner. And of course with ClustrixDB, data location is abstracted from the application, so the application never needs to know which node(s) contains the data it's accessing.
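To make the idea concrete, here is a minimal sketch of mapping a row to a slice via a 64-bit hash of a composite distribution key. This is purely illustrative — the hash function, slice count, and data-map machinery inside ClustrixDB are its own — but it shows why a slice's location is deterministically predictable: the same key always hashes to the same point in the 64-bit space, and each slice owns a fixed share of that space.

```python
import hashlib

SLICE_COUNT = 8  # illustrative; real slice counts are chosen by the Rebalancer

def hash64(*key_fields):
    """Hash a composite distribution key into a 64-bit space."""
    raw = b"\x00".join(str(f).encode() for f in key_fields)
    return int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")

def slice_for(*key_fields, slice_count=SLICE_COUNT):
    """Deterministically map a row's distribution key to a slice.

    Each slice owns an equal share of the 64-bit hash space, so a
    simple data map can predict a row's slice without asking any node.
    """
    return hash64(*key_fields) * slice_count >> 64

# Same composite key -> same slice, every time, on every node.
assert slice_for("alice", 42) == slice_for("alice", 42)
```

Because the mapping is pure arithmetic over the key, the query planner can route a point lookup straight to the owning slice, which is what lets the application stay ignorant of data location.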
Best Distribution Key
With ClustrixDB, each table’s data is distributed among its slices by the hash value of its distribution key. The distribution key is some prefix of the table’s key columns. Using fewer columns for the distribution key can make certain queries faster, because a query that would normally need to broadcast to all slices can instead be targeted to a specific slice. However, if the ratio of rows to unique values in a representation’s distribution key is high (i.e., the key has low cardinality), some slices can end up much larger than others for a single table. The ClustrixDB Rebalancer automatically corrects this imbalance as well.
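The skew problem is easy to demonstrate. The sketch below (hypothetical hash and slice count, not ClustrixDB's actual implementation) compares a high-cardinality distribution key against a low-cardinality one: with only three distinct key values, at most three slices can ever receive data, no matter how many rows the table holds.

```python
import hashlib
from collections import Counter

def slice_of(key, slice_count=4):
    """Map a key to a slice via a 64-bit hash (illustrative only)."""
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h * slice_count >> 64

# High-cardinality key: every row has a distinct value -> even spread.
even = Counter(slice_of(f"user-{i}") for i in range(10_000))

# Low-cardinality key: only 3 distinct values -> at most 3 slices
# ever receive rows, however large the table grows.
skewed = Counter(slice_of(country) for country in ["US", "DE", "JP"] * 3_333)

print(sorted(even.values()))    # four roughly equal counts
print(sorted(skewed.values()))  # all rows piled onto <= 3 slices
```

This is the imbalance the Rebalancer watches for and corrects.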
Ensuring Best Slicing
ClustrixDB vertically partitions tables and objects such as keys into “representations.” Representations are then horizontally partitioned into “slices.” When a new representation is created, the Rebalancer determines the distribution and placement of the data such that each representation has an appropriate number of slices, and an appropriate distribution key. This ensures the balance of rows across that representation’s slices, and still allows fast queries of specific slices.
Distribute Primary and Secondary Keys
And it’s not just table data that is distributed, but both primary and secondary indices as well. This allows queries to leverage indices which are co-located with the data, avoiding additional cross-node hops and JOINs, and thus maximizing performance.
ClustrixDB Rebalancer: Automatic High Availability
Another core functionality of the ClustrixDB Rebalancer is ensuring all data is redundant across the cluster. This allows the cluster to withstand the loss of node(s) without data loss (number of nodes is configurable via nResiliency). And this is especially important when a cluster is being scaled in, i.e., nodes are being removed.
When a slice has fewer replicas than desired, the Rebalancer will create a new copy of the slice on a different node. This can happen due to a storage error, or if a node leaves the cluster quorum because of a local hardware failure or network partition. This can also happen if the user chooses to change the “max_failures” setting, or what we call “nResiliency.” By default, ClustrixDB operates at max_failures = 1, meaning the cluster can tolerate a single node outage without data loss. max_failures can be increased up to ceil(number-of-nodes / 2) − 1, which interestingly enough is one less than the number of remaining nodes required by Paxos to regain quorum.
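The arithmetic behind that ceiling is worth making explicit. A Paxos quorum needs a majority of nodes, so the cluster can afford to lose at most one node fewer than that majority:

```python
import math

def max_allowed_failures(nodes):
    """Upper bound on the max_failures (nResiliency) setting.

    ceil(nodes / 2) - 1: exactly one less than the majority of nodes
    a Paxos quorum requires, so the survivors can still form quorum.
    """
    return math.ceil(nodes / 2) - 1

for n in (3, 4, 5, 9):
    print(f"{n} nodes -> max_failures up to {max_allowed_failures(n)}")
```

For example, a 9-node cluster can run with max_failures = 4: losing 4 nodes leaves 5, which is still a majority of 9.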
When a node is to be removed from the cluster, the administrator designates it as soft-failed. The Rebalancer will begin moving replicas from this node to other nodes in the cluster, ensuring that there are at least two copies (or “replicas”) of each slice on separate remaining nodes. Once the Rebalancer completes its “soft-fail” tasks, the system can “drop” the nodes, releasing them from the cluster (and no longer requiring licenses :-D).
ClustrixDB Rebalancer: Automatic Data Rebalancing
The final core functionality of the ClustrixDB Rebalancer is ensuring all data continues to be balanced across all the nodes. Working clusters typically have mixed workloads, which means over time some nodes will experience data growth and/or access patterns different than the other nodes. And most importantly, when you add additional node(s) to your cluster, the Rebalancer automatically moves data to those new node(s).
If the slices of a representation are not well distributed across the cluster, the Rebalancer will take steps to move them to more optimal locations. The Rebalancer evaluates the load as follows: each slice of a representation is assumed to exert load proportional to its share of the representation’s key-space. The representation is well distributed when the difference between the “most loaded” and “least loaded” nodes is minimal.
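That evaluation can be sketched in a few lines. In this hypothetical model (variable names are ours, not ClustrixDB's), each slice's load is its fraction of the representation's key-space, per-node load is the sum over the slices placed there, and the imbalance is the gap between the most and least loaded nodes:

```python
def node_loads(placement, keyspace_share):
    """Estimate per-node load for one representation.

    placement:      {slice_id: node holding that slice}
    keyspace_share: {slice_id: fraction of the representation's
                     key-space owned by that slice}
    """
    loads = {}
    for slice_id, node in placement.items():
        loads[node] = loads.get(node, 0.0) + keyspace_share[slice_id]
    return loads

def imbalance(loads, all_nodes):
    """Difference between the most and least loaded nodes."""
    per_node = [loads.get(n, 0.0) for n in all_nodes]
    return max(per_node) - min(per_node)

# Four equal slices, but n1 holds two of them: imbalance = 0.5 - 0.25.
placement = {0: "n1", 1: "n1", 2: "n2", 3: "n3"}
share = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
print(imbalance(node_loads(placement, share), ["n1", "n2", "n3"]))  # 0.25
```

A rebalancing move is worthwhile when it reduces this gap; when the gap is already minimal, the representation is considered well distributed.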
Disk Too Full
If any storage device in the cluster is holding more than its share of table data, the Rebalancer will take steps to move replicas from that device to an underutilized device. Before moving any replicas, the Clustrix Rebalancer computes the load imbalance of the cluster’s storage devices. If this imbalance is below a configurable threshold, the Rebalancer will leave things alone, avoiding unnecessary and/or non-performant replica moves.
When a node is to be added to the cluster, the administrator will install ClustrixDB on that node and add its IP to the cluster. The ClustrixDB Rebalancer will begin redistributing replicas to that new node in the background, allowing the new node to immediately participate in compiling, planning, routing, and aggregating queries… even though that node has no local data yet. Once the Rebalancer finishes redistributing data to the newly added node(s), the data maps are updated, and those nodes begin processing transactions on the data they now hold locally.
Slice is Too Big
Representations (including table data, and objects such as keys) are partitioned into slices, each of which is assigned a portion of the representation’s rows. If a slice becomes large, the Rebalancer will split the slice into several new slices and distribute the original slice’s rows among them to balance the data load.
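A slice split is simple to picture if you recall that each slice owns a contiguous range of the 64-bit hash space. This sketch (our own simplification, not the actual on-disk mechanism) halves a slice's range; because every row keeps its hash, each row lands deterministically in exactly one of the new slices:

```python
def split_slice(rows, lo, hi):
    """Split one slice's hash range [lo, hi) into two new slices.

    rows: {row_key_hash: row}. Rows keep their hash, so each row
    belongs to exactly one of the two new sub-ranges.
    """
    mid = (lo + hi) // 2
    left = {h: r for h, r in rows.items() if lo <= h < mid}
    right = {h: r for h, r in rows.items() if mid <= h < hi}
    return (lo, mid, left), (mid, hi, right)

# Toy slice covering hash range [0, 100) with rows at hashes 0, 10, ..., 90.
rows = {i: f"row-{i}" for i in range(0, 100, 10)}
(lo1, hi1, left), (lo2, hi2, right) = split_slice(rows, 0, 100)
print(len(left), len(right))  # 5 5
```

In the real system the new slices are then placed (and replicated) on whichever nodes best balance the cluster, using the load model described above.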
Reads from a slice are always directed to a single designated replica, known as the ranking replica. This gives the highest chance of that replica’s data being in the buffer cache on that node. If reads were distributed across all replicas of each slice, the same slice data could end up cached in multiple places, wasting the cluster’s buffer cache on duplicates. It turns out it’s more efficient to route a query across the network to a node whose cache already holds the data than to load that data from the local node’s disk. The mechanism ensuring that only a single replica of each slice is read from is the election of “ranking replicas,” governed by the Rebalancer. Even if the replicas of a representation are well distributed, the distribution of ranking replicas can be suboptimal. The Clustrix Rebalancer then adjusts replica rankings so that reads are well balanced and distributed around the cluster.
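One way to picture that adjustment is as a greedy election: for each slice, pick the replica on the node currently serving the fewest reads. This is a hypothetical sketch of the idea (ClustrixDB's actual ranking algorithm is not published in this post), but it shows how rankings can spread read load even when replica placement is fixed:

```python
from collections import Counter

def rebalance_rankings(replicas):
    """Greedily elect one ranking replica per slice so read load
    (one unit per slice) spreads across nodes.

    replicas: {slice_id: [nodes holding a copy of that slice]}
    Returns {slice_id: node elected as the ranking replica}.
    """
    reads_per_node = Counter()
    rankings = {}
    for slice_id, nodes in sorted(replicas.items()):
        # Choose the least-loaded node that actually holds a replica.
        best = min(nodes, key=lambda n: reads_per_node[n])
        rankings[slice_id] = best
        reads_per_node[best] += 1
    return rankings

# Four slices, two replicas each, across three nodes.
replicas = {0: ["n1", "n2"], 1: ["n1", "n2"], 2: ["n1", "n3"], 3: ["n2", "n3"]}
print(rebalance_rankings(replicas))
```

Every read for a given slice keeps hitting the same elected node, so that node's buffer cache stays warm for that slice while the overall read load stays spread out.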
ClustrixDB Rebalancer: Automatically Handling Data Distribution, High Availability, and Data Rebalancing
The ClustrixDB Rebalancer takes away the pain of data distribution — at table creation, during table growth, and as table usage patterns shift — keeping the cluster smoothly balanced.
As an online process, the Rebalancer effects changes to the cluster with minimal interruption to user operations, relieving the cluster administrator of the burden of manually manipulating data placement. Compare this to the equivalent task for sharded systems, “resharding”: a painful job for MySQL DevOps that typically requires downtime for some of the shards, is manually intensive, and yet is critical to keeping a sharded system running well.
ClustrixDB Rebalancer takes that pain away by doing the job for you automatically.