Scaling Distributed Joins

Distributed Joins

A frequent question we get about Clustrix involves our ability to efficiently evaluate distributed joins. We face quite a bit of skepticism when we make the claim that we can scale joins in our system. And people are right to be incredulous: many other systems claiming to be shared nothing and massively parallel failed to solve the problem. However, the Clustrix approach is different in several key ways, allowing us to fundamentally solve the problem.

When building a scalable distributed database capable of handling a wide variety of workload types, one must solve two fundamental problems:

  • Data distribution within the system
  • Efficient query evaluation on top of a distributed data model

Clustrix addresses both without compromise.

The first key difference between Clustrix and other systems is our data distribution model. Unlike other approaches, such as sharding, Clustrix provides a much more flexible way of distributing data. Let’s compare the Clustrix approach to sharding and traditional data partitioning used by other shared nothing systems. Take the following schema as an example:

CREATE TABLE `users` (
`user_id` int(11) NOT NULL,
`group_id` int(11) DEFAULT NULL,
PRIMARY KEY (`user_id`),
KEY `group_id` (`group_id`)
);

Let’s assume that the table holds billions of rows, far exceeding any single system’s ability to hold the entire data set. In a common approach, we would choose a single key on which to split our data. The obvious choice is the user_id column. So we split the data into three shards based on a hash of user_id, which gives us a nice, even distribution of users across our three-node cluster.

Example 1: Sharded Distribution Model

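The hash-based placement just described can be sketched in a few lines of Python. This is an illustration of the general technique only; the hash function and node count are assumptions, not Clustrix internals:

```python
# Illustrative hash-based sharding (not Clustrix's actual scheme).
import hashlib

NUM_NODES = 3  # the three-node cluster from the example

def shard_for(user_id: int) -> int:
    """Map a user_id to its owning node by hashing the distribution key."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# A primary-key read is a single message to one known node:
owner = shard_for(12345)

# Hashing spreads users roughly evenly across all three nodes:
counts = [0] * NUM_NODES
for uid in range(30_000):
    counts[shard_for(uid)] += 1
```

Because the hash of user_id fully determines placement, a lookup by user_id never needs to touch more than one node.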
In order to get a row for a particular user, the system knows exactly which node holds the relevant row, because we chose user_id as our distribution key. However, the model breaks down when we need to look up rows by group id. The secondary index on group_id is co-located with the primary data (i.e. it is distributed by user_id, not by group_id), so we have no choice but to query all nodes for the relevant row(s). We know exactly which node holds a given user_id, but we have no idea which node holds a given group_id. We must broadcast our request for data.
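The asymmetry between the two access paths can be made concrete with a toy model. The shard contents and the modulo placement below are invented purely for illustration:

```python
# Toy model: primary-key lookup vs. secondary-key broadcast.
# Shard layout is invented; each dict maps user_id -> group_id.
shards = [
    {3: 10, 6: 10},  # node 0 holds user_ids where user_id % 3 == 0
    {1: 10, 4: 20},  # node 1
    {2: 20, 5: 30},  # node 2
]

def lookup_by_user(uid):
    # One message: the distribution key pinpoints the owning node.
    return shards[uid % 3].get(uid)

def lookup_by_group(gid):
    # Broadcast: group_id says nothing about placement, so every
    # node must be asked. Message count equals node count.
    return sorted(uid for shard in shards
                  for uid, g in shard.items() if g == gid)
```

Here `lookup_by_user(4)` touches exactly one node, while `lookup_by_group(10)` must interrogate all three to find users 1, 3, and 6.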

In a multi-way join situation across large data sets, the approach falls flat on its face. And it’s not hard to see why. Let’s examine the following example:

CREATE TABLE `tableX` (
`a` int(11) NOT NULL,
`b` int(11) DEFAULT NULL,
`c` int(11) DEFAULT NULL,
`d` varchar(255) DEFAULT NULL,
PRIMARY KEY (`a`),
KEY `b` (`b`),
KEY `c` (`c`),
KEY `d` (`d`)
);

We create six such tables (table1 through table6), each holding a million rows and each an exact copy of the others. Every column holds unique values, so a join on any column preserves cardinality.

Let’s see what happens when we evaluate the following query against our data set:

select 1
from table1 t1
join table2 t2 on t1.b = t2.b
join table3 t3 on t2.b = t3.b
where t1.b + t2.a + t3.a < 0
;

Since our tables are exact copies of each other, and the joins preserve cardinality, one million rows flow through each join step, only to be discarded by the final where clause. The construction lets the test return 0 rows while still forcing the database to perform the full join.

Notice how the query joins on secondary keys, a very common construct in real-world workloads. Recall our users table example: in order to look up a row based on a secondary key, we must broadcast to all the nodes that hold the data. For every row we read from table1, we must send N messages (one per node) to find the matching rows in the next table, and so forth. At every join stage, for every node, for every row, we must send a broadcast request to find the next row. And increasing the number of nodes has the opposite effect from what we want: instead of improving performance, it degrades it, because the number of broadcast messages grows with the node count.
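The blow-up is easy to put numbers on. Assuming, as in the example, that every row survives every join stage, a back-of-the-envelope model of the broadcast cost looks like this (illustrative arithmetic, not measured figures):

```python
# Back-of-the-envelope broadcast cost for the sharded join example.
def broadcast_messages(rows: int, nodes: int, join_stages: int) -> int:
    # Every row at every join stage is broadcast to every node.
    return rows * nodes * join_stages

ROWS = 1_000_000  # rows flowing through each join step in the example

three_nodes = broadcast_messages(ROWS, 3, 2)   # t1->t2 and t2->t3 stages
ten_nodes = broadcast_messages(ROWS, 10, 2)
# Growing the cluster from 3 to 10 nodes more than triples the traffic.
```

This is why adding hardware makes the sharded join slower, not faster: the per-query message count scales with the node count.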

Additionally, we must consider the query evaluation model. Distributed databases fall into two broad categories: (1) centralized evaluation or (2) distributed, parallel evaluation. In the online transactional space, almost every Clustrix competitor falls into the centralized evaluation category (even if they claim otherwise).

In practical terms, centralized evaluation means that a single node within the cluster is responsible for evaluating a particular query. In a join case, this means that all rows relevant to a join come back to a single node for processing.

Combining a centralized evaluation model with a single-key (sharded) data distribution model produces very inefficient query resolution.
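A toy model makes the bottleneck visible: under centralized evaluation the coordinator's workload is independent of cluster size, while distributed evaluation divides the same work across all nodes. The numbers here are invented for illustration only:

```python
# Toy comparison of the two evaluation models (invented numbers).
def coordinator_rows(rows_per_stage: int, stages: int) -> int:
    # Centralized: every intermediate row funnels through one node,
    # no matter how many nodes hold the data.
    return rows_per_stage * stages

def per_node_rows(rows_per_stage: int, stages: int, nodes: int) -> int:
    # Distributed: each node processes only its share of each stage.
    return rows_per_stage * stages // nodes

central = coordinator_rows(1_000_000, 2)   # 2,000,000 rows through one node
spread = per_node_rows(1_000_000, 2, 10)   # 200,000 rows per node
```

In the centralized model, adding nodes leaves `central` unchanged; in the distributed model, `spread` shrinks as the cluster grows.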

As you can see from the above example, the Clustrix approach eliminates a single node bottleneck and removes the need for broadcasts. The following charts demonstrate how the two approaches translate into real-world query performance.

As you can see from the chart above, MySQL Cluster shows a very poor scalability curve: as we increase the number of joins, query resolution times increase non-linearly. This is the result of a centralized evaluation bottleneck coupled with a message-complexity scaling problem.

The same example conducted on Clustrix yields substantially different results. Most importantly, notice the scaling curve.

The green line represents perfect linear scaling of execution time as the number of distributed joins increases. With the Clustrix v4.1 software release, however, Clustrix manages to stay at a constant time, as seen in the orange line. How is this possible? The answer lies in the parallel evaluation engine introduced in v4.1. Not only does Clustrix split evaluation across multiple nodes within the cluster, it also parallelizes evaluation across multiple processors within each node, so every additional join stage gets its own execution core.
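The per-core pipelining can be pictured with a toy Python pipeline in which each "join stage" runs on its own worker thread and rows stream between stages through queues. This is only an analogy for the idea, not Clustrix's engine:

```python
# Toy pipeline: each stage on its own worker, rows streaming between
# them via queues -- loosely analogous to one core per join stage.
import queue
import threading

def stage(fn, inbox, outbox):
    while True:
        row = inbox.get()
        if row is None:        # sentinel: propagate shutdown downstream
            outbox.put(None)
            return
        outbox.put(fn(row))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda r: r + 1, q0, q1)).start()
threading.Thread(target=stage, args=(lambda r: r * 2, q1, q2)).start()

for row in range(5):           # feed rows into the first stage
    q0.put(row)
q0.put(None)

results = []
while (row := q2.get()) is not None:
    results.append(row)
# Both stages run concurrently; a row moves to the next stage as soon
# as it is produced, rather than waiting for the whole batch.
```

Adding a stage to this pipeline adds a worker rather than lengthening any existing worker's job, which is the intuition behind the flat orange line.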

Clustrix is the first truly general-purpose transactional database system with horizontal scale across a wide variety of workloads. Our unique data distribution model, coupled with a fully parallel and decentralized evaluation model, allows us to scale Big Data applications without sacrificing features you’ve come to expect from relational databases.

For more information, download our free white paper Driving the New Wave of High Performance Applications: How NewSQL Delivers Speed, Scale and Simplicity.