SQL Can Scale

The Relational Database Meets Cloud-Scale

ClustrixDB proves that relational databases can support  cloud-scale workloads, maintain low latency, high-concurrency, and stay available  during hardware failures.

ClustrixDB rethinks how relational databases can deliver the cloud-scale deployment experience and support the massive transaction volumes that modern applications need. Built from the ground up to scale linearly, operate on any cloud, and out-perform its predecessors, ClustrixDB is not your grandfather’s RDBMS.

 

How ClustrixDB Scales Out

Designed to support high-value, high transaction workloads with low-latency, ClustrixDB is different from your standard relational databases in that it uses a shared-nothing memory and disk architecture, automatic data distribution across all the servers, and automatic parallelization of queries to scale-out SQL. Optimized for performance, ClustrixDB also shifts queries to where the data is in the cluster rather than moving the data around, all of which allowing near-linear scale as your RDBMS cluster grows.

Shared-Nothing Architecture

ClustrixDB is a scale-out relational/SQL database designed with a shared-nothing architecture, the only architecture proven to scale near linearly as you add nodes. Popular cloud-scale data stores, like Hadoop and Redis, can scale-out because they leverage shared-nothing architectures as well. However, these data stores are not databases, and fall short in handling the structured data and cross-node ACID transactions needed in high-value transactions. Traditional relational databases such as MySQL and Aurora leverage single-master, and/or shared-disk architectures, which keep them from scaling-out writes.

The key characteristic of a shared-nothing architecture is that every node owns part of the data, evenly dividing responsibility for reading and writing to the data, and reducing contention.

ClustrixDB also employs independent index distribution, on both primary and secondary keys, using consistent hashing. This allows each node to know exactly on which node the required data resides, without recourse to any kind of ‘leader’ node. All data is only a single hop from the querying node, significantly reducing broadcasts.

In addition, every node has a query compiler and can accept queries. The database engine processes local data within each node. The data map is replicated on all nodes and knows where each primary and secondary key lives.

 

 

Intelligent Data Distribution

ClustrixDB automatically distributes the data across the cluster–all tables are sliced using consistent hashing, and distributed across all the cluster nodes. Slices are similar to shards, but finer grained and managed completely by the patented ClustrixDB Rebalancer. There are a minimum of one mirror (or ‘replica’) of every slice, allowing for high-availability. The ClustrixDB Rebalancer ensures that the data and the workload are distributed evenly across the cluster, operating in the background and avoiding query performance degradation.

The slicing done by ClustrixDB is superior to sharding and other sharding-like approaches used by other databases. Tables are sliced both horizontally and vertically—hash-distributed ranges of rows are distributed across the nodes, as well as different ‘representations’ of tables (E.g., by primary key, by secondary key, by coverage indexes). This fine-grained distribution both maximizes the parallelism of the queries, as well as minimizes hotspots at the storage layer. It is important to remember with sharding each table partition is on its own RDBMS, requiring the application to maintain ACID guarantees for cross-node transactions. With ClustrixDB, the application sees a single logical database, no matter the number of database nodes. Cross-node transactions are always ACID compliant, and referential integrity is always maintained automatically.

 

 

Distributed Query Processing

ClustrixDB brings the query to the data, not the other way around. This approach minimizes data movement across the cluster.

The SQL language wasn’t designed to be multi-threaded. For relational databases using a single write master, this isn’t an issue. With ClustrixDB, since each node in the cluster can accept both writes and reads, we needed to parallelize SQL by breaking the queries into component functions, compiling them, and then distributing them across the cluster.

ClustrixDB distributes these compiled query fragments to the server which contains your data, does the operations where the data already is, and then returns the result to you. For example, if you make a request, it lands on ‘server A’, but if server A doesn’t have your data, we dispatch that request to ‘server B’ or ‘server C’ which has the data, and only return the specific data you want, back to ‘server A’. This allows us to parallelize work across multiple systems and really scale-out the database.

As the number of queries grows, data motion across the cluster is actively minimized, and database operations are evenly distributed automatically, allowing ClustrixDB to scale linearly. This strategy also ensures that only one node is trying to write to any piece of data, thus reducing contention, and opening the ClustrixDB to support massive concurrency while maintaining referential integrity and ACID guarantees across every transaction.