ClustrixDB Query Optimization

query optimization

In this blog we will focus on how ClustrixDB performs query optimization. One of the reasons ClustrixDB exists is because multiple people had asked our founders if RDBMSs can scale out. One of our founders had previously created a scale-out NoSQL database, and he was up for the challenge. Specifically, the ClustrixDB RDBMS had to provide linear scale out of both writes and reads, maintaining relational semantics including ACID transactionality and referential integrity, without sharding and its complications.

As described before, ClustrixDB is able to to linearly scale out via a combination of the following:

  • Horizontal Scaling Secret #1: Automatic Data Distribution
  • Horizontal Scaling Secret #2: Automatic Fan-out of SQL Queries
  • Horizontal Scaling Secret #3: Automatic Data Rebalancing

The multi-patented ClustrixDB Rebalancer is one of the core components providing #1 and #3 above.

Horizontal Scaling Secret #2, Automatic Fan-out of SQL Queries, is handled by the Query Optimizer, and the Query Evaluation Model. We’ll drill into the details of the Query Optimizer this time, and discuss the Query Evaluation Model in an upcoming blog.

What is the ClustrixDB Query Optimizer?

ClustrixDB Query Optimizer executes each query with maximum parallelism and many simultaneous queries with maximum concurrency. ClustrixDB query optimization leverages a distributed query planner, compiler, and distributed shared-nothing execution engine as well for linear horizontal scalability.

The ClustrixDB Query Optimizer is modeled on the Cascades query optimization framework, which is known for leveraging the following:

  • Cost-based Optimization (“CBO”)
    • CBO leverages dynamic estimation of query execution costs, based on statistics of a table’s usage patterns and cardinality. These statistics can be refreshed by running “ANALYZE {TABLE_NAME}.”
  • Extensible Via a Rule Based Mechanism
    • Before CBO, rule-based query optimization (RBO) was the norm. Once CBO gained prominence, RBO fell by the wayside. By selectively allowing deterministic rules, the ClustrixDB Query Optimizer gains the “best of both worlds.”
  • A Top-down Approach
    • In the top-down approach, the design process starts with specifying the global system state and assuming that each component has global knowledge of the system. In the bottom-up approach, on the other hand, the design starts with specifying requirements and capabilities of individual components, and the global behavior is said to emerge out of interactions among constituent components and between components and the environment. For query optimization, by starting with the “big picture,” the top-down approach is better suited to avoid over-complexity and overly expensive query costs.
  • Able to Separate Logical vs. Physical Operators and Properties
    • The Logical model describes what is to be computed, and the Physical model describes how it is to be computed.
  • A Branch-and-Bound Pruning Approach
    • This is an algorithm design paradigm for discrete and combinatorial optimization problems, consisting of a systematic enumeration of candidate solutions by means of state space search. The algorithm explores branches of this search tree, representing subsets of the solution set. Branches are checked against upper and lower estimated bounds on the optimal solution, and discarded if they cannot produce a better solution than the best one found so far by the algorithm.

Because SQL is a declarative language that describes what is to be computed but not how, the ClustrixDB Query Optimizer is tasked with translating SQL into actual algorithms which can be computed. This translation is critical to the performance of the entire system.

For example, if a SQL attempts to JOIN three tables and compute an aggregate operation, the ClustrixDB Query Optimizer must answer the following questions:

  1. In what order should the tables be joined? This can be the difference between the query executing in 1ms or 10 minutes. If, for example, the predicate on one of the tables causes it to return no rows, starting the read from that table is likely to be optimal and fast.
  2. Which indexes should be used? Not using a proper index on a JOIN constraint could be catastrophic, causing broadcast messages and full reads of the second table for each row of the first.
  3. Should the sort/aggregate be non-blocking? Should the sort/aggregate be done in stages, i.e. first on separate nodes and then re-aggregate/re-sort later?

These permutations create a set of query plans known as Search Space. ClustrixDB Query Optimization explores the Search Space determining which plan uses the least amount of database resources. The most common method is to assign costs to each plan, and choose the least expensive.

ClustrixDB Query Optimizer, like many modern query optimization tools based on Cascades, is functionally split into the Model and Search Engine. The Model lists the equivalence transformations (rules) used by the search engine to expand the search space. The Search Engine defines the interfaces between the search engine and the model. It expands the search space, searches for the optimal plan, and is implemented by the Tasks stack waiting to be computed.

ClustrixDB Query Optimization Core Components

The Logical model describes what is to be computed and Physical model describes how it’s to be computed.

An expression consists of:

  • An operator (required)
  • Arguments (possibly none)
  • Inputs (possibly none)

Arguments describe particular characteristics of the operator. There are both logical and physical operators. Every logical operator maps to one or more physical operators.

Physical properties are related to intermediate results, or sub-plans. They describe things like how the data is ordered and how the data is partitioned. It is important to note that logical or physical expressions and groups (see below) do not have physical properties. However, every physical expression has two descriptions related to physical properties:

  1. What physical properties can and can’t be provided to the parent?
  2. What physical properties are required for the input?

Groups correspond to intermediate tables, or equivalent sub-plans of the query. Groups are logical and contain the following:

  1. All the logically equivalent expressions that describe that intermediate table
  2. All the physical implementations of those logical expressions
  3. Winners: A physical expression that had the best cost given a set of physical properties
  4. Logical properties: Which columns and statistics about some columns it is required to produce
  5. Groups are the fundamental data structure in the ClustrixDB Query Optimizer; inputs to operators are always groups (indicated by group #’s), and every expression corresponds to some group

Memo: To perform query optimization, the ClustrixDB Query Optimizer keeps track of the intermediate tables that can be used in computing the final result table while optimizing. Each of these corresponds to a group and the set of all groups for a plan defines the memo. The memo is designed to represent all logical query trees and physical plans in the search space for a given initial query. The memo is a set of groups with one group designated as the final (or top) group. This is the group corresponding to the table of results from the evaluation of the initial query. The Query Optimizer has no explicit representation of all possible query trees. Instead, this memo represents a compact version of all possible query trees.

Rules: The model’s rule set defines the logical and physical search space of the optimizer. The memo is expanded to encompass the full logical and physical search space through the application of rules. The application of rules is a multi-step process of finding a binding in the memo, evaluating a rule condition (if the rule has one) and (if the condition is satisfied) firing the rule, which puts new expressions in the memo. When all rules have been applied, the memo structure will have been expanded to where it conceptually represents the entire search space.

Tasks: There are tasks waiting on a stack to be executed at any point in time during query optimization. Each task frequently pushes additional tasks onto the stack to achieve its goal. The Query Optimizer is done computing once the stack is empty. It begins by taking an input tree and constructing the corresponding initial groups and expressions. It then starts off the search engine by pushing the task OPTIMIZE_GROUP (top group). This causes a chain of events that explores the entire search space, finds the cheapest winners for each group, and finally chooses the cheapest winner in the top group to be its output plan.

Cost Model: The Query Optimizer cost plans use a combination of I/O, CPU usage, and latency. Because ClustrixDB is distributed, total CPU usage and latency are not proportional. Every operator describes a function to compute its costs given its inputs. The optimizer chooses the optimal plan for query optimization by finding the plan with the cheapest cost. Cost is strongly dependent on how many rows the optimizer thinks are going to be flowing through the system. The job of the row estimation subsystem is to take statistical information from our Probability Distributions (“PDs”) and compute an estimated number of rows that will come out of a given expression.

Distributed Considerations: One of the special things about the ClustrixDB Query Optimizer is its ability to optimally process distributed operations. There are two ways to compute an aggregate: non-distributed and distributed.


Non-distributed aggregates:

  1. Read table Foo, which likely has slices on multiple nodes.
  2. Forward those rows from each separate node to the transaction’s originating node.
  3. Insert those rows into a hash container on the originating node, computing the aggregate operation if necessary.
  4. Read the container.
  5. Output to the user.


query optimization image 1


Distributed aggregates:

  1. Compute the sub-plan (under the aggregate), which likely has result rows on multiple nodes.
  2. On each node containing requisite data, insert those rows into a local hash container, computing any necessary aggregate operations locally.
  3. Read the hash container (results) on each node and forward each set of results to the original node.


query optimization image 2


If necessary:

  1. Insert all those rows into a new final hash container, computing the aggregate operation if necessary.
  2. Read that hash container.
  3. Output rows to the user.

The ClustrixDB Query Optimizer must determine which one is better and when. It turns out the gains from distributing the aggregate actually come from potentially sending a lot less data across the network (between nodes). The additional overhead of extra inserts and containers is less than the network latency gains when the reduction factor of the aggregate operation is large.

The Query Optimizer is able to calculate this with the cost model and determine the better approach for any query, which is part of the Distributed Aggregates feature.

ClustrixDB Query Optimizer: Automatically Parallelizing SQL

The ClustrixDB Query Optimizer automatically converts SQL written to work on single-node, single-master MySQL and parallelizes it. Each SQL query is broken into its component parts, and the compiled parts, i.e. functions, are sent directly to the node(s) on which the requisite data resides. This enables ClustrixDB to execute each query with maximum parallelism; data isn’t moved to a central node to be processed, only the results sets of queries run on each node are moved across the wire for final aggregation. Similarly, less complex queries not needing significant amounts of fan-out are run simultaneously and with maximum concurrency.

This query optimization via parallelization of queries as well as minimization of data movement ensures ClustrixDB linearly scales out both writes and reads with each added node.


Here’s some further reading if you want to learn more:


Follow Clustrix

twitter      linkedin      facebook      youtube


Visit our resources or documentation
page for further reading.