The Clustrix Database Query Optimizer: An Intro to Sierra

Meet Sierra, the Clustrix Database Query Optimizer

SQL query optimization is well known to be a difficult problem. At Clustrix, when we began to architect a new type of distributed database, we knew the query optimizer would be critical to system performance. The distributed nature of Clustrix allows us to produce better execution plans than non-distributed systems, but it also adds more complexity to the optimizer as it now has to reason about how data is distributed in the cluster.

The Clustrix query optimizer is called Sierra, which is a cost-driven, extensible optimizer that is a never-ending project. It is modeled off of the Cascades framework that is also used by MS SQL Server and Tandem’s NonStop SQL product.

Sierra works with a simple query:

CREATE TABLE `employees` (
`employee_id` int(11) NOT NULL,
`name` varchar(256) CHARACTER SET utf8,
`salary` int(11) DEFAULT NULL,
PRIMARY KEY (`employee_id`),
KEY `idx_salary` (`salary`)
)

select * from employees where salary < 50000;

This is the classic problem of index seek vs. table scan. We could either: A) Scan all the rows of the base representation and filter rows that fit the predicate, “salary < 50000” or B) Scan only the rows that fit the predicate from the idx_salary representation and then lookup the missing columns from the base representation (the missing column being `name`).

On Clustrix, plan B is nothing more than a join of the two representations with a primary key equality as the join predicate. Sierra’s high-level steps are:

  1. Take the original query and expand the search space to include plan A and B via transformation rules
  2. Find the cardinality of the table and calculate the selectivity of the predicate
  3. Cost both plans with this statistical information
  4. Choose the cheaper plan

Clustrix performs step two by querying per-representation statistics. Selectivity is a critical component of a query optimizer’s ability to find a good plan. Clustrix built a statistics-capturing framework that updates on the fly without interrupting the throughput of the system. More on this in a future post.

In step three, Sierra costs the plan using a combination of I/O, CPU usage, and latency. Remember that Clustrix is distributed, so total CPU usage and latency are not proportional. Cost is a tricky measure; if we ignore latency in the cost factor we can maximize the throughput of the system under heavy load. If we use latency as the cost factor, we will provide more responsive queries when the system is not loaded. A mixture seems best in practice, but we are continuing to tune this metric and have placed knobs in the system so we can adjust to specific customers.

Let’s take a closer look at steps three and four. Here are the two plans with the same predicate. We can compare the costs of the plans by forcing the planner to use an index via a hint.

Plan A

 clustrix>explain select * from employees where salary < 50000;
 +------------------------------------------------------+-----------+-----------+
 | Operation                                            | Est. Cost | Est. Rows |
 +------------------------------------------------------+-----------+-----------+
 | filter (1.salary < param(0))                         |  31003.90 |  12442.00 | 
 |   index_scan 1 := employees.__idx_employees__PRIMARY |  30003.90 |  50000.00 | 
 +------------------------------------------------------+-----------+-----------+
 2 rows in set (0.00 sec)

Plan B

 clustrix>explain select * from employees use index (idx_salary) where salary < 50000;
 +-----------------------------------------------------------------------------------+-----------+-----------+
 | Operation                                                                         | Est. Cost | Est. Rows |
 +-----------------------------------------------------------------------------------+-----------+-----------+
 | nljoin                                                                            |  63458.10 |  12442.00 | 
 |   index_scan 1 := employees.idx_salary, salary < param(0)                         |   7469.10 |  12442.00 | 
 |   index_scan 1 := employees.__idx_employees__PRIMARY, employee_id = 1.employee_id |      4.50 |      1.00 | 
 +-----------------------------------------------------------------------------------+-----------+-----------+
 3 rows in set (0.00 sec)

For this predicate, the optimizer determines plan A costs 31K and plan B costs 63K, and decides to choose plan A. This is because the predicate wasn’t all that selective, it was 12442/50000 ~= 25%. In the execution of plan B, when the predicate is not that selective, a lot of rows will be joined from the idx_salary representation to the base representation. This is pretty costly compared to plan A, where we just scan the base representation and filter rows before completing. Check out the blog post Scaling Distributed Joins by Sergei Tsarev, our Co-Founder and CTO, to learn more about how Clustrix executes joins.

Let’s see what happens with a more selective predicate, “salary < 41000.” Note that we have to use “use index” on plan A instead of plan B this time around:

Plan A

 clustrix>explain select * from employees use index (primary) where salary < 41000;
 +------------------------------------------------------+-----------+-----------+
 | Operation                                            | Est. Cost | Est. Rows |
 +------------------------------------------------------+-----------+-----------+
 | filter (1.salary < param(0))                         |  31003.90 |   1336.00 | 
 |   index_scan 1 := employees.__idx_employees__PRIMARY |  30003.90 |  50000.00 | 
 +------------------------------------------------------+-----------+-----------+
 2 rows in set (0.00 sec)

Plan B

clustrix>explain select * from employees where salary < 41000;

 +-----------------------------------------------------------------------------------+-----------+-----------+
 | Operation                                                                         | Est. Cost | Est. Rows |
 +-----------------------------------------------------------------------------------+-----------+-----------+
 | nljoin                                                                            |   6817.50 |   1336.00 | 
 |   index_scan 1 := employees.idx_salary, salary < param(0)                         |    805.50 |   1336.00 | 
 |   index_scan 1 := employees.__idx_employees__PRIMARY, employee_id = 1.employee_id |      4.50 |      1.00 | 
 +-----------------------------------------------------------------------------------+-----------+-----------+
 3 rows in set (0.00 sec)

This one is quite a bit more selective – it’s 1336/50000 ~= 2%. In this case, plan A costs 31K and plan B costs 6.8K, so the planner chooses plan B. In effect, Sierra realizes that not as many rows need to be joined from the two representations with this predicate and decides that the plan B join is cheaper.

Though Sierra is a powerful query optimizer, it’s not complete and never will be. This is why Clustrix decided on an extensible framework. In fact, Clustrix just recently introduced distributed aggregates to the system with minimal disturbance to the query optimizer.

 

Read more about ClustrixDB query optimization.

 

Follow Clustrix on Twitter (@Clustrix), Facebook, and LinkedIn.

Visit our resources or documentation page for further reading.