A lot of the responses to my previous posts criticized my choice of comparing a document database like MongoDB to a relational database like Clustrix. I tried to examine aspects of the database architecture which have nothing to do with the data model, but somehow the data model would always come up. There is a set of concerns that’s common to all database systems, whether you are a document store or a relational database.
But first, think about how your database would handle the following workload. Choose whatever data model you find best suited for the use case.
get me all the records from foo where a = ? and b = ?
foo has an index over a and another index over b.
a and b have non-uniform data distributions: a is 90% value "X" and b is 90% value "Y".
We have 1 billion records in foo.
50% of the queries are a = "X" and b = "Z"; the other 50% are a = "Z" and b = "Y".
The database has a choice for each query: use the index on a, use the index on b, or scan the table and filter.
What does your database do?
If the database always chooses the index on a (or always the index on b) for every query, it ends up examining 900 million rows 50% of the time: the index on a is nearly useless when a = "X", and the index on b is nearly useless when b = "Y".
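To make the arithmetic concrete, here is a toy cost model in Python. The numbers come straight from the workload above, except the selectivity of the rare value "Z", which is my own assumption (nothing here reflects Clustrix internals):

```python
# Hypothetical cost model for the workload above. The 0.00001
# selectivity for the rare value "Z" is an illustrative assumption.
TOTAL_ROWS = 1_000_000_000

# Fraction of rows matching each predicate, from the stated skew:
# 90% of column a is "X", 90% of column b is "Y".
SELECTIVITY = {
    ("a", "X"): 0.90, ("a", "Z"): 0.00001,
    ("b", "Y"): 0.90, ("b", "Z"): 0.00001,
}

def rows_examined(index_col, query):
    """Rows touched if we drive the query off the index on index_col."""
    value = query[index_col]
    return int(TOTAL_ROWS * SELECTIVITY[(index_col, value)])

# The two query shapes, each 50% of the workload.
queries = [{"a": "X", "b": "Z"}, {"a": "Z", "b": "Y"}]

for q in queries:
    via_a = rows_examined("a", q)
    via_b = rows_examined("b", q)
    best = "a" if via_a <= via_b else "b"
    print(q, f"index a: {via_a:,}", f"index b: {via_b:,}", f"-> index {best}")
```

Run it and the asymmetry jumps out: each query shape has one index that touches 900 million rows and one that touches a few thousand, and the winning index flips between the two shapes. A planner locked into either index loses badly on half the workload.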
We’ll come back to my question later. First, I will briefly describe what the database stack in Clustrix looks like. I need to set a foundation so that my next set of posts makes sense.
To anyone with DBMS experience, the stack should look very familiar. We maintain fairly strict interface boundaries between the layers of the stack.
The Protocol Handler and Query Parser are responsible for accepting user connections and translating SQL into our intermediate representation, called Sierra. We can actually support multiple query dialects at these two layers; the constraint is that we must be able to express the query language in Sierra, and Sierra is much more expressive than SQL. The value of Sierra is that it provides an extensive planner framework for reasoning about distributed database queries.
So the Planner/Optimizer only accepts Sierra. It searches a space of possible plans and prunes them based on cost estimates. After settling on the best candidate plan, it translates the plan into another intermediate representation used by the Distributed Compiler, which works out the physical execution plan for the query. Finally, we compile the query into machine code and execute it.
As a performance optimization, we cache the compiled programs and plans. Clustrix does not need to optimize and compile every query, and you don’t need to use prepared statements to get this behavior.
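The core idea behind that caching can be sketched in a few lines, assuming a cache keyed by a normalized query fingerprint (literals stripped out so that the same query shape with different parameter values hits the same entry). This is an illustration of the technique, not Clustrix's actual implementation:

```python
import re

class PlanCache:
    """Toy plan cache: optimize and compile once per query *shape*."""

    def __init__(self):
        self._cache = {}
        self.compilations = 0

    @staticmethod
    def fingerprint(sql):
        # Replace numeric and string literals with '?' so
        # "a = 5" and "a = 9" share one compiled plan.
        return re.sub(r"\b\d+\b|'[^']*'", "?", sql.lower())

    def get_plan(self, sql):
        key = self.fingerprint(sql)
        if key not in self._cache:
            self.compilations += 1  # the expensive step: plan + compile
            self._cache[key] = f"compiled plan for: {key}"
        return self._cache[key]

cache = PlanCache()
cache.get_plan("SELECT * FROM foo WHERE a = 5 AND b = 7")
cache.get_plan("SELECT * FROM foo WHERE a = 9 AND b = 2")  # cache hit
print(cache.compilations)  # -> 1
```

Because the fingerprinting happens inside the database, the client gets this benefit on plain ad-hoc SQL, which is why prepared statements aren't required.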
Back to my question. What did you come up with? I can tell you what happens in Clustrix. During Sierra's planning phase, we examine the statistics the database keeps about the data distribution in our indexes. These statistics track:
- The number of distinct values across index columns
- Quantile distributions
- A hotlist of the top n values within a column
Armed with those statistics, Clustrix chooses the plan that examines the fewest rows for any given set of input parameters.
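Here is a sketch of how two of those statistics, a hotlist and a distinct-value count, feed a per-query index choice. The class and all the numbers are hypothetical; the point is that knowing a value is hot (or not) is exactly what lets the planner flip between indexes:

```python
class ColumnStats:
    """Toy per-column statistics: row total, distinct count, hotlist."""

    def __init__(self, total_rows, distinct, hotlist):
        self.total_rows = total_rows
        self.distinct = distinct
        self.hotlist = hotlist  # value -> row count, for the top-n values

    def estimated_rows(self, value):
        """Estimate rows matching `column = value`."""
        if value in self.hotlist:
            return self.hotlist[value]  # tracked count for hot values
        # Values outside the hotlist: assume the remaining rows are
        # spread uniformly over the remaining distinct values.
        remaining_rows = self.total_rows - sum(self.hotlist.values())
        remaining_distinct = max(self.distinct - len(self.hotlist), 1)
        return remaining_rows // remaining_distinct

# Illustrative stats for the workload from earlier: 1 billion rows,
# 90% of a is "X", 90% of b is "Y"; distinct counts are made up.
stats_a = ColumnStats(1_000_000_000, distinct=100_001,
                      hotlist={"X": 900_000_000})
stats_b = ColumnStats(1_000_000_000, distinct=100_001,
                      hotlist={"Y": 900_000_000})

def choose_index(a_value, b_value):
    """Drive the query off whichever predicate matches fewer rows."""
    if stats_a.estimated_rows(a_value) <= stats_b.estimated_rows(b_value):
        return "a"
    return "b"

print(choose_index("X", "Z"))  # hot a, selective b -> "b"
print(choose_index("Z", "Y"))  # selective a, hot b -> "a"
```

The quantile distributions serve the same purpose for range predicates, where a single hot value isn't the right granularity.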
Now, don’t get me wrong: I am not claiming that data distribution statistics are unique to Clustrix. On the contrary, they are common to any modern RDBMS. I bring them up as an example of a requirement that is independent of the data model, and a very important feature to have.