When it Comes to Real-Time Analytics, Does Cassandra + Hadoop Equal Clustrix?

My sales team recently requested I check out Cassandra and DataStax Enterprise. A prospective customer had asked us for a comparison, and since we rarely see each other in the field, we’ve never looked closely at them before.

Surprisingly, DataStax appears to have taken a very different path to a solution that achieves results similar to Clustrix: a single system that lets you run both transactions and real-time analytics without having to ETL the data. Let me explain.

Let’s look at how DataStax built its solution.

First, Scale the Writes

Cassandra provides near-linear scale for writes. Multiple copies of data in the cluster ensure fault tolerance, and it has a good multi-datacenter story. For queries, it offers either a low-level API or CQL (its SQL-like query language), and will likely move toward CQL.

Now, What About Analytics?

Cassandra can scale simple reads and writes but has no support for analytics. CQL doesn’t support joins or aggregates. On the other hand, Hadoop supports analytics and so it seems like a good addition to assemble a complete solution. It also supports Hive, which allows SQL-like queries for analytics.

Problem is, Businesses Want “Real-Time” Analytics, Too

Running Hadoop analytics requires moving the data into HDFS, and this makes the analytics run offline. DataStax runs Hadoop on Cassandra instead of HDFS, making the analytics near real-time. It also removes the single point of failure of HDFS.

The setup is configured as two logical data centers (OLTP and OLAP) in a multi-master setup. All writes are replicated between both the OLTP and OLAP sides. The writes go to OLTP nodes and analytics go to OLAP nodes. This looks somewhat complex.

Clustrix – A Distributed SQL Database

Similar to Cassandra, Clustrix provides near linear scale for reads, writes, and updates. For fault-tolerance, it maintains multiple copies of each piece of data. The copies are immediately consistent within a cluster. It’s peer-to-peer with a single node type and no master node.

Clustrix allows you to run real-time analytics on the same database. Distributed multi-version concurrency control (MVCC) ensures that writes and analytics don’t interfere, and that each query sees a consistent snapshot of the database. Massively parallel processing (MPP) sends the code to the data just like the map-reduce paradigm in Hadoop, ensuring linear speedup as nodes are added.

This comparative diagram illustrates what a similar Clustrix deployment looks like:


The Strengths of Clustrix and DataStax

DataStax uses Cassandra to scale writes and puts Hadoop on Cassandra to remove data motion. The combination is typically programmed with CQL and Hive, which provide SQL-like query languages. Taken together, this solution seems to be a re-invention of a distributed database with slightly different tradeoffs.

Let’s look at the strengths of each solution.

The Transactions – Reads, Writes, and Updates

Clustrix and Cassandra both can scale reads, writes, and updates. Cassandra will give you higher throughput per node.

Clustrix will give you transactions with SQL, as well as the ACID guarantees that you expect from your primary database. In either case, you can add more nodes to handle more load.

The Real-Time Analytics – Joins, Aggregates, and Sorts

Cassandra and Hadoop have a much weaker story when it comes to real-time analytics – the analytics with Hadoop are not exactly real-time and don’t scale as well as Clustrix. Clustrix is designed with MPP and is a specialized solution for these queries. It will easily outperform Cassandra and Hadoop in both speed and scalability.

For example, to handle a join, Clustrix will send each row to the node that has the matching next row – maximizing operations close to the data and reducing work. There are no broadcasts and no data moving to a single node to perform joins. Hadoop can’t scale joins the same way since it’s not designed for them.
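To make the row-forwarding idea concrete, here is a minimal Python sketch of a join in which each row is sent only to the node that owns its matching key. It is a toy model, not Clustrix's actual implementation: the `owner` function, the modulo placement rule, and the `orders`/`customers` sample data are all assumptions made for illustration.

```python
def owner(key, num_nodes):
    """Hypothetical placement rule: integer keys are spread by modulo."""
    return key % num_nodes


def forwarded_join(orders, customers, num_nodes):
    """Join orders to customers by forwarding each order row to one node.

    orders:    list of (order_id, customer_id)
    customers: list of (customer_id, name)
    Returns the joined rows and the number of row forwards performed.
    """
    # Each node holds only the customer rows it owns.
    nodes = [dict() for _ in range(num_nodes)]
    for customer_id, name in customers:
        nodes[owner(customer_id, num_nodes)][customer_id] = name

    joined = []
    forwards = 0
    for order_id, customer_id in orders:
        # The order row goes to exactly one node: the owner of the matching
        # customer row. Nothing is broadcast, and nothing is gathered on a
        # single node first.
        target = owner(customer_id, num_nodes)
        forwards += 1
        name = nodes[target].get(customer_id)
        if name is not None:
            joined.append((order_id, name))
    return joined, forwards


orders = [(1, 10), (2, 11), (3, 10)]
customers = [(10, "Acme"), (11, "Globex")]
rows, forwards = forwarded_join(orders, customers, num_nodes=3)
print(rows, forwards)
```

In this sketch the number of forwards equals the number of order rows, regardless of how many nodes are in the cluster. A broadcast join would instead send every row to every node.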

High Availability

Clustrix and Cassandra both maintain multiple copies of each piece of the data. In case of node failure, the lost copies are regenerated. Cassandra, with its weaker consistency guarantees, can provide more seamless multi-geography deployment. Clustrix’ transactional guarantees and features (e.g. we support auto-increment fields) require some care for such deployments.

Consistency – The Achilles’ Heel of NoSQL

Let’s consider how Clustrix and DataStax would handle this common scenario – someone shifts the expense for an event from the sales to marketing department right as the CFO is in the middle of analyzing quarterly expenses for all departments.

With Clustrix, the database is always consistent. This means that the analytic query will get the expense value from the database state when the query is executed – from either sales or marketing, but never from both or none. Because Cassandra does not have transactional consistency, the analytic query might see the database in an inconsistent state, and therefore, the expense may show up in sales, marketing, neither or both. This mistake is not acceptable for most businesses.
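The scenario above can be sketched in a few lines of Python. This is a toy simulation rather than the internals of either product: `PlainStore` stands in for a store where every write is visible immediately, `SnapshotStore` stands in for an MVCC-style store where a reader pins one committed version, and both class names and the 100-unit expense are assumptions.

```python
class PlainStore:
    """No transactions: each write becomes visible as soon as it lands."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def read(self, dept):
        return self.rows[dept]

    def write(self, dept, amount):
        self.rows[dept] = amount


class SnapshotStore:
    """Toy multi-version store: each commit publishes a new version."""

    def __init__(self, rows):
        self.versions = [dict(rows)]

    def snapshot(self):
        # A reader pins the latest committed version and keeps using it.
        return self.versions[-1]

    def commit(self, changes):
        # All changes in one commit appear together in a new version.
        self.versions.append({**self.versions[-1], **changes})


def report_plain_interleaved():
    store = PlainStore({"sales": 100, "marketing": 0})
    sales = store.read("sales")          # report reads sales: 100
    store.write("sales", 0)              # move, step 1
    store.write("marketing", 100)        # move, step 2
    marketing = store.read("marketing")  # report reads marketing: 100
    return sales + marketing             # the expense is counted twice


def report_snapshot_interleaved():
    store = SnapshotStore({"sales": 100, "marketing": 0})
    snap = store.snapshot()              # report pins a version
    sales = snap["sales"]
    store.commit({"sales": 0, "marketing": 100})  # move commits atomically
    marketing = snap["marketing"]        # still the pinned version
    return sales + marketing             # the expense is counted once


print(report_plain_interleaved(), report_snapshot_interleaved())
```

With this interleaving the plain store reports 200 (the expense shows up in both departments) while the snapshot reader reports 100. Reordering the reads and writes in the plain case can also produce 0, which is the "neither" outcome.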

What Solution is Right for You?

If you require very high scale (with datasets in hundreds of terabytes to petabytes with high throughput) but you don’t need transaction guarantees, then Cassandra may be a good fit.

If you don’t need truly real-time analytics, can tolerate some mistakes, and want the extra flexibility of map-reduce, then Hadoop on Cassandra may be a good choice. The Cassandra and Hadoop solution seems like a good platform for pulling in other data and doing experimental work.

But if what you require is a scalable SQL relational database that can both scale transactions and accelerate real-time analytics, then Clustrix is the right solution, designed precisely for this use case.