Differentiating “Real-Time” Analytics

You’ve probably heard some buzz lately around the term “real-time analytics,” perhaps even seen some impressive benchmark numbers. In this article I’ll explain what these systems offer and why these benchmarks don’t necessarily translate to real-world use.

Real-Time Analytics Benchmarks: Are They Real?

As a result of the NoSQL “diaspora,” everything is tunable. At first, the focus was on making consistency tunable; now the trend is to sacrifice durability [1]. When you see benchmarks for “In-Memory” systems, that means “memory only,” i.e., NOT durable. So if Amazon temporarily loses your Availability Zone, your data is gone with it.

If you ask these vendors about durability, you get one of two answers. Either they can tune their product to be durable (completely eliminating their impressive performance), or they point you to some other solution that can serve as your durable OLTP database and feed data asynchronously to their In-Memory Real-Time OLAP system. Feeding it synchronously would require a Two-Phase Commit, which would be expensive in both performance and engineering time, because the solution would have to be customized to span two fairly different systems.
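To make the cost of a Two-Phase Commit concrete, here is a minimal sketch of the protocol in Python. It is a toy model, not any vendor's implementation: the `Participant` class and `two_phase_commit` function are hypothetical names, and a real coordinator would also need logging, timeouts, and crash recovery.

```python
class Participant:
    """Toy resource manager standing in for either the OLTP or the OLAP system."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a system that refuses to prepare
        self.staged = None
        self.committed = []

    def prepare(self, row):
        # Phase 1: stage the write and vote yes or no.
        if not self.can_commit:
            return False
        self.staged = row
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the staged write visible.
        self.committed.append(self.staged)
        self.staged = None

    def rollback(self):
        # Phase 2 (someone voted no): discard the staged write.
        self.staged = None


def two_phase_commit(participants, row):
    """Commit the row everywhere, or nowhere. Returns True on commit."""
    votes = [p.prepare(row) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


# Both systems healthy: the write lands in both.
oltp, olap = Participant("oltp"), Participant("olap")
ok = two_phase_commit([oltp, olap], {"id": 1, "amount": 10})

# The OLAP side cannot prepare: the write lands in neither.
oltp2, flaky_olap = Participant("oltp"), Participant("olap", can_commit=False)
failed = two_phase_commit([oltp2, flaky_olap], {"id": 2, "amount": 5})
```

Even in this toy version, every write needs two round trips to every participant, and the slowest system sets the pace for both. That is the performance cost referred to above.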

If you’re in the market for a real-time analytics system, here are the questions you need to ask:

Do You Really Need Two Separate Systems?

If one system could do both jobs, wouldn’t that save you money? Clustrix is both a scalable, fault-tolerant OLTP database AND a scalable, fault-tolerant real-time analytics database. We break up SQL queries and parallelize them, similar to Map/Reduce. But unlike Map/Reduce, the queries aren’t statically written in arcane APIs [2].
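As a rough illustration of what breaking up and parallelizing a query means, the sketch below fans a SUM ... GROUP BY out over several data slices and merges the partial results. It is only an analogy written in Python, not the Clustrix execution engine: the slice layout, the sample rows, and the function names are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: (region, amount) rows, one list per node.
NODE_SLICES = [
    [("us", 10), ("eu", 5)],
    [("us", 7), ("apac", 3)],
    [("eu", 1), ("us", 2)],
]


def partial_sum(rows):
    """Runs per node: SUM(amount) GROUP BY region over local rows only."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def parallel_group_sum(slices):
    """Fan the partial aggregate out to every slice, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(partial_sum, slices))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds counts together
    return dict(merged)


result = parallel_group_sum(NODE_SLICES)
```

Here `result` is `{"us": 19, "eu": 6, "apac": 3}`. The difference with a declarative SQL database is that you write only the query, and the split into per-node work and a final merge is left to the planner.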

OLTP and OLAP systems were separated in the first place because “SQL couldn’t scale.” Early-stage startups without a lot of data or load realize they can just “run the reports on the (singular) database” and postpone the big analytics project until the company concept gains some traction. But when your company hits capacity, OLTP and OLAP contend for the database’s resources. What if you didn’t have to spend all that time and money on infrastructure? Now you don’t. Adding Clustrix nodes increases capacity, letting you put that energy into focusing on your market and gaining more traction there.

How “Real Time” is the Data if it’s Written Asynchronously?

The term “real time” is mainly used in contrast with the data warehouse systems that the industry has used for decades. These systems took at least 24 hours to run reports (often doing a full dump and restore).

In-memory systems boast impressive rates of inserts, but they still take time to initially load the entire dataset, which has to be done after every Amazon Availability Zone outage.

Once the data is (re)loaded, you might think that such a system can keep up easily. However, that impressive write performance depends on high levels of concurrency, and concurrency is exactly what the incoming feed lacks: MySQL’s serialized binlog replication delivers changes as a single stream, which becomes the bottleneck.

With Clustrix, on the other hand, your production OLTP data is the exact same data being queried by your OLAP reports. You absolutely cannot get any more real time than that. Other products that promise “real-time analytics” do not provide that level of recency, at least not in their recommended configurations or those used in their impressive benchmarks.


How Complete is the Data at Any Point in “Real Time”?

I’ve previously written about recombining shards into a scalable data warehouse. What some don’t realize is that even though a single MySQL instance is consistent, a sharded architecture using MySQL’s asynchronous replication is only eventually consistent.

That type of architecture can result in inconsistent data in the reporting system, such as:

> Double counting (e.g., OLTP data in two shards as it moves from one to the other)

> Partial data (data has arrived from one source, but not yet from the other)

Identifying and handling these exceptions pushes more work onto your analytics team.
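Both anomalies are easy to reproduce in a toy model. The sketch below uses plain Python dictionaries as stand-ins for shards; the shard names, order IDs, and amounts are invented for illustration, and no real replication is involved.

```python
def report_total(shards):
    """Naive report: add up every row visible on every shard."""
    return sum(amount for rows in shards.values() for amount in rows.values())


# Hypothetical starting state: two orders, one per shard. The true total is 150.
shards = {"A": {42: 100}, "B": {7: 50}}
before_move = report_total(shards)

# Rebalancing copies order 42 to shard B before deleting it from shard A.
shards["B"][42] = shards["A"][42]
during_move = report_total(shards)  # order 42 is visible on both shards

del shards["A"][42]
after_move = report_total(shards)

# Partial data: the reporting copy has applied shard A's changes but not shard B's.
source = {"A": {1: 100}, "B": {2: 50}}
reporting_copy = {"A": dict(source["A"]), "B": {}}
partial = report_total(reporting_copy)
```

A report that runs during the move sees 250 instead of 150 (double counting), and a report against the lagging copy sees 100 instead of 150 (partial data). Neither result raises an error, which is why these cases have to be detected by hand.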

On Clustrix, analytics queries run in ACID transactions with Multi-Version Concurrency Control, processing all your constantly in-flux data in a consistent snapshot, as it was at the exact point in time when the query was started.
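The idea behind a consistent snapshot can be sketched with a toy versioned store. This is a simplified model of Multi-Version Concurrency Control in general, not the Clustrix implementation: the `VersionedStore` class and the account names are hypothetical, and each write is treated as its own commit.

```python
class VersionedStore:
    """Toy MVCC store: every write appends a version tagged with a commit number."""

    def __init__(self):
        self.commit_no = 0
        self.versions = {}  # key -> list of (commit_no, value), oldest first

    def write(self, key, value):
        self.commit_no += 1
        self.versions.setdefault(key, []).append((self.commit_no, value))

    def snapshot(self):
        # A query remembers the latest commit number at the moment it starts.
        return self.commit_no

    def read(self, key, snapshot):
        # Return the newest version committed at or before the snapshot.
        for commit_no, value in reversed(self.versions.get(key, [])):
            if commit_no <= snapshot:
                return value
        return None

    def sum_all(self, snapshot):
        total = 0
        for key in self.versions:
            value = self.read(key, snapshot)
            if value is not None:
                total += value
        return total


store = VersionedStore()
store.write("acct_1", 100)
store.write("acct_2", 50)

snap = store.snapshot()        # a long-running report starts here
store.write("acct_1", 70)      # meanwhile, 30 moves from acct_1 ...
store.write("acct_2", 80)      # ... to acct_2

report = store.sum_all(snap)               # reads only versions as of the snapshot
latest = store.sum_all(store.snapshot())   # a new query sees the finished transfer
# A reader without a snapshot could mix old and new versions:
torn = store.read("acct_1", snap) + store.read("acct_2", store.snapshot())
```

Both `report` and `latest` come out to 150, while the mixed read `torn` comes out to 180. The snapshot is what keeps a report correct while the underlying data keeps changing.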

And because of the parallelism in Clustrix, those OLAP queries that used to take hours on a single relational database now take minutes or even seconds, and they get even faster as you add more nodes!

The Complete Solution, Soup to Nuts

The next time you hear about a “real-time analytics” solution, ask the vendor to zoom out and show you the whole picture of how it fits into your architecture. You’ll find that for the price, there’s an awful lot it doesn’t do. Then come talk to Clustrix, and prepare to be blown away.

[1] The cloud revolution gives us the “cheap” portion of the classic trilemma: “Fast, Cheap, Good: Choose 2,” making cloud storage either fast but volatile (memory based) or durable but slow, not both.

[2] SQL queries are declarative, so they are compiled into a program (a “plan”) based on the statistics of your data. As your schema and data change, the plan may improve automatically without your having to rewrite the query. Map/Reduce queries would need to be rewritten.