Beyond ETL: How to Leverage 5x More Data on Every Bid

An Overview of Linearly Scalable, High-Performance Solutions

The ability to effectively target consumers and optimize the performance of a single ad or entire campaign in real time is the primary goal of every digital advertising company. To meet this goal, advertisers need more than simple performance metrics. They have to analyze massive volumes of data, generate reports and gain detailed insight to continuously refine advertising strategies — all in real time.

To stay competitive in the fast-paced digital advertising industry, companies must provide a massively scalable, feature-rich and high-performance platform to attract and retain customers. However, most analytics platforms are built on databases that simply can’t scale to support detailed analytics or advanced feature sets, which can significantly hinder innovation and customer success.

This white paper explains how digital advertisers can address these critical advertising technology challenges using ClustrixDB and build innovative applications that deliver real-time analytics on live operational data while supporting massive transaction volumes. By leveraging a combination of intelligent data distribution and distributed query processing across commodity hardware, ClustrixDB makes it easy to scale as the database workload grows.

Solving the Key Challenges of Advertising Technology Today


The rapid, industry-wide adoption of real-time bidding and targeting presents several technical challenges. First is the need for speed. The bid for an impression must be completed in an extremely short timeframe — about 50-100 milliseconds. And advertisers must frequently generate customized performance reports, use them to create and refine advertising strategies and allocate funds to support those strategies in real time. However, the demand for real-time performance often leads to bottlenecks at the database, which must ingest writes and updates with high throughput and run consistently fast analytics while performing complex transactions.


Accurate targeting requires the ability to correlate offline and real-time data gathered from millions or even billions of sources, such as cookies and previous visits. Storing and analyzing this information depends on a massively scalable database to support these processes.


Traditional databases often struggle to keep up with data growth and with analytics that have become extremely detailed and complex. Most companies start with databases such as MySQL and PostgreSQL, which can only run on a single server or node. As data sets grow larger and queries more complex, it becomes more difficult to deliver the required capacity and processing power. Although sharding these databases is an option, it demands considerable engineering and operational resources, which reduces the time development staff can spend building innovative new features that make their products more competitive.


Some databases such as Vertica can provide high-speed analytics, but only on older data that has first been loaded through ETL. The widespread need for real-time results, as well as the extra cost and complexity of adding another database to the infrastructure, often makes this option infeasible. Some companies use Hadoop to deliver offline analytics on unstructured data, but because it can only process older data with a long delay, Hadoop isn’t a viable option for companies that require real-time analytics.


While some NewSQL distributed databases such as VoltDB, MemSQL and NuoDB scale out and can handle high transaction volumes, they lack robust real-time analytics. They also have limited support for queries involving joins and can’t distribute their processing. Likewise, while existing NoSQL solutions such as MongoDB and Cassandra are effective for simple point read-and-write loads, they lack support for joins and complex queries.

Required Features for Real-Time Analytics

To meet the requirements for scalability, performance and real-time analytics in the digital advertising industry, the primary database must include:

  • Scale-out Architecture – As the load increases, administrators can easily add extra nodes to process the additional demand. This should require no change in the application because data and queries can auto-distribute. Additional reads, writes and updates can also be performed with more nodes.
  • SQL and ACID – The primary database needs to ensure all data is safe and reliable. It should also not require many application code changes or exceed the existing skillset of internal resources.
  • Fault Tolerance – The primary database should remain available regardless of node failures and geographic region outages.
  • Multi-Version Concurrency Control (MVCC) – Distributed MVCC is a key database capability that ensures analytic queries and writes do not interfere with each other. The analytic query reads a snapshot of the event as it was recorded. The write queries can then write newer versions. The older versions are subsequently deleted when no longer needed.
  • Massively Parallel Processing (MPP) – The database must be able to use multiple nodes and cores on each node in parallel to accelerate analytic queries. If the query evaluation is not distributed and done in parallel while minimizing data movement, the performance will not scale.
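
The MVCC behavior described in the list above can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions, not ClustrixDB's implementation: the `VersionedStore` class, its method names, the logical clock and the sample key are all hypothetical and chosen for illustration.

```python
class VersionedStore:
    """Toy multi-version store: writers append versions, readers use a snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit timestamp (assumption: one global counter)

    def write(self, key, value):
        """Commit a new version of key and return its commit timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return the timestamp that a newly started query should read at."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts, else None."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def vacuum(self, oldest_active_ts):
        """Drop versions that no snapshot at or after oldest_active_ts can see."""
        for key, versions in list(self._versions.items()):
            visible = [i for i, (ts, _) in enumerate(versions) if ts <= oldest_active_ts]
            if visible:
                # Keep the newest version visible at oldest_active_ts and all newer ones.
                self._versions[key] = versions[visible[-1]:]


store = VersionedStore()
store.write("campaign:42:clicks", 100)
report_ts = store.snapshot()            # an analytic query starts here
store.write("campaign:42:clicks", 101)  # a concurrent write adds a newer version

report_value = store.read("campaign:42:clicks", report_ts)         # 100
latest_value = store.read("campaign:42:clicks", store.snapshot())  # 101
```

The report keeps reading the value that was current when it started, while the writer proceeds without waiting. Once no query needs the old snapshot, `vacuum` removes the superseded version.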

Both MVCC and MPP are especially important for processing massive advertising workloads because they allow nodes to be added, which in turn helps to accelerate analytics. The ability to add nodes as demand increases is essential, because it allows businesses to predictably scale without requiring extra engineering or operations time.

Beyond ETL_Image 1

ClustrixDB: Designed for Real-Time Analytics

ClustrixDB is the only database that includes all of the features required to deliver a complete solution for digital advertising companies. In fact, it is the only primary database with MPP to offer real-time analytics. Other capabilities include:


ClustrixDB splits user queries into fragments that work in parallel to accelerate queries. The code goes to where the data is instead of the query node pulling all the data. ClustrixDB is the only transactional database that can send code to the data; all other databases pull data to a single node that does the processing. The following figure illustrates how ClustrixDB processes an analytic query:

Beyond ETL_Image 2
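
The idea of sending code to the data can be sketched as a small scatter-gather aggregate. This is a simplified model under stated assumptions, not ClustrixDB's query engine: the `NODES` dictionary, the row layout and the function names are hypothetical, and threads stand in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" holds its own slice of (campaign_id, clicks) rows.
NODES = {
    "node1": [("c1", 3), ("c2", 5), ("c1", 2)],
    "node2": [("c2", 1), ("c3", 7)],
    "node3": [("c1", 4), ("c3", 1), ("c2", 2)],
}


def partial_sum(rows):
    """Query fragment that runs where the data lives: a local SUM grouped by campaign."""
    totals = Counter()
    for campaign_id, clicks in rows:
        totals[campaign_id] += clicks
    return totals


def distributed_sum(nodes):
    """Run the fragment on every node in parallel, then merge the small partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(partial_sum, nodes.values()))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts rather than replacing them
    return dict(merged)


totals = distributed_sum(NODES)
print(totals)
```

Only the per-node totals travel to the coordinating step, so the amount of data moved depends on the number of groups rather than the number of rows.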


ClustrixDB intelligently distributes multiple copies of data across nodes. When new nodes are added, the data will auto-distribute to the new node. If a node is lost, the missing copies are quickly regenerated. Unlike other databases, ClustrixDB has independent distribution for the primary and secondary indexes, so given a value the administrator can tell which node it resides on. This removes broadcasts for queries such as joins, allowing them to scale linearly.
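
The following sketch shows, in a simplified form, how hashing lets a lookup go straight to the node that owns a value instead of being broadcast to every node. It is an illustration under stated assumptions, not ClustrixDB's actual distribution scheme: the node names, the replica count, the index names and the modulo placement are all hypothetical.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster members
REPLICAS = 2                                  # assumed number of copies per entry


def owners(index_name, value, nodes=NODES, replicas=REPLICAS):
    """Map an (index, value) pair to the nodes that hold its copies."""
    digest = hashlib.sha256(f"{index_name}:{value}".encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Place the copies on consecutive nodes so that they land on different machines.
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


# The primary key and a secondary index are placed independently of each other,
# so a lookup on either one can be routed without asking every node.
row_nodes = owners("primary", 1001)
index_nodes = owners("idx_campaign", "c42")
print(row_nodes, index_nodes)
```

Plain modulo placement is used here only to keep the example short. It would move most entries whenever the node list changes, so a production system would need a placement scheme that limits data movement when nodes are added.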


In performance tests that compared ClustrixDB to MySQL 5.5 on Amazon Web Services (AWS) — with higher and variable network latencies — ClustrixDB was shown to have near-linear speedup.


ClustrixDB accelerates joins and aggregates as nodes are added. The following graphs illustrate the linear speedup enabled by ClustrixDB. Note that joins are accelerated from over three minutes to only 17 seconds. Aggregates accelerate from 19 seconds to two seconds simply by adding nodes. These graphs show the best case for MySQL as the temporary results fit in its memory and no transactions are being run simultaneously.

Beyond ETL_Image 3
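
The speedup implied by the timings quoted above can be worked out directly. This is a small sketch: the join baseline is given only as "over three minutes", so 180 seconds is used as an assumed lower bound, and the `speedup` helper is a hypothetical name.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the second timing is than the first."""
    return before_seconds / after_seconds


join_speedup = speedup(180, 17)     # 180 s is an assumed lower bound for "over three minutes"
aggregate_speedup = speedup(19, 2)

print(f"joins: at least {join_speedup:.1f}x faster")
print(f"aggregates: {aggregate_speedup:.1f}x faster")
```

With these inputs the joins come out at least 10.6 times faster and the aggregates 9.5 times faster.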


ClustrixDB makes more cores and memory available as nodes are added, which in turn accelerates database performance because it’s no longer constrained by factors such as:

  • The data set does not fit into the memory of a single MySQL box (as shown above).
  • Multiple analytic queries and writes are happening at the same time.

Since the database is already challenged due to a high number of active queries, the speedup enabled by ClustrixDB is significant. In many cases, queries that once took hours are completed in seconds.

Beyond ETL_Image 4

ClustrixDB Customer Success Stories

ClustrixDB helps leading digital advertising companies accelerate their database queries and simplify operations, allowing them to focus on features instead of the data management infrastructure. Two of these companies are:


AdScience runs complicated algorithms to process bids for ad space based on click history. The analytics must be run on real-time data. Prior to ClustrixDB, AdScience could only process one day’s worth of history. With ClustrixDB’s fast operational analytics, AdScience can leverage more historical data to select the right bid, increasing revenue potential by five times. AdScience can now run algorithms that leverage five days of click history within a 120-millisecond window.


Engage:BDR helps advertisers run campaigns with a self-service platform that helps precisely target audiences and deliver outstanding ROI. Engage:BDR also enables customers to run custom reports to manage campaigns in real time. Before ClustrixDB, queries often took more than four hours and then had to be abandoned. Using ClustrixDB, they can complete the same queries in less than 15 seconds — a highly attractive benefit to customers. They can also focus more on developing value-added features instead of scaling and maintaining their database.

Advertising Technology: What’s Next?

Companies in the extremely competitive online advertising industry are always looking for an edge that puts them ahead of the game. ClustrixDB, with its scale-out architecture and Massively Parallel Processing, is uniquely capable of supporting the fast pace and complexity of change in the world of digital advertising. Not only is ClustrixDB the right choice, it’s the only choice for companies looking to focus on building innovative new features and go to market faster than ever before.

Take the next step and see what ClustrixDB can do for your business. For a free trial or consultation, contact Clustrix at