Why Clustrix: Database Landscapes

Helping you understand your options in the complex database space.

Many databases and data management platforms are now on the market. We created a series of landscapes to help our customers determine the right solution for their problem. Whether the database is primary or analytics, SQL or NoSQL, comparing features gives clarity.

All Database Landscape

The All Database Landscape shows the different databases available to users today. Each has its own storage requirements, depending on the structure of the data, where the data is in its lifecycle, and what kind of workload is run against it.

[Figure: All Databases]

Primary Databases

Companies typically run their primary business on a primary database, which is usually able to take a high volume of writes. Some of these databases also provide real-time analytics, but others don’t, and this differentiating factor is critical. Legacy SQL databases are being challenged by both NoSQL and NewSQL databases, which provide scale-out capabilities for growth. These newer databases are the primary choice for most new applications being developed today and are slowly replacing some legacy installations.

Analytics Platforms

Analytics platforms hold offline data that is extracted, transformed, and loaded (ETL) from a primary database and aggregated from other sources. These platforms are optimized for fast and scalable analytics on high volumes of historic data and can be thought of as batch processing systems, especially in the case of Hadoop.
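
As a rough illustration of the ETL flow described above, the sketch below extracts order rows from a stand-in primary database, aggregates them per day, and loads the result into a separate stand-in analytics store. Both stores are in-memory SQLite databases, and the table names, column names, and sample rows are assumptions made up for this example, not part of any particular product.

```python
import sqlite3

# Stand-in primary database with a few made-up order rows.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE orders (id INTEGER, day TEXT, amount_cents INTEGER)")
primary.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2014-01-01", 1250), (2, "2014-01-01", 750), (3, "2014-01-02", 2000)],
)

# Stand-in analytics store that only keeps pre-aggregated data.
analytics = sqlite3.connect(":memory:")
analytics.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue REAL)")


def run_etl(source, target):
    """Run one batch: extract rows, aggregate per day, load the totals."""
    # Extract: read the raw rows from the primary database.
    rows = source.execute("SELECT day, amount_cents FROM orders").fetchall()

    # Transform: sum cents per day.
    totals = {}
    for day, cents in rows:
        totals[day] = totals.get(day, 0) + cents

    # Load: write one row per day, converting cents to currency units.
    target.executemany(
        "INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)",
        [(day, cents / 100) for day, cents in sorted(totals.items())],
    )
    target.commit()
    return len(totals)


days_loaded = run_etl(primary, analytics)
report = analytics.execute(
    "SELECT day, revenue FROM daily_revenue ORDER BY day"
).fetchall()
```

Because the analytics store is only refreshed when the batch runs, its data lags the primary database, which is the offline characteristic noted above.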

File Stores

File stores provide cost-efficient and scalable storage, and are used primarily for blob storage. For example, an online file storage solution would store the files there, but use a primary database to keep information about users, file versioning, etc.
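
The split between a file store and a primary database in the example above might look like the following sketch. A plain dictionary stands in for the file store and an in-memory SQLite database stands in for the primary database; the `upload` and `download` helpers and the `file_versions` table are hypothetical names chosen for illustration.

```python
import hashlib
import sqlite3

# Stand-in for a file store: maps a content hash to the raw bytes.
blob_store = {}

# Stand-in for the primary database that tracks owners and file versions.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE file_versions ("
    "owner TEXT, filename TEXT, version INTEGER, blob_key TEXT)"
)


def upload(owner, filename, data):
    """Put the bytes in the blob store and record a new version row."""
    blob_key = hashlib.sha256(data).hexdigest()
    blob_store[blob_key] = data
    latest = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM file_versions "
        "WHERE owner = ? AND filename = ?",
        (owner, filename),
    ).fetchone()[0]
    version = latest + 1
    db.execute(
        "INSERT INTO file_versions VALUES (?, ?, ?, ?)",
        (owner, filename, version, blob_key),
    )
    return version


def download(owner, filename, version):
    """Look up the blob key in the database, then fetch the bytes."""
    row = db.execute(
        "SELECT blob_key FROM file_versions "
        "WHERE owner = ? AND filename = ? AND version = ?",
        (owner, filename, version),
    ).fetchone()
    return blob_store[row[0]] if row else None
```

The database rows stay small and queryable, while the large payloads live in the cheaper store and are reached only through the key.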

Specialized Data Stores

Specialized data stores are narrowly focused on a particular problem and usually sit alongside the primary database to handle very specialized use cases, such as Solr for text search, any of several caching solutions, and graph databases for finding social connections.

Scale-out Primary Database Landscape

Scale-out primary databases are dominating new application development. As companies build new systems for the cloud, they’re looking for scale-out solutions that can grow incrementally and flexibly. These applications are expected to handle millions of users or high volumes of machine-generated data and therefore need to scale predictably and fast.

[Figure: Clustrix Infographic: Scale-Out Primary DB Landscape]

However, the features that scale-out platforms provide vary widely. The most feature-rich solution might not be the best; it really depends on the use case that needs to be solved. But it does seem that, with the mindshare behind NoSQL, some developers might not have paid attention to the NewSQL databases, which provide the same scale-out capabilities as NoSQL along with the SQL and ACID guarantees that have served us well for decades.

Key-Value Stores

Key-value stores are used for storing simple data and for caching, and they are widely used in the industry. For applications of any complexity, however, a richer accompanying database is needed.
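
One common way to pair a key-value store with a richer database is a cache-aside lookup, sketched below. A dictionary stands in for the key-value store and an in-memory SQLite database stands in for the accompanying database; the `users` table, the key format, and the `get_user_name` helper are assumptions for this example.

```python
import sqlite3

# Stand-in for a key-value store used as a cache.
cache = {}

# Stand-in for the richer accompanying database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'Ada')")


def get_user_name(user_id):
    """Return the name from the cache, falling back to the database."""
    key = f"user:{user_id}:name"
    if key in cache:
        return cache[key]  # cache hit: no database query needed
    row = db.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None  # unknown user: nothing to cache
    cache[key] = row[0]  # cache miss: remember the value for next time
    return row[0]
```

The key-value side only answers lookups by exact key; anything that filters, joins, or aggregates still goes to the database.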

Tabular Stores

Tabular stores, primarily developed following the publication of Google’s Bigtable paper, allow flexible schemas where a row needs to have only a subset of the columns. Cassandra, for example, provides very good write throughput with tunable eventual consistency at the row level. Compared with relational databases, tabular stores provide no analytic queries and cannot perform multi-row transactions. These limitations aren’t acceptable for many applications, so some users choose to build such logic into their applications.
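
The sketch below illustrates both points with a toy in-memory structure: rows that carry only a subset of the columns, and a two-row update that the application has to coordinate itself. `TabularStore` and `transfer` are hypothetical names for this example and do not correspond to the API of Cassandra or any other product.

```python
class TabularStore:
    """Toy tabular store: each row holds only the columns written to it."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def put(self, row_key, columns):
        # Each put touches a single row; there is no multi-row atomicity.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, column, default=None):
        return self.rows.get(row_key, {}).get(column, default)


def transfer(store, src, dst, amount):
    """Application-level stand-in for a multi-row transaction."""
    src_balance = store.get(src, "balance", 0)
    if src_balance < amount:
        return False
    store.put(src, {"balance": src_balance - amount})
    try:
        dst_balance = store.get(dst, "balance", 0)
        store.put(dst, {"balance": dst_balance + amount})
    except Exception:
        # Compensate by restoring the first row if the second write fails.
        store.put(src, {"balance": src_balance})
        raise
    return True


store = TabularStore()
store.put("acct:1", {"balance": 100, "owner": "Ada"})  # two columns
store.put("acct:2", {"balance": 20})                   # one column only
moved = transfer(store, "acct:1", "acct:2", 30)
```

Even this small example leaves gaps that a real transaction would close: another client can observe the state between the two writes, and the compensation step can itself fail.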

Document Stores

Document stores became popular partly because they store JSON, which maps easily to objects, especially in the JavaScript world. These stores provide flexible nested schemas and indexing, which fits document-like structures well. They don’t provide joins and therefore can’t serve as the primary store for many use cases where the data is inherently relational (e.g., users, products, inventories, and orders for e-commerce are inherently related). MongoDB has become popular, but its database-level locks indicate the use case it targets: few small writes and many reads. To a certain degree, it performs well when acting as a cache layer.
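
When the store offers no joins, relating two collections becomes the application's job. The sketch below uses plain Python lists of dictionaries as stand-ins for two document collections; the field names and sample documents are made up for illustration.

```python
# Stand-ins for two document collections with made-up sample data.
users = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Grace"},
]
orders = [
    {"_id": 10, "user_id": 1, "total": 30.0},
    {"_id": 11, "user_id": 1, "total": 12.5},
    {"_id": 12, "user_id": 2, "total": 99.0},
]


def orders_with_user_names(users, orders):
    """Application-side join: match each order to its user by id."""
    users_by_id = {user["_id"]: user for user in users}
    return [
        {
            "order_id": order["_id"],
            "name": users_by_id[order["user_id"]]["name"],
            "total": order["total"],
        }
        for order in orders
    ]


joined = orders_with_user_names(users, orders)
```

In a relational database the same result is a single join in one query; here both collections must be fetched in full or queried in multiple round trips before the application can combine them.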

Relational or NewSQL Databases

Relational or NewSQL databases are newer and provide distributed transactions, analytics, and reporting. At one end, VoltDB and MemSQL are good at transactions and have minimal support for analytics (neither fully supports joins). At the other end is Clustrix, which supports scalable transactions and massively parallel processing for analytics, similar to what is found in columnar analytics databases. Clustrix is the most feature-rich scale-out database on the market.

Whether you want a simple key-value store or a fully functional relational database, a variety of good options are available to fit your application.

SQL Databases

The SQL database landscape is increasingly complex and difficult to conceptualize. With the market shift away from scale-up databases, many new players have emerged that are highly specialized. Yet, for many applications, the ideal is a single database that supports transactions, runs real-time queries on the latest data, and can grow by adding nodes. In the graphic below, we have distilled the landscape to highlight the capabilities of major SQL databases, including Clustrix.

[Figure: SQL DB, Jan 8]

Many of the above databases are complementary and the one you use depends on the particular workload you’re running. Let’s look at each category in detail.

In-Memory Row Stores

These databases are held entirely in memory (RAM) to support fast transactions. The data may also be distributed and replicated across multiple nodes. This type of database is designed for smaller data sets (under 1 TB). These databases do very fast point reads and writes, especially when low latency is required (for example, under 10 ms). However, they have very limited support for analytic queries.

In-Memory Column Stores

These databases are held entirely in memory (RAM), and the data is stored as columns for fast analytics. So far, the only available product in this category is SAP HANA.

Single-Node Row Stores

This group contains all the primary databases that are good at both transactions and analytics, but do not scale beyond a single server. Some developers use sharding to scale them beyond a single node, but at the considerable cost of additional development and administration overhead. These databases were architected many years ago to work on a single node.
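
To give a sense of what application-level sharding involves, the sketch below routes each key to one of several single-node databases by hashing it. The shard names, `shard_for`, and `ShardedStore` are hypothetical, and dictionaries stand in for the individual database servers.

```python
import hashlib

# Hypothetical names for three independent single-node databases.
SHARD_NAMES = ["db-0", "db-1", "db-2"]


def shard_for(key, shard_names=SHARD_NAMES):
    """Pick a shard by hashing the key, so the same key always routes alike."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shard_names[int(digest, 16) % len(shard_names)]


class ShardedStore:
    """Routing layer the application must build and maintain itself."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        # A dictionary stands in for each single-node database.
        self.shards = {name: {} for name in self.shard_names}

    def put(self, key, value):
        self.shards[shard_for(key, self.shard_names)][key] = value

    def get(self, key):
        return self.shards[shard_for(key, self.shard_names)].get(key)

    def count_all(self):
        # Anything spanning keys has to visit every shard and merge results.
        return sum(len(data) for data in self.shards.values())


store = ShardedStore(SHARD_NAMES)
for i in range(100):
    store.put(f"user:{i}", i)
```

Single-key reads and writes stay simple, but queries across keys, transactions that span shards, and adding a shard (which changes where keys route under this modulo scheme) all become application and operations work.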

NoSQL Write Databases

These databases are designed for unstructured data and nonrelational workloads. However, they are not designed for concurrency or complex write loads. They also do not support complex analytic queries. Most do not even support basic constructs such as joins.

Shared-Data Row Stores

These have multiple query processing nodes but a single data node. The query processing nodes pull data to process queries and push back updated data. These solutions deliver high availability but do not work well with high concurrency, since multiple nodes can write to the same table at the same time. Also, a single node is used to process a query, which hurts the performance of analytic queries.

Shared-Nothing Row Stores

Clustrix is the only database in this category. Databases in this category are able to scale transactions and support real-time analytics. Row orientation allows for scalable transaction performance. Massively parallel processing (MPP) allows them to run fast real-time analytics. Data sets can range up to 100 terabytes. Beyond that, for an analytics workload, columnar storage and compression become critical.
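
The general idea behind massively parallel processing can be sketched as a scatter-gather aggregation: each node aggregates the rows it owns, and only the small partial results are merged. The code below is a simplified illustration with made-up data and thread-based stand-ins for nodes; it is not a description of how Clustrix implements query execution.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the rows stored on one node (made-up data).
nodes = [
    [("books", 12.0), ("games", 30.0)],
    [("books", 8.0)],
    [("games", 5.0), ("music", 7.5)],
]


def partial_sum(rows):
    """Aggregate only the rows that live on one node."""
    totals = {}
    for category, amount in rows:
        totals[category] = totals.get(category, 0.0) + amount
    return totals


def parallel_group_sum(nodes):
    """Scatter the aggregation to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(partial_sum, nodes))
    merged = {}
    for partial in partials:
        for category, amount in partial.items():
            merged[category] = merged.get(category, 0.0) + amount
    return merged


totals = parallel_group_sum(nodes)
```

Because each node works on its own rows and ships back only one entry per category, adding nodes spreads both the data and the query work.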

Shared-Nothing Column Stores

These databases are designed for offline analytics. With columnar compression and by reading only the columns the queries require, they are able to scale from hundreds of terabytes to petabytes of data. However, because of their columnar orientation, these databases are not able to support transactions or fast writes (some of them allow fast appends, but not fast updates). They usually rely on ETL from primary transactional databases.
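
The two properties mentioned above, reading only the needed columns and compressing column data, can be illustrated with a toy layout. The sample rows are made up, and run-length encoding is used here as one simple example of a columnar compression scheme, not as the scheme any specific product uses.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 2, "region": "EU", "amount": 5.0},
    {"order_id": 3, "region": "US", "amount": 7.5},
]


def to_columns(rows):
    """Pivot rows into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one column reads that column and nothing else.
total_amount = sum(columns["amount"])

# Values of the same type sit together, so repeated values compress well.
encoded_region = run_length_encode(columns["region"])
```

The same layout explains the write limitation: updating one record means touching every column list, whereas appending new values at the end of each list is comparatively cheap.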

Amazon Database-as-a-Service Landscape

[Figure: AWS DBaaS, Dec 2 2013]

Amazon has developed a rich ecosystem of services. Within this ecosystem, it’s easy to scale application servers and increasingly easy to deploy applications. However, a significant gap remains at the database layer.

For analytics, there is Amazon Redshift, a columnar SQL database, and Amazon EMR (Elastic MapReduce) for more flexible and unstructured workloads.

For the primary database, users have a scale-out NoSQL solution, Amazon DynamoDB, which provides key-value lookup. There is also a single-node SQL solution, Amazon RDS, which provides MySQL, SQL Server, or Oracle with space up to 1TB.

However, no services are available that provide a scale-out SQL primary database when data needs to scale beyond the 1TB (or the maximum query load) that RDS provides. Clustrix fits in this segment by providing scale for transactions, acceleration for analytics, and high availability through multiple data copies.

Similar to AWS, most cloud vendors are missing this critical component in the data layer, limiting adoption by a whole category of applications.
