There are several databases and data management platforms on the market. We created a series of landscapes to help our customers determine the right solution to their problem. Whether the workload is primary or analytics, SQL or NoSQL, comparing features gives clarity.
All Database Landscape
The All Database Landscape shows the different databases available to a user today. Different kinds of storage are required depending on the structure of the data, where the data is in its lifetime, and what kinds of workloads are run against it.
Businesses typically run their primary business on a primary database. These are usually able to take a high volume of writes. Some of these also provide real-time analytics, but others don’t, and this is a critical differentiating factor. Legacy SQL databases are being challenged by both NoSQL and NewSQL databases, which provide scale-out for growth. These newer databases are the primary choice for most new applications being developed today and are slowly replacing some legacy installations.
The analytics platforms hold offline data that is loaded through Extract, Transform, Load (ETL) from a primary database and aggregated from other sources. These are optimized for fast and scalable analytics on high volumes of historic data and can be thought of as batch processing systems, especially with Hadoop.
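As a rough illustration of the ETL flow described above, here is a minimal sketch that uses two in-memory SQLite databases as stand-ins for a primary database and an analytics store. The table names (`orders`, `daily_sales`), the sample rows, and the `run_etl` helper are all hypothetical and chosen only for illustration; a real pipeline would run on a schedule against much larger data.

```python
import sqlite3

# Stand-in for the primary (transactional) database. Sample data is made up.
primary = sqlite3.connect(":memory:")
primary.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, day TEXT, amount INTEGER)"
)
primary.executemany(
    "INSERT INTO orders (day, amount) VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-01", 15), ("2024-01-02", 7)],
)
primary.commit()

# Stand-in for the analytics platform.
analytics = sqlite3.connect(":memory:")
analytics.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total INTEGER)")


def run_etl() -> int:
    """Run one batch and return the number of summary rows written."""
    # Extract: pull raw rows out of the primary database.
    rows = primary.execute("SELECT day, amount FROM orders").fetchall()

    # Transform: aggregate amounts per day.
    totals = {}
    for day, amount in rows:
        totals[day] = totals.get(day, 0) + amount

    # Load: write the aggregates, replacing any earlier batch for the same day.
    analytics.executemany(
        "INSERT OR REPLACE INTO daily_sales (day, total) VALUES (?, ?)",
        sorted(totals.items()),
    )
    analytics.commit()
    return len(totals)


run_etl()
```

Because the load step replaces rows by day, rerunning the batch is safe, which is one common way to keep such offline jobs repeatable.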
The file stores provide cost-efficient and scalable storage. These are primarily used for blob storage. For example, an online file storage solution would store the files here, but it would use a primary database to keep information about users, file versioning, etc.
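The split between blob storage and metadata can be sketched as follows. This is a toy under stated assumptions: a temporary directory stands in for the file store, an in-memory SQLite database stands in for the primary database, and the `files` table and the `upload` and `download` helpers are hypothetical names. It also assumes owner and file names contain no path separators.

```python
import sqlite3
import tempfile
from pathlib import Path

blob_dir = Path(tempfile.mkdtemp())  # stand-in for a scalable file store
db = sqlite3.connect(":memory:")     # stand-in for the primary database
db.execute(
    "CREATE TABLE files (owner TEXT, name TEXT, version INTEGER, blob_path TEXT)"
)


def upload(owner: str, name: str, data: bytes) -> int:
    """Write the bytes to the file store and record a new version in the database."""
    row = db.execute(
        "SELECT COALESCE(MAX(version), 0) FROM files WHERE owner = ? AND name = ?",
        (owner, name),
    ).fetchone()
    version = row[0] + 1
    path = blob_dir / f"{owner}-{name}-v{version}"
    path.write_bytes(data)  # the blob goes to the file store
    db.execute(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        (owner, name, version, str(path)),
    )  # the metadata goes to the primary database
    db.commit()
    return version


def download(owner: str, name: str, version: int) -> bytes:
    """Look up the blob location in the database, then read it from the file store."""
    row = db.execute(
        "SELECT blob_path FROM files WHERE owner = ? AND name = ? AND version = ?",
        (owner, name, version),
    ).fetchone()
    if row is None:
        raise KeyError((owner, name, version))
    return Path(row[0]).read_bytes()


first = upload("ada", "notes.txt", b"first draft")
second = upload("ada", "notes.txt", b"second draft")
```

The database row stays small no matter how large the file is, which is why the two kinds of storage are usually paired rather than merged.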
The specialized data stores are narrowly focused on a particular problem and are usually put alongside the primary database to handle very specialized use cases, such as Solr for text search, various caching solutions, and graph databases for finding social connections.
Scale-out Primary Database Landscape
Scale-out primary databases are dominating new application development. As companies build new systems for the cloud, they’re looking for scale-out solutions that can grow incrementally and with flexibility. These applications are expected to handle millions of users or high volumes of machine-generated data and therefore need to scale predictably and fast.
However, the features provided by scale-out platforms vary widely. The most feature-rich solution might not be the best; it really depends on the use case one is trying to solve. But it does seem that, with the mindshare behind NoSQL, some developers might not have paid attention to the NewSQL databases, which provide scale-out like NoSQL along with the SQL and ACID guarantees that have served us well for decades.
The key value stores are used for storing simple data and caching and are widely used in the industry. However, for applications of any complexity, there needs to be a richer accompanying database.
The tabular stores, primarily developed following Google’s Bigtable paper, allow flexible schemas where a row only needs to have a subset of the columns. Cassandra, for example, provides very good write throughput with tunable eventual consistency at the row level. When compared to relational databases, these provide no analytic queries and are not able to do multi-row transactions. This isn’t acceptable for many applications, and therefore some users choose to build such logic into their application.
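To make the flexible-schema idea concrete, here is a minimal in-process sketch in which each row stores only the columns it actually has. The `SparseTable` class and its methods are hypothetical and are not the API of Cassandra or any Bigtable-style product; the sketch ignores distribution, persistence, and consistency entirely.

```python
from typing import Dict, List, Optional


class SparseTable:
    """Toy wide-column table: every row keeps only the columns it was given."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        # A single-row write; there is no way to update several rows atomically.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        # A missing row or a missing column both simply yield None.
        return self._rows.get(row_key, {}).get(column)

    def columns(self, row_key: str) -> List[str]:
        return sorted(self._rows.get(row_key, {}))


table = SparseTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.com")
table.put("user:2", "name", "Grace")  # this row has no email column at all
```

Note that `put` touches exactly one row. Any operation that must change two rows together would have to be coordinated by the application, which is the extra logic the paragraph above refers to.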
Relational or NewSQL Databases
The relational or NewSQL databases are newer and provide distributed transactions, analytics, and reporting. At one end, VoltDB and MemSQL are good at transactions and have minimal support for analytics (neither fully supports joins). At the other end is Clustrix, which supports scalable transactions and massively parallel processing for analytics. This is similar to what is found in columnar analytics databases. Clustrix is the most feature-rich scale-out database on the market.
Whether you want a simple key-value store or a fully functional relational database, there are a variety of good options to fit your application.
The SQL database landscape is increasingly complex and difficult to conceptualize. With the market shift away from scale-up databases, many new players have emerged that are highly specialized. Yet, for many applications, the ideal is a single database that supports transactions, runs real-time queries on the latest data and can grow by adding nodes. In the graphic below, we have distilled the landscape to highlight the capabilities of major SQL databases, including Clustrix.
Many of the above databases are complementary, and which one you use depends on the particular workload you’re running. Let’s look at each category in detail.
In-memory Row Stores
These databases are held entirely in memory (RAM) to support fast transactions. The data may also be distributed and replicated across multiple nodes. This type of database is designed for smaller data sets (<1 TB). These databases do very fast point reads and writes, which suits them to workloads that require low latency (for example, under 10 ms). However, they have very limited support for analytic queries.
In-memory Column Stores
These databases are held entirely in memory (RAM), and the data is stored as columns for fast analytics. So far, the only available product in this category is SAP HANA.
Single Node Row Stores
This group contains all the primary databases that are good at both transactions and analytics, but do not scale beyond a single server. Some developers use sharding to scale them beyond a single node, but at the considerable cost of additional development and administration overhead. These databases were architected many years ago to work on a single node.
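The sharding approach mentioned above usually means the application itself decides which server holds each row. A minimal sketch of hash-based routing follows; the shard names and the `shard_for` and `group_by_shard` helpers are hypothetical, and real deployments also have to handle resharding, failover, and cross-shard transactions, none of which are shown.

```python
import hashlib
from typing import Dict, List, Sequence

SHARDS = ["db-0", "db-1", "db-2"]  # hypothetical single-node database servers


def shard_for(key: str, shards: Sequence[str] = SHARDS) -> str:
    """Map a key to one shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(keys: Sequence[str]) -> Dict[str, List[str]]:
    """Split a multi-key request into per-shard batches the application must send."""
    batches: Dict[str, List[str]] = {}
    for key in keys:
        batches.setdefault(shard_for(key), []).append(key)
    return batches


batches = group_by_shard(["customer-1", "customer-2", "customer-3", "customer-4"])
```

Even in this small form, any query that spans several keys has to be split, sent to multiple servers, and merged in application code. That is the development and administration overhead the paragraph describes.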
NoSQL Write Databases
These databases are designed for unstructured data and non-relational workloads. However, they are not designed for concurrency or complex write loads. They also do not support complex analytic queries. Most do not even support basic constructs such as joins.
Shared Data Row Stores
These have multiple query processing nodes but a single data node. The query processing nodes pull data to process queries and push back updated data. These solutions deliver high availability but do not work well with high concurrency, since multiple nodes can write to the same data table at the same time. Also, a single node is used to process each query, which hurts the performance of analytic queries.
Shared Nothing Row Stores
Clustrix is the only database in this category. Databases in this category are able to scale transactions and support real-time analytics. Row orientation allows for scalable transaction performance. Massively Parallel Processing (MPP) allows them to run fast real-time analytics. Data sets can range up to 100 terabytes. Beyond that, for an analytics workload, columnar storage and compression become critical.
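The general idea behind massively parallel processing can be sketched in a few lines: each node aggregates its own partition of the rows, and a coordinator combines the small partial results. This is a generic illustration and not a description of how Clustrix is implemented. The function names are hypothetical, threads stand in for separate nodes, and the sketch assumes at least one partition and at least one value overall.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def partial_aggregate(partition: Sequence[float]) -> Tuple[float, int]:
    """Work done locally on one node: sum and count of its own rows."""
    return sum(partition), len(partition)


def parallel_average(partitions: List[Sequence[float]]) -> float:
    """Coordinator: run the partial aggregates in parallel, then merge them."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Three partitions standing in for rows spread across three nodes.
average = parallel_average([[10, 20], [30], [40, 50, 60]])
```

Only a sum and a count travel from each node to the coordinator, so the amount of data moved stays small even as the partitions grow.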
Shared Nothing Column Stores
These databases are designed for offline analytics. With columnar compression and by reading only the columns that a query requires, they are able to scale from hundreds of terabytes to petabytes of data. However, due to columnar orientation, these databases are not able to support transactions or fast writes (some of them allow fast appends, but not fast updates). They usually rely on ETL from primary transactional databases.
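A small sketch can show why columnar layout helps analytics. The sample rows, the `to_columns` helper, and the use of run-length encoding as the compression scheme are assumptions made for illustration; real column stores use a variety of encodings and on-disk formats. The sketch assumes every row has the same fields.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple

# Made-up sample data in row orientation.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 50.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one column reads only that column's values.
total_amount = sum(columns["amount"])

# Repeated values stored next to each other compress well.
encoded_region = run_length_encode(columns["region"])
```

The same layout explains the write limitation: updating one order means touching a separate list for every column, whereas a row store changes a single record in place.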
Amazon Database-as-a-Service Landscape
Amazon has developed a rich ecosystem of services. Within this ecosystem, it’s easy to scale application servers and increasingly easy to deploy applications. However, at the database layer, there has been a significant gap.
For analytics, there is Amazon Redshift, a columnar SQL database, and Amazon EMR (Elastic MapReduce) for more flexible and unstructured workloads.
For the primary database, users have a scale-out NoSQL solution, Amazon DynamoDB, which provides key-value lookup. There is also a single-node SQL solution, Amazon RDS, which provides MySQL, SQL Server or Oracle with space up to 1TB.
However, there are no services that provide a scale-out SQL primary database when data needs to scale beyond the 1TB (or the maximum query load) that RDS provides. Clustrix fits in this segment by providing scale for transactions, acceleration for analytics, and high availability through having multiple data copies.
Like AWS, most cloud vendors are missing this critical component in the data layer, which limits adoption by a whole category of applications.
The Future of the Database