Challenges with Google’s Cloud Spanner


At this point everyone has heard of Google’s Cloud Spanner. It’s Google’s first database offering that can scale out write transactions linearly, as well as provide cross-node transactions with ACID guarantees, and it claims to solve significant problems that Amazon’s Aurora cannot.

But as discussed before, Spanner is able to accomplish that scale by accepting some trade-offs, such as transaction latency that grows with each participating node (“participant”). And of course there’s the elephant in the room for MySQL developers: Google Spanner is not MySQL compliant, nor even ANSI SQL compliant.

It turns out those aren’t the only trade-offs Google has made with Cloud Spanner. Let’s walk through a more complete list, and then highlight what each means to your MySQL-compatible workload (and business rules).

After spending the better part of a week at Percona Live 2017, I answered the following questions enough times that it’s probably a good idea to write the answers down:

Spanner isn’t MySQL compliant, nor even ANSI SQL compliant, and doesn’t plan to be.

We discussed this before, but it’s still worth noting. The biggest interest and use case among Google Spanner’s target customers is migrating workloads from MySQL. But at the Google Next Deep Dive I attended, there was quite a bit of discontent in the crowd at the lack of MySQL support, both currently and on the Spanner roadmap. Migrating a workload from MySQL requires a lot of effort, both design and coding, to ensure a performant implementation. And as detailed below, mistakes in schema design can lead to system downtime, because there are no online schema changes for primary keys or interleaving.

Cloud Spanner doesn’t provide dynamic referential integrity, except as specified at ‘design time.’

When designing your data model, Spanner requires that parent/child relationships be defined at the table level. The documentation mentions elsewhere that interleaving tables is an option for performance:

You can optionally define parent-child relationships between tables if you want Cloud Spanner to physically co-locate their rows for efficient retrieval -- Cloud Spanner Schema and Data Model

However, for referential integrity, child tables must be interleaved. If you do not define child tables as “interleaved,” then referential integrity (cascading updates and deletes) can neither be defined nor automatically enforced. If you later realize that you’d like to change a table into a child (“interleaved”) table, you’re going to have to drop and recreate that table.
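As a rough sketch of what that design-time declaration looks like (table and column names here are hypothetical, patterned after Google’s documentation examples), the parent/child relationship and its cascade behavior live entirely in the DDL:

CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024)
) PRIMARY KEY (SingerId);

-- The child table's key must start with the parent's key, and the INTERLEAVE
-- clause is the only place cascade deletes can be declared.
CREATE TABLE Albums (
  SingerId        INT64 NOT NULL,
  AlbumId         INT64 NOT NULL,
  AlbumTitle      STRING(MAX),
  MarketingBudget INT64
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;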

Tables must be interleaved to gain the most performance.

It turns out that JOINs across non-interleaved tables suffer a performance penalty. Thus, if your workload relies on ad-hoc query generation, any table relationships that aren’t already instantiated as parent/interleaved tables will see lower JOIN performance, even if the queries use FORCE_INDEX.
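For illustration, a join over the hypothetical interleaved pair sketched above keeps parent and child rows physically together; the same join against a non-interleaved Albums table pays the penalty described here:

-- Child rows are stored with their parent row, so this join avoids
-- stitching data together from across the cluster.
SELECT s.LastName, a.AlbumTitle
FROM Singers AS s
JOIN Albums AS a ON a.SingerId = s.SingerId
WHERE s.SingerId = 42;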

Cloud Spanner doesn’t provide online schema changes for keys.

The keys of a table can't change; you can't add a key column to an existing table or remove a key column from an existing table. -- Cloud Spanner Schema and Data Model

This means that if you want to change the PK of your table, you’re going to have to drop and recreate that table, which means you have to pause your workload, i.e., take downtime.

Why would you want to change the PK of your table? If you’ve ‘classically’ designed your tables’ PKs around timestamps, auto-increment IDs, or concatenated (rather than Google-recommended hash) keys, this can come up whenever business rules change. The Spanner documentation highlights the issue in bold, all-caps, and red.

-- ANTI-PATTERN: USING A COLUMN WHOSE VALUE MONOTONICALLY INCREASES OR
-- DECREASES AS THE FIRST KEY PART  -- Cloud Spanner Schema Design

Even more counterintuitively, if your workload experiences unexpected data growth, the original PK you chose might degrade the system’s performance over time. How is that possible? Spanner relies on the application’s key design to spread the load across multiple servers, specifically:

Be careful when choosing a primary key to not accidentally create hotspots in your database -- Cloud Spanner Best Practices for Schema Design
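A minimal sketch of what that means in practice (hypothetical table and column names): the first layout below follows the documented anti-pattern, while the second spreads writes by leading the key with an application-computed shard value. And because keys can’t be changed online, moving from the first layout to the second later means dropping and recreating the table.

-- Anti-pattern: a monotonically increasing first key part funnels every
-- insert into the same region of the key space, creating a hotspot.
CREATE TABLE EventLog (
  EventTimestamp TIMESTAMP NOT NULL,
  EventId        INT64     NOT NULL,
  Payload        STRING(MAX)
) PRIMARY KEY (EventTimestamp, EventId);

-- One common workaround: lead with a shard value the application computes,
-- e.g. hash(EventId) % 16, so writes spread across servers.
CREATE TABLE EventLogSharded (
  ShardId        INT64     NOT NULL,
  EventTimestamp TIMESTAMP NOT NULL,
  EventId        INT64     NOT NULL,
  Payload        STRING(MAX)
) PRIMARY KEY (ShardId, EventTimestamp, EventId);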

Some Cloud Spanner schema changes require validation, which is not controlled by MVCC.

For example:

If you're adding NOT NULL to a column, Cloud Spanner will almost immediately begin rejecting writes for new requests that use NULL for the column. If the new schema change ultimately fails for data validation, there may be a period of time when writes were blocked even if they would have been accepted by the old schema. -- Cloud Spanner Schema and Data Model (emphasis added)
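The schema change being described is an ordinary-looking ALTER (hypothetical names again; in Spanner DDL the column type is restated along with the new constraint). A sketch:

-- New writes with a NULL AlbumTitle start being rejected almost immediately,
-- even though the statement can still fail later when existing rows are validated.
ALTER TABLE Albums ALTER COLUMN AlbumTitle STRING(MAX) NOT NULL;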

Cloud Spanner doesn’t automatically use non-covering indexes, requiring FORCE_INDEX directives in your code.

Cloud Spanner does not automatically choose a secondary index if the query requests any columns that are not stored in the index. --Cloud Spanner Secondary Indexes

This means not only do your application’s SQL writes have to be converted from SQL to Google’s APIs, but your application programmers’ efforts will have to be bolstered by data architects who know how to design around the limits of Spanner, including avoiding the Spanner-specific anti-patterns that degrade performance. That work includes changing PKs to hashes, removing auto-increment columns, correctly designating parent/child data relationships as “interleaved” tables, and interleaving the necessary indexes, all at design time.
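To make that concrete, here’s a rough sketch (hypothetical index names, continuing the tables above) of the kind of design-time work involved: a secondary index that stores an extra column so it can cover a query, the FORCE_INDEX hint that query then carries, and an index interleaved alongside its parent table:

-- STORING copies MarketingBudget into the index so the query below can be
-- answered from the index alone.
CREATE INDEX AlbumsByAlbumTitle ON Albums(AlbumTitle) STORING (MarketingBudget);

-- Spanner's hint syntax for directing a query at a secondary index; per the
-- documentation above, the index won't be chosen automatically if the query
-- asks for columns the index doesn't store.
SELECT AlbumId, AlbumTitle, MarketingBudget
FROM Albums@{FORCE_INDEX=AlbumsByAlbumTitle}
WHERE AlbumTitle >= 'A' AND AlbumTitle < 'B';

-- Indexes on interleaved tables can themselves be interleaved in the parent.
CREATE INDEX AlbumsBySingerTitle ON Albums(SingerId, AlbumTitle), INTERLEAVE IN Singers;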

All of this knowledge is specific to Spanner, which means your MySQL code conversion efforts will take that much longer before you can see the performance benefits of Spanner.

Data migration to Cloud Spanner is tricky.

Not only is the list of supported data types severely reduced from MySQL 5.7, but actually getting the data into Spanner requires a carefully designed and scaled-back (read “slow”) ingest:

Avoid writing rows in primary key order. ...Ideally each data load worker would be writing to a different part of the key space.

Partition the ordered key space into ranges, and then have each range processed by a worker batching the data into that range into approximately 1MB or 10MB chunks. -- Google Spanner Schema Design

Note that Quizlet found they had the fewest errors when they limited themselves to 5MB per insert transaction, and that the data ingest took “longer than they expected.”

Cloud Spanner doesn’t currently support consistent backups controlled by MVCC, nor replication across regions.

To be fair, both of these are on the Spanner roadmap. Cross-region replication is where the vaunted high-speed Google back-end network, custom hardware, and private fiber will shine, as will all those atomic clocks for consistency. It’s just that you can’t leverage those yet, so be sure to factor that in when planning an implementation.

Cloud Spanner TCO is larger than users expected.

And a less ‘technical’ consideration is price. As DevOps, DBAs, and Data Architects we don’t usually consider the price of the implementation; we’re working with product features and technical feasibility. However, at some point the ‘business side’ of your company has to authorize the purchase of the actual system(s); whether it’s licensing, support, or DBaaS solutions, all of them have a price, and all of them have to be ‘justified.’ Knowing how to frame your implementation recommendations as pros/cons-based ‘business cases’ will help you greatly, whether with your department head or the folks in finance.

That being said, Cloud Spanner costs more than you’d think. At Google Next I talked to a few beta customers, and they were a bit taken aback by how much it costs: on the order of $10-$40k for a full-scale test of their workload, spiking sharply for any access from outside Google Cloud Platform. The Quizlet presentation at Google Next was terrific; it was implied that the Quizlet team got a ‘good price’ from Google.

In Summary

On some level, Google Cloud Spanner harkens back to the time before relational databases, when the schema had to be set at design time. If you know the queries you’re going to run, and how big your data is going to be, deploying Cloud Spanner is reasonably straightforward. You’re still required to change your code, as well as be responsible for making sure it runs well on a proprietary system (i.e., you’re still locked in). Once you’re done with that work, your workload should perform much better than on MySQL or Aurora.

However, if your workload and business rules require ongoing flexibility (and whose don’t in this fast-moving, cloud-scale world?), you might want to consider an RDBMS that:

  • Is like Spanner:
    • Linearly scales out write and read transactions
    • Maintains full ACID guarantees across all the nodes in the cluster
  • Provides more than Spanner:
    • Provides full RDBMS functionality such as referential integrity
    • Provides online schema changes (i.e., no dumping and recreating tables to change PK/FK relationships)
    • Automatically uses available indexes
    • Provides consistent backups controlled by MVCC today
    • Provides full ANSI SQL access to DML and DDL
    • Replicates across regions today

 

Edited for clarity and further attribution on May 9th, 2017.

 
