Sharding: In Theory and Practice (Part One)

Part One: A Brief History of Sharding

Peter Zaitsev’s keynote at PerconaLive NYC 2012 contained a slide with the text, “sharding is messy.” This admission felt like a tide change to me because so many high-growth technology companies today are firmly entrenched in custom-sharded solutions into which they poured their blood, sweat, and tears.

Clustrix is a scalable database that eliminates the need for sharding, but few realize that because our product is a drop-in replacement for MySQL, we are also a great fit as a component within a sharded environment.

In this blog series, I’m going to describe what real sharded environments look like in practice (not just in theory), and identify several ways to address problems with sharding by implementing Clustrix as a component in an already-sharded architecture.

Where Did Sharding Start?

Before I joined Clustrix four years ago, I worked for several years at SixApart (parent company of blogging services LiveJournal and TypePad). Despite the fact that memcached was invented at LiveJournal, I am reluctant to attribute the invention of sharding to anywhere in particular. Many folks, including myself, were using the same principles of sharding back in the 90s with non-relational databases. So why does LiveJournal deserve mention in the history of sharding?

LiveJournal hit a scalability wall in 2004, but instead of being secretive about their outages, the staff described their challenges in great detail and enlisted the technical community on the site to help solve the problem. The Google File System white paper had just been released one year prior, inspiring many to create open source implementations like LiveJournal’s own MogileFS. But database operations are far more complex than simple file operations; therefore some functionality would have to be sacrificed.

After LiveJournal implemented its new database architecture and proved it could scale (by sacrificing ACID compliance and SQL JOINs), the company routinely presented it as open source on the blogosphere as well as at the annual USENIX conference, where the next generation of startups was listening with attentive ears. Google’s BigTable also began development in 2004, but was more secretive. When the BigTable whitepaper was finally released, it was too tailored to specific needs (i.e. indexing web pages).

Discussions of causality aside, LiveJournal did build what we now call “database sharding,” presented their architecture as open source at OSCON, and since then many Internet startups have followed suit.

In the next blog post in this series, I’ll take you on a deep dive into the design decisions of a sharded database and, in particular, discuss the differences between algorithmic and dynamic sharding.


Part One: A Brief History of Sharding

Part Two: The Differences Between Algorithmic and Dynamic Sharding

Part Three: What’s in a Shard?

Part Four: Using Memcached

Part Five: The Data Warehouse




Follow Clustrix on Twitter (@Clustrix), Facebook, and LinkedIn.

Visit our resources or documentation page for further reading.