Five-Nines Availability on a Cloud Database in Real World Applications


One thing I really enjoy about my role at Clustrix is the exposure to a wide variety of customer environments and applications around the world. While supporting our customers, I’ve become familiar with the challenges they face in developing their own products and, with a bit of luck, helped them succeed by getting more mileage out of their ClustrixDB installations.

Everyone on our team shares this philosophy: we’re here to help. So we were excited when Gartner validated it, reporting that our customers “…were extremely satisfied in their interactions with Clustrix” (Gartner 2013 Magic Quadrant for Operational Database Systems).

You Get What You Measure

Early this year we launched an initiative to measure our demonstrated availability as an HA database. We wanted to prove to ourselves that our marketing department wasn’t getting ahead of our product’s capabilities. More importantly, we wanted measurement to be the first step toward further improvement.

From the very beginning, we built ClustrixDB to automate recovery processes that require manual intervention on other systems. Automated recovery is not only necessary for true high availability; it also keeps the system simple. Easy. A go-back-to-sleep kind of system that won’t have you working all night to rebuild a failed server.

Carrier Grade

The industry benchmark for availability is expressed in nines, as the percentage of time a system is available. Five-nines availability (99.999%) is considered really good. Carrier grade good. To achieve five-nines availability in a given month you either have zero issues or recover from any and all issues in roughly 26 seconds or less in total (0.001% of a 30-day month). There is no reliable way for humans to respond to issues, let alone resolve them, in that amount of time. Thus a five-nines system must recover automatically and fail infrequently.
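The monthly downtime budget follows from simple arithmetic. Here is a minimal sketch in Python; the function name and the 30-day month are assumptions made for illustration, not anything taken from ClustrixDB.

```python
def downtime_budget_seconds(nines: int, days: float = 30.0) -> float:
    """Allowed downtime, in seconds, for `nines` nines over `days` days.

    Illustrative helper: three nines means 99.9%, five nines means 99.999%.
    """
    unavailability = 10.0 ** (-nines)      # fraction of time allowed down
    period_seconds = days * 24 * 60 * 60   # length of the measurement window
    return period_seconds * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        budget = downtime_budget_seconds(n)
        print(f"{n} nines: {budget:,.2f} seconds of downtime per 30-day month")
```

For a 30-day month the five-nines budget comes out to just under 26 seconds, and each additional nine shrinks the budget by a factor of ten.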

Our challenge with ClustrixDB was complicated by our desire to create a scale-out cluster. As systems get larger and more complicated, the odds of failure increase. Beating these odds requires high-quality components, careful testing, and of course software that knows how to recover from failures automatically.
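To see why size works against you, consider a simplified model: assume each node fails independently with the same probability in a given month. Both the independence assumption and the 1% figure below are illustrative choices, not measured ClustrixDB data.

```python
def prob_any_failure(p_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` nodes fails,
    assuming independent failures with per-node probability `p_node`."""
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    p = 0.01  # hypothetical per-node monthly failure probability
    for n in (1, 3, 10, 30):
        chance = prob_any_failure(p, n)
        print(f"{n:>2} nodes: {chance:.1%} chance of at least one failure")
```

Under this model the chance of seeing some failure grows with every node added, which is why automatic recovery matters more as a cluster scales out.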

Being Our Own Worst Critic

When we started counting time spent unavailable, we decided to take the harshest, most critical view we could of our availability. We counted unavailable time during all maintenance windows required by Clustrix. We counted degraded availability caused by customers overloading the system. We counted all supported systems, including our customers’ development and staging environments. Any outage caused by our software, or that could be handled better in future versions of our software, we counted.

We discovered that roughly 80% of all clusters enjoyed 100% availability in a given month. The other 20% had issues related to hardware failure, customer workload, or even operator error. This data really helped us prioritize improvements to our software and support procedures. By publishing our monthly availability score internally and discussing the outages that were holding us back, we kept the entire organization focused on the nines.
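One plausible way to turn per-cluster downtime into a single monthly score is a plain average of each cluster's availability. The sketch below assumes that approach and uses made-up downtime figures; it is not our actual scoring method or real customer data.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def availability(downtime_seconds: float,
                 period_seconds: float = SECONDS_PER_MONTH) -> float:
    """Fraction of the period during which one cluster was available."""
    return 1.0 - downtime_seconds / period_seconds


def fleet_average(downtimes: list[float]) -> float:
    """Unweighted mean availability across all clusters."""
    return sum(availability(d) for d in downtimes) / len(downtimes)


# Hypothetical month: 8 of 10 clusters with no downtime, 2 with short outages.
sample_downtimes = [0, 0, 0, 0, 0, 0, 0, 0, 60, 120]
score = fleet_average(sample_downtimes)

if __name__ == "__main__":
    print(f"Average availability: {score:.5%}")
```

An unweighted mean treats a small staging cluster the same as a large production one; weighting by cluster size or traffic would be an equally reasonable choice.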

Achievement Unlocked

Recently, our average availability score crossed into five-nines territory. It’s a milestone we plan to hold on to. And while there may be bumps in the road, our upward trajectory gives us confidence that we are doing the right things for our product and our customers.