Profile Driven Performance Optimization

Performance Optimization

Recently, one of our support engineers noticed that a customer cluster got a little sluggish during a routine maintenance operation. In looking back at historical data, we noticed that the cluster saw a decrease in transaction throughput. The system should have done its cleanup in the background at a low priority, but that didn’t happen. I thought it would be interesting to describe our process for addressing a performance problem like this one. I will also share the tools we developed and used.

Reproduce, Measure, and Automate

The symptom our support engineer noticed was a “sluggish” system. More accurately, we saw that query latencies increased and overall system throughput dropped. So our first step was to reproduce the problem under some approximation of the customer”s workload. It was a bit tricky to get all of the prerequisite conditions just right, but we finally landed on the right set of circumstances.

Once we reproduced the issue, the next step was to automate the test to ensure that we could consistently demonstrate the problem. It boiled down to using one of our standard performance harnesses with a couple of tweaks.

The Profile

Our most commonly used profiling tool is based on the Linux oprofile infrastructure. But we had to modify it to better suit our system. Why? Because substantial portions of our online casino canada codebase are written in an event driven/continuation passing style programming model. Call graph reporting in ofprofile is highly dependent on the stack, and the C stack often does not contain enough information to provide a valuable analysis in our system.

To get around this problem, we modified oprofile to gather additional information from the system. The key concept was adding a method for tagging a series of continuation calls and getting oprofile to sample these tags. The tag itself includes enough information to map the executing code to a logical system module hierarchy. The screen shot below shows an example output from one of our reporting tools.

Coming back to our original problem, the screen shot shows the output of the profile analysis tool during the dip in performance. The tool allows us to isolate the report to a specific processor. In this case we’re looking at the output for our management core.

Interpreting the results we see that:

  • The execution engine dominates performance, and most of that is in the storage engine
  • The storage engine spends 63% of all its cycles on CRC32 calculations
  • Our particular maintenance task was erroneously bound to the management core
  • The task should have a lower priority

CRC Performance

The root cause of the performance degradation was a scheduling issue. However, it caused me to take a closer look at our crc algorithm. We use the crc implementation from zlib, and it should be reasonably well optimized. Whipping up a quick benchmark, I we see that the zlib implementation takes about 41us to checksum a 32k block. In the test, we were reading about 224MB/s per node. That’s 293ms of checksums in a second. No wonder we saw a performance drop.

I looked around and found a paper from Intel describing their slice-by-eight crc approach. A quick benchmark revealed that the same 32k checksum could be done in 27us — a 37% percent improvement over zlib’s version.

With Nehalem-based processors, Intel introduced an SSE instruction for a hardware based crc computation. I haven”t had a chance to conduct my own benchmarks, but  the following results suggest that we could get a 2-3x speedup over slice by eight.

Source Code

I’m releasing the source for the tools referenced in this post. The user space oprofile daemon replacement requires our database runtime which I am not making public.