Optimizing Interrupt Handling on an AWS Cluster

I recently took a hard look at how to configure interrupt handling on a cluster for optimal performance for our shared-nothing RDBMS, ClustrixDB.

First, a little background on our interrupt handling.

One of the performance characteristics of our ClustrixDB engine is a relatively high interrupt rate on our processors due to high-volume message handling. To compensate, we dedicate core 0 on each node to interrupt processing and leave the remaining cores for database processing. This works relatively well, but left unmanaged it can still lead to situations where we become bottlenecked on core 0. This is particularly true for nodes with a large number of cores (32 cores per node here) and/or virtualized environments such as AWS, where the interrupt rate is higher.

To get around core 0 bottlenecks in the past, we’ve enabled irqbalance in an attempt to spread hardware interrupts across the cores. But irqbalance adds overhead of its own and doesn’t actually produce a good balance across the cores.

Consider, for example, the following two tests using a four-node cluster (32 cores and 64GB of RAM per node). In the first test, using a high-concurrency workload, core 0 on each node is driven to high utilization, leaving the rest of the cluster underutilized while it waits on interrupt processing from core 0. In the second test, with irqbalance enabled, overall CPU utilization across the cluster improves, but now core 0 is underutilized relative to the other cores, and interrupt handling is concentrated on 8 different cores, where it competes with database processing.

Figure 1: CPU utilization by core over a 5 minute test interval without irqbalance. There are 32 cores per node, 128 cores in total.
Figure 2:  CPU utilization by core over a 5-minute test interval with irqbalance. There are 32 cores per node, 128 cores in total.


The question is whether we can do better.

The answer turns out to be yes, if we’re a little smarter in how we configure the nodes.

A potentially better way to manage interrupts

After significant testing, we have zeroed in on a potentially better solution:

1)  Disable irqbalance.  We’ll manage hardware interrupts and softirqs with an alternative method.

2)  Explicitly ensure that every hardware interrupt that can be handled on core 0 is handled on core 0, using the smp_affinity setting.

3)  Use Receive Packet Steering (RPS) to balance softirqs across the cores of each node, relieving core 0 and spreading the overhead evenly across the remaining cores.

For the cluster in these tests, these configurations led to better overall CPU utilization and better core balance.
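To sanity-check the balance while a test is running, per-core hardware interrupt and softirq time can be watched directly; a minimal check using mpstat from the sysstat package (assuming sysstat is installed on the nodes):

mpstat -P ALL 1

The %irq and %soft columns show, per core, how much time is being spent in hardware interrupt and softirq handling respectively.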


Figure 3:  Workload without irqbalance, but with RPS.

Does this make a difference?

Turns out that for our clusters on bare metal, using either irqbalance or RPS results in approximately the same performance, as shown in Figure 4 below. In both cases we were able to drive the overall CPU utilization of the cluster into the high 80s before saturating the system. How this will play out on a more modern chipset is yet to be determined. Where the benefit really comes into play is in AWS environments, where irqbalance isn’t necessarily an option. In those environments we’ve been able to significantly improve performance by using RPS to balance out the interrupt workload (Figure 5).


Figure 4:  Comparison of various interrupt balancing on bare metal

Figure 5: Performance curves with various interrupt balancing on AWS c3.4xlarge

About our interrupt handling test environment

The workload used in these tests was sysbench version 0.4 point selects (100% reads).
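For reference, a point-select run with sysbench 0.4 looks roughly like the following; the host, credentials, thread count, table size, and duration shown are placeholders rather than the exact values used in these tests:

sysbench --test=oltp --oltp-test-mode=simple \
   --mysql-host=<node> --mysql-user=<user> --mysql-password=<password> \
   --oltp-table-size=1000000 --num-threads=64 \
   --max-time=300 --max-requests=0 run

In sysbench 0.4, the simple OLTP test mode issues only primary-key point selects, which matches the 100% read workload described above.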

To enable irqbalance, on each node execute:

service irqbalance start

To disable irqbalance, on each node execute:

service irqbalance stop
chkconfig --level 123456 irqbalance off
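On nodes running a systemd-based distribution instead of SysV init scripts, the equivalent would be along these lines (assuming the service unit is named irqbalance, as it is on most distributions):

systemctl stop irqbalance
systemctl disable irqbalance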

To map hardware interrupts to core 0, on each node execute:

# Walk every IRQ that exposes an smp_affinity file and pin it to core 0 (bitmask 0x1)
for INTERRUPT in $(ls /proc/irq/*/smp_affinity | cut -d/ -f4) ; do
   echo "Interrupt ${INTERRUPT}"
   cat /proc/irq/${INTERRUPT}/smp_affinity
   echo 1 > /proc/irq/${INTERRUPT}/smp_affinity
   cat /proc/irq/${INTERRUPT}/smp_affinity
done

Note that some smp_affinity settings cannot be changed, so the script above will trigger a write error for those interrupts. Those errors can be ignored; the script updates every interrupt that can be updated.
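To confirm that the interrupts actually moved, the per-CPU counters in /proc/interrupts can be watched; the CPU0 column should keep climbing while the other columns stay nearly flat (eth0 here is just an example device name):

watch -n1 'grep eth0 /proc/interrupts'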

To enable RPS, on each node execute:

# Allow network receive softirqs for eth0 to be processed on any of the 32 cores (mask ffffffff)
for x in /sys/class/net/eth0/queues/rx-* ; do
   echo $x/rps_cpus
   cat $x/rps_cpus
   echo ffffffff > $x/rps_cpus
   cat $x/rps_cpus
done

# Size the global socket flow table used by Receive Flow Steering (RFS)
cat /proc/sys/net/core/rps_sock_flow_entries
echo 65536 > /proc/sys/net/core/rps_sock_flow_entries
cat /proc/sys/net/core/rps_sock_flow_entries

# Set the per-receive-queue flow count
for x in /sys/class/net/eth0/queues/rx-* ; do
   echo $x/rps_flow_cnt
   cat $x/rps_flow_cnt
   echo 8192 > $x/rps_flow_cnt
   cat $x/rps_flow_cnt
done
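To check that RPS is taking effect, the NET_RX row in /proc/softirqs should now increase on all cores rather than almost exclusively on core 0:

watch -n1 'grep NET_RX /proc/softirqs'

Also note that these /proc and /sys settings do not persist across a reboot, so they need to be re-applied at boot (for example, from an init script).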

More about the interrupt handling tools:

  • irqbalance: irqbalance is a tool that distributes hardware interrupts across cores for improved system performance. As noted above, it is the mechanism we have used in the past to work around core 0 bottlenecks.
  • smp_affinity settings: Using the /proc/irq/<irq>/smp_affinity files exposed by the Linux kernel, we can pin a given hardware interrupt to a specific core (see the bitmask example after this list). While we dedicate core 0 to interrupt processing, it has not been our practice to also explicitly map interrupts to core 0. Core 0 is the default for many hardware interrupts, but not all. By using smp_affinity settings we can be explicit about where our interrupt handling runs.
  • Receive Packet Steering (RPS): Unlike irqbalance and smp_affinity settings, which deal specifically with hardware interrupts, RPS is a software mechanism that lets us control on which cores network receive softirqs are processed.
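Both smp_affinity and rps_cpus take a hexadecimal CPU bitmask in which bit N selects core N. As a quick illustration (the IRQ number below is a placeholder):

echo 1 > /proc/irq/<irq>/smp_affinity                      # bit 0 set: core 0 only
echo ffffffff > /sys/class/net/eth0/queues/rx-0/rps_cpus   # bits 0-31 set: all 32 cores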