Chapter 12
Tuning HyperDex

Out of the box, HyperDex provides a stable system with high performance. This chapter is dedicated to those who are looking to get even more performance out of their system; it explores the system settings that can impact HyperDex’s performance and stability.

As we navigate the available tuning options, keep in mind that these guidelines are general and may not apply to all workloads. Before tuning a cluster, it is advisable to set up an end-to-end benchmark and regularly compare the impact each individual setting has on cluster stability and performance. Generally, the best benchmarks are derived directly from a HyperDex application, but the YCSB Benchmark (Section ??) provides a set of workloads common to many applications and will work nearly as well.

12.1 Filesystem

The filesystem is the first subsystem to tune to improve performance. Because most operating systems are designed to accommodate a wide variety of workloads, their default filesystem configuration will usually not provide optimal performance for a key-value store.

12.1.1 Choosing the Right Filesystem

When deploying a new cluster, the filesystem setup and configuration is the first opportunity for optimization. HyperDex works with any POSIX-compatible filesystem because it relies on simple and portable primitives. Filesystems with complex features that extend beyond the basic POSIX feature set (such as those provided by ZFS and btrfs) can impose performance penalties if such features are not used in the deployment. Simpler filesystems such as ext3, ext4, and XFS are usually a better choice because they are commonly available on Linux distributions, have received extensive testing, and provide high performance.

To choose the best filesystem for a new deployment, start by consulting the appropriate documentation for each filesystem available on your platform.

If you still have doubts, it is best to go with XFS, ext4, or UFS, depending on their availability on your platform. Check your distribution’s documentation for details.
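
If you are unsure which filesystem currently backs the partition that will hold HyperDex data, it can be checked directly (the path below is an example):

  $ df -T /hyperdex/data/dir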

12.1.2 Improving Read Performance by Disabling Access Time Updates

By default, most filesystems update a file’s access time each time the file is read. For key-value workloads, these access-time updates turn each read into a metadata write. This can convert a relatively inexpensive cached read into an expensive disk write, which will, in turn, impact the performance of reads from disk and of other writes to disk.

Because HyperDex does not rely upon access times for correctness, we can obtain higher read performance by disabling access-time updates. On XFS filesystems, this behavior has been the default since 2006, so there is no need to tune this setting. Consult the filesystem documentation for details on other filesystems. For example, on ext4, this behavior may be obtained by setting the relatime flag in /etc/fstab for the partition with HyperDex data:

  /dev/sdb1 /hyperdex/data/dir ext4 defaults,relatime 0 2
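
After editing /etc/fstab, the new option can usually be applied without a reboot by remounting the partition. As a sketch, using the example mount point from above:

  # mount -o remount,relatime /hyperdex/data/dir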

12.1.3 Improving ext3/4 Write Performance with data=writeback

The ext3 and ext4 filesystems are journaled filesystems, which means that they will often impose an order on disk operations to improve integrity during crashes. HyperDex relies upon application-level integrity checks that render this serialization at the filesystem level unnecessary. By relaxing these ordering constraints, we can obtain higher performance on journaled ext filesystems, comparable to the default behavior of XFS. To do this, add the data=writeback option to the filesystem’s entry in /etc/fstab:

  /dev/sdb1 /hyperdex/data/dir ext4 defaults,data=writeback 0 2
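
Alternatively, the journal mode can be recorded as a default mount option in the filesystem superblock with tune2fs; the change takes effect the next time the filesystem is mounted (the device name below is the example from above):

  # tune2fs -o journal_data_writeback /dev/sdb1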

12.1.4 Changing the Linux Disk Scheduler

On Linux, the disk scheduler impacts the order and latency of disk requests. For server workloads, the default scheduler may have undesirable consequences for the 99th percentile latency. A better option is to switch to the noop scheduler on SSDs and the deadline scheduler on spinning disks. To do this, pick the command that matches your device type, where DEV is your block device:

  # echo noop > /sys/block/DEV/queue/scheduler
  # echo deadline > /sys/block/DEV/queue/scheduler
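
The currently active scheduler can be confirmed by reading the same file; the active choice is shown in brackets:

  # cat /sys/block/DEV/queue/scheduler

Note that values written under /sys do not persist across reboots, so the chosen scheduler must be reapplied at boot time, for example from an init script or a udev rule.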

12.1.5 Linux VM Tuning

The Linux kernel buffers writes to disk and flushes them asynchronously in the background. On occasion, the write buffer can grow to such an extent that the kernel will use HyperDex’s scheduled time slots to flush data to disk instead, appearing to stall the application for periods of time. There are multiple system controls that can be tuned to avoid this situation and reduce write-latency variation.

To properly tune these parameters, you’ll need to know the approximate throughput at which your disk devices will service write requests. This is important because we are simultaneously tuning the threshold at which the background flush thread will wake up, and the threshold at which the kernel will begin to block HyperDex. In the example below, we’ll assume each server has a single SSD capable of 500 MB/s.
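
As a sketch under that assumption, one plausible starting point is to wake the background flusher once roughly one second of writeback (about 512 MB) has accumulated, and to begin throttling writers only after several seconds’ worth (here, 2 GB). The exact byte values are assumptions and should be validated against your own benchmark:

  # sysctl -w vm.dirty_background_bytes=536870912
  # sysctl -w vm.dirty_bytes=2147483648

To make these settings permanent, add the equivalent vm.dirty_background_bytes and vm.dirty_bytes lines to /etc/sysctl.conf.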

For more information on tuning the Linux virtual memory subsystem, consult the Linux kernel documentation.

12.2 Improving Stability by Increasing Open File Limits

Internally, HyperDex maintains multiple open file descriptors corresponding to network sockets and data files. Most Linux systems restrict the number of open files to 1024 by default. We recommend setting this value to 65536 or higher. This can be accomplished by adding the following line to /etc/security/limits.conf (the exact file may vary on your distribution):

  *        -       nofile          65536
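
After starting a new login session, the limit in effect can be verified from the shell, or from /proc for a process that is already running (PID below is a placeholder):

  $ ulimit -n
  $ cat /proc/PID/limits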

12.3 Deploying on EC2 and Other Xen Environments

Occasionally, Xen-based environments such as EC2 have bugs that demanding HyperDex workloads can expose. When deploying an EC2 cluster, ensure that all nodes are deployed with EC2 enhanced networking enabled to avoid such bugs. In environments where enhanced networking is unavailable, the bugs can be worked around with the following commands:

  # ethtool -K eth0 rx off tx off sg off tso off ufo off gso off gro off lro off
  # sysctl -w net.ipv4.route.flush=1
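
The current offload settings can be inspected afterwards with ethtool’s lowercase -k flag to confirm that the changes took effect; note that these settings do not persist across reboots and must be reapplied when an instance restarts:

  # ethtool -k eth0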