Chapter 11
Fault Tolerance

HyperDex recovers from failures automatically. When a node fails, HyperDex detects the failure, removes the failed node, and automatically reintegrates other nodes to restore the desired level of fault tolerance. The coordinator itself is made fault tolerant with a replicated state machine and can tolerate a configurable number of failures.

Let’s see both kinds of fault tolerance in action.

11.1 Coordinator

First, let’s demonstrate the coordinator’s fault tolerance. Bring up a single coordinator node using the same procedure as in previous chapters. Note that this single-node coordinator can be shut down and restarted, but it is unavailable while it is shut down for obvious reasons.

  hyperdex coordinator -f -l 127.0.0.1 -p 1982 -D /path/to/coord1

A HyperDex coordinator becomes more fault tolerant as more nodes are added. A simple majority of the nodes must remain available for the coordinator to make progress. For example, a cluster with three nodes can tolerate the failure of any one node at any point in time, while a cluster with five nodes can tolerate two simultaneous failures.

Internally, the HyperDex coordinator uses state machine replication to provide fault tolerance. Running at least five servers is recommended so that the cluster can tolerate at least two failures. To do so, bring additional coordinators online:

  hyperdex coordinator -f -l 127.0.0.1 -p 1983 -c 127.0.0.1 -P 1982 -D /path/to/coord2
  hyperdex coordinator -f -l 127.0.0.1 -p 1984 -c 127.0.0.1 -P 1982 -D /path/to/coord3
  hyperdex coordinator -f -l 127.0.0.1 -p 1985 -c 127.0.0.1 -P 1982 -D /path/to/coord4
  hyperdex coordinator -f -l 127.0.0.1 -p 1986 -c 127.0.0.1 -P 1982 -D /path/to/coord5

For each new server added to the cluster, the coordinator will automatically transfer a copy of the initialized state machine and replicate further operations on that node as well. In its current form, our cluster can tolerate two simultaneous failures or planned shutdowns. Go ahead and shut down one or more coordinator servers by pressing Control-C. When restarted, they’ll rejoin the cluster and pick up any state changes that occurred in their absence.
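
For example, if you shut down the coordinator listening on port 1983, re-running the same command it was started with, pointed at its existing data directory, should be enough for it to rejoin the cluster and catch up:

  hyperdex coordinator -f -l 127.0.0.1 -p 1983 -c 127.0.0.1 -P 1982 -D /path/to/coord2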

It’s even possible to completely shut down and restart the coordinator by shutting down and restarting all of its nodes.

11.2 HyperDex Daemons

HyperDex daemons are organized by the coordinator. When a daemon fails or shuts down, the coordinator automatically assigns a new server to take its place and initiates recovery procedures to ensure the new server holds an up-to-date copy of the data.

To see this in action, we first need to start a cluster:

  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2012 \
                     --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data1
  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2013 \
                     --coordinator=127.0.0.1 --coordinator-port=1983 --data=/path/to/data2
  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2014 \
                     --coordinator=127.0.0.1 --coordinator-port=1984 --data=/path/to/data3

If you look carefully, you’ll see that we’ve asked each daemon to connect to a different coordinator server. Each daemon connects to the coordinator in a fault-tolerant manner. Should the specified server fail after the daemon connects, the daemon will simply connect to another server in the cluster. When a failure of either a coordinator node or a daemon occurs, the coordinator ensures that all servers in the system will eventually be able to make progress.

Let’s see this in action by creating some data items we care deeply about and checking what happens to our data as failures occur (here, a and c are the admin and client connections established in previous chapters):

  >>> a.add_space('''
  ... space phonebook
  ... key username
  ... attributes first, last, int phone
  ... subspace first, last, phone
  ... create 8 partitions
  ... tolerate 2 failures
  ... ''')
  >>> c.put('phonebook', 'jsmith1', {'first': 'John', 'last': 'Smith', 'phone': 6075551024})
  >>> c.get('phonebook', 'jsmith1')
  {'first': 'John', 'last': 'Smith', 'phone': 6075551024}

So now we have a data item we deeply care about. We certainly would not want our NoSQL store to lose this data item because of a failure. Let’s create a failure by killing one of the three hyperdex daemon processes we started above. Feel free to use kill -9; there is no requirement that the nodes shut down in an orderly fashion. HyperDex is designed to handle a configurable number of crash failures.

  >>> # kill a node at random
  >>> c.get('phonebook', 'jsmith1')
  {'first': 'John', 'last': 'Smith', 'phone': 6075551024}
  >>> c.put('phonebook', 'jsmith1', {'phone': 6075551025})
  >>> c.get('phonebook', 'jsmith1')
  {'first': 'John', 'last': 'Smith', 'phone': 6075551025}
  >>> c.put('phonebook', 'jsmith1', {'phone': 6075551026})
So, our data is alive and well. Not only that, but the space continues to operate normally and handle updates at its usual rate.

Let’s kill one more server.

  >>> # kill a node at random
  >>> c.put('phonebook', 'jsmith1', {'phone': 6075551027})
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "hyperclient.pyx", line 473, in hyperclient.Client.put ...
    File "hyperclient.pyx", line 499, in hyperclient.Client.async_put ...
    File "hyperclient.pyx", line 255, in hyperclient.DeferredInsert.__cinit__ ...
  hyperclient.HyperClientException: Connection Failure
  >>> c.get('phonebook', 'jsmith1')
  {'first': 'John', 'last': 'Smith', 'phone': 6075551026}

Note that the HyperDex API currently exposes some failures to clients, so a client may have to catch HyperClientException and retry the operation; the HyperDex library does not resubmit failed operations on behalf of clients. In this example, behind the scenes, there were two node failures in the triply-replicated space. Each failure was detected, the space was repaired by removing the failed node, and normal operations resumed without data loss.
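
Below is a minimal sketch of such a retry loop. It assumes the client connection c from the examples above and the hyperclient binding shown in the traceback (newer bindings may name the module and exception differently); the helper name put_with_retry and the attempt count are illustrative only, not part of the HyperDex API.

  import hyperclient

  def put_with_retry(client, space, key, attrs, attempts=3):
      """Resubmit a put when the client raises an exception such as Connection Failure."""
      for attempt in range(attempts):
          try:
              return client.put(space, key, attrs)
          except hyperclient.HyperClientException:
              if attempt == attempts - 1:
                  raise  # give up after the final attempt

  # c is the client connection used throughout this chapter
  put_with_retry(c, 'phonebook', 'jsmith1', {'phone': 6075551027})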

11.3 Fault Tolerance Thresholds

HyperDex daemons and coordinators each tolerate a configurable number of failures before the system fails completely. For a desired fault-tolerance level f, you need to start at least f + 1 daemons and 2f + 1 coordinators. Both can survive more than f failures over the lifetime of the cluster, so long as failed nodes rejoin quickly enough to keep the number of simultaneous failures within the threshold.
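
As a quick illustration of this arithmetic, here is a hypothetical helper (not part of the HyperDex API) that computes the minimum process counts for a desired fault-tolerance level f:

  # Hypothetical helper, not part of HyperDex: minimum process counts for
  # a desired fault-tolerance level f (f + 1 daemons, 2f + 1 coordinators).
  def minimum_cluster(f):
      return {'daemons': f + 1, 'coordinators': 2 * f + 1}

  minimum_cluster(2)  # {'daemons': 3, 'coordinators': 5}, matching the five-node
                      # coordinator and the 'tolerate 2 failures' space used above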

11.4 Shutting Down and Restoring a Cluster

On occasion, you might need to shut down a HyperDex cluster completely; planned power outages or kernel upgrades might require such shutdowns. Note that a complete shutdown, by definition, kills every server process and therefore exceeds the fault-tolerance threshold. Managing and recovering from such shutdowns is in general a non-trivial process, but HyperDex makes it easy to restart all processes without any data loss.

Shutting down a cluster is a three-step process. First, stop all client traffic. Second, kill all the daemon processes using SIGHUP, SIGINT, or SIGTERM. The daemons will inform the coordinator of their departure, and the coordinator will track which daemons departed cleanly. Finally, kill each coordinator process using SIGHUP, SIGINT, or SIGTERM.
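
As a sketch, assuming you know the process IDs involved (the placeholders below are not real PIDs), the shutdown might look like this:

  # step 1: stop all client traffic
  # step 2: terminate every daemon so it can report its departure
  kill -TERM <daemon-pid>        # repeat for each daemon
  # step 3: once the daemons are gone, terminate every coordinator
  kill -TERM <coordinator-pid>   # repeat for each coordinator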

To restore a cluster, first restart the coordinator nodes and then restart the daemon nodes. Because all state was cleanly saved to disk, this restart brings back the entire state of the database without data loss.
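
For the example cluster in this chapter, and assuming each process keeps the data directory it used before, that means re-running the coordinator commands first and the daemon commands second:

  hyperdex coordinator -f -l 127.0.0.1 -p 1982 -D /path/to/coord1
  hyperdex coordinator -f -l 127.0.0.1 -p 1983 -c 127.0.0.1 -P 1982 -D /path/to/coord2
  ...
  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2012 \
                     --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data1
  ...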