Chapter 3
Quick Start

This chapter is designed to give you a whirlwind tour of HyperDex. The first section gives a high-level overview of a typical HyperDex deployment and application. Subsequent sections provide step-by-step instructions for how HyperDex could be used to build store data for a hypothetical phone book application.

3.1 High-Level Overview

A HyperDex cluster consists of three types of components: clients, servers, and coordinators.

Applications built on top of HyperDex are the clients. They use the HyperDex API through various bindings for popular languages (e.g. C, C++, Java, and Python) to issue operations, such as get, put, search, delete, etc, to the storage system.

The HyperDex servers are responsible for storing the data in the system. You can have a cluster with as few as just a single server, though typical installations will have dozens to hundreds of servers. These servers store the data in memory and on disk, and they can crash at any time. HyperDex can be configured to tolerate as many failed nodes as desired.

The HyperDex coordinator maintains the entire cluster. This involves making sure that servers are up, detecting failed or slow nodes, taking them out of the system, and replacing them where necessary. The coordinator maintains a critical data structure, the hyperspace map, that establishes the mapping of data to servers. Clients use this map to locate the servers they need to contact, while servers use it to perform object propagation and replication to achieve the application’s desired goals.

In this tutorial, we’ll discuss how to get a HyperDex cluster up and running. In particular, we’ll create a simple space, insert objects into it, retrieve those objects, and then perform queries over these objects.

3.2 Starting the Coordinator

First, we need to start up a coordinator that will oversee the organization of the hyperspace.

The following command starts and initializes the coordinator:

  hyperdex coordinator -f -l 127.0.0.1 -p 1982

The “coordinator” command spawns a new HyperDex coordinator process listening on on the specified address. All other servers are introduced to the HyperDex cluster via this address, so it is important that it be accessible to all servers. For the purposes of this tutorial, binding to the loopback address will suffice, but you will probably want to use a public IP address for a real deployment

At this point, the coordinator is up and running, and we’re ready to start up additional nodes in our cluster.

3.3 Starting HyperDex Daemons

The HyperDex daemons are the workhorse processes that actually house the data in the data store and respond to client requests. Let’s start a daemon on the the same machine as the coordinator:

  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2012 \
                     --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data

This command runs HyperDex in the foreground, listens for incoming connections on address 127.0.0.1:2012 and connects to the coordinator that we started earlier. This enables the daemon to announce its presence, which in turn enables the coordinator to assign partitions of data to the newly arrived daemon. For example, the coordinator can direct the daemon to take over some of the partitions from an existing daemon, participate in the data replication protocol, and start handling client requests. But since we have no data yet and this is our first node, this particular daemons does not have much work to do.

Note that the second argument (127.0.0.1) is the IP address to which we want to bind the daemon. Once again, we are using a loopback address here for the purposes of the tutorial, while in real life you will undoubtedly want to use one of the IP addresses bound to your internal network. If no address is specified, HyperDex will attempt to autodetect an external address.

The last argument is a pointer to a directory where the daemon will store all data. Each HyperDex daemon must have its own data directory.

It doesn’t hurt to start a few more daemon instances at this point (although it is not required to continue with the tutorial).

You now have a functional HyperDex cluster. It’s time to do something with it.

3.4 Creating a new Space

Let’s imagine that we are building a phone book application. Such a phonebook would end up keeping track of people’s first name, last name, and phone number. And to distinguish unique users, it might assign a user id to each user. In HyperDex, each such collection of data is called a space and is conceptually similar to tables in traditional databases. Let’s see how we can instruct HyperDex to create a suitable space for holding such objects.

First, let’s open a Python shell and connect to HyperDex via the Admin interface:

  >>> import hyperdex.admin
  >>> a = hyperdex.admin.Admin(’127.0.0.1’, 1982)

This line instructs the admin bindings to talk to the coordinator and get the current cluster configuration. There is no need for static configuration files. Clients always receive the most up-to-date configuration, and if the configuration changes, say, due to failures, the coordinator will detect that a client is operating with an out-of-date configuration and instruct it to retry with a fresh config. The only thing a client must know to connect to a cluster is the address of the cluster coordinator.

Let’s programmatically create a new space using the Python bindings:

  >>> a.add_space(’’’
  ... space phonebook
  ... key username
  ... attributes first, last, int phone
  ... subspace first, last, phone
  ... create 8 partitions
  ... tolerate 2 failures
  ... ’’’)
  True

This command creates a new space called “phonebook” that has a key of “username”, and has attributes “first”, “last”, and “phone”. By default, HyperDex treats every attribute as an opaque bytestring, but provides other types as well. Here, we specify that the phone number be treated as an integer. The available datatypes are discussed in Chapter 4.

Even though the objects we specify will have many attributes, HyperDex will not necessarily create one giant hyperdimensional space spanning as many dimensions as attributes. Doing so would not be desirable, as the space would be to large to efficiently map onto a cluster. Instead, we will typically want to create subspaces consisting of smaller numbers of dimensions. The lower number of dimensions enable the mapping from points in space to nodes in the cluster to be more efficient; in particular, fewer nodes need to be contacted during search operations. In this simple example, we create a 3-dimensional subspace for the first, last and phone attributes. HyperDex always implicitly creates a 1-dimensional subspace for the key of objects.

In other NoSQL systems, objects can only be retrieved efficiently by the key under which they were inserted. So an object ¡jsmith, John, Smith, 555-1234¿ can only be retrieved by its key “jsmith”. Subspaces enable HyperDex to retrieve all “John” or “Smith” objects or, even, reverse lookups by phone number. The key serves as an object identifier so that objects may be retrieved or stored efficiently. Internally, the key is used to sequence updates and ensure consistency.

Even though we’ve only deployed one server in this example, we may want to leave room for future growth of our HyperDex cluster. The “create 8 partitions” line specifies that HyperDex will partition the resulting space into 8 regions. As a general rule, the number of partitions should be greater than the number of daemons that will ever join the cluster. It’s perfectly acceptable to omit this line, and HyperDex will partition the cluster into 256 regions.

Since large scale cloud-computing deployments are sure to encounter failures, we will want to safeguard the data in our key-value store by replicating for fault tolerance. The “phonebook” space is configured to tolerate up to two concurrent failures (“tolerate 2 failures”). This instructs HyperDex to protect against up to two failures by creating three replicas of each object. Even if two daemons holding an object fail, there will still be one copy of the object remaining. HyperDex automatically repairs from this one remaining copy.

Finally, it’s possible to create objects using command-line tools that ship with HyperDex. One could have created the “phonebook” space using the command-line:

  hyperdex add-space -h 127.0.0.1 -p 1982 << EOF
     space phonebook
     key username
     attributes first, last, int phone
     subspace first, last, phone
     create 8 partitions
     tolerate 2 failures
  EOF

3.5 Interacting with the “phonebook” Space

Now that we have our hyperspace defined and ready to go, it’s time to insert some information into our phonebook.

Continuing in the same Python session, lets open a new Client connection to the cluster. HyperDex separates the metadata manipulation provided by the Admin library from the data manipulation provided by the Client library.

  >>> import hyperdex.client
  >>> c = hyperdex.client.Client(’127.0.0.1’, 1982)

We can create an object by doing the following:

  >>> c.put(’phonebook’, ’jsmith1’, {’first’: ’John’, ’last’: ’Smith’,
  ...                                ’phone’: 6075551024})
  True

This operation will determine the right spot in the hyperspace for this object and issue the put operation to all replicas. The operation will only return once the object has been committed at all requisite nodes.

We can easily retrieve the same “jsmith” object by using a standard get.

  >>> c.get(’phonebook’, ’jsmith1’)
  {’first’: ’John’, ’last’: ’Smith’, ’phone’: 6075551024}

Yay, we inserted an object under the key “jsmith1” and retrieved it using the same key. This looks exactly like every other NoSQL store out there, but there are a few differences.

First, it’s blazingly fast. You can look at our latest performance graphs for the precise comparisons, but typically, HyperDex is just way faster than other key-value stores.

Second, it’s fault-tolerant. When we performed the put, our operation was sent through a value-dependent chain of daemons assigned to the object. The client received an acknowledgment only when the object was replicated on every single server in the chain. Unlike NoSQL stores that optimistically assume that an update was committed before reaching all servers, HyperDex responds only when all daemons have been updated. And we can pick the replication level to achieve any level of fault tolerance we desire.

Finally, it’s consistent. If we had multiple concurrent put operations being issued by multiple clients at the same time, we would never see an inconsistent state. What is an inconsistent state? It’s what you get when you settle for eventual consistency. For instance, we would not want a prescription tracking system to say that we dispensed a drug, then to say we did not, only to settle on (say) having dispensed it. Yet this is precisely what might happen with an eventually consistent NoSQL key-value store. Eventual consistency is no consistency at all. In contrast, HyperDex provides linearizability. Time will never roll backwards from the point of any client.

And it gets better. For we can not only retrieve objects by their key, but we can also retrieve them when we don’t know their key. Here are some examples:

  >>> [x for x in c.search(’phonebook’, {’first’: ’John’})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]
  >>> [x for x in c.search(’phonebook’, {’last’: ’Smith’})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]

Let’s do that reverse phone number search:

  >>> [x for x in c.search(’phonebook’, {’phone’: 6075551024})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]

Here’s a fully-qualified search. Hyperspace hashing makes this nearly as fast as a key-based lookup:

  >>> [x for x in c.search(’phonebook’,
  ...  {’first’: ’John’, ’last’: ’Smith’, ’phone’: 6075551024})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]

Let’s add another user named “John Doe”:

  >>> c.put(’phonebook’, ’jd’, {’first’: ’John’, ’last’: ’Doe’, ’phone’: 6075557878})
  True
  >>> [x for x in c.search(’phonebook’,
  ...  {’first’: ’John’, ’last’: ’Smith’, ’phone’: 6075551024})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]
  >>> [x for x in c.search(’phonebook’, {’first’: ’John’})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’},
   {’first’: ’John’,
    ’last’: ’Doe’,
    ’phone’: 6075557878,
    ’username’: ’jd’}]
  >>> [x for x in c.search(’phonebook’, {’last’: ’Smith’})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]
  >>> [x for x in c.search(’phonebook’, {’last’: ’Doe’})]
  [{’first’: ’John’,
    ’last’: ’Doe’,
    ’phone’: 6075557878,
    ’username’: ’jd’}]

Should John Doe decide he no longer wants to be listed in the phonebook, it’s trivial to remove his listing:

  >>> c.delete(’phonebook’, ’jd’)
  True
  >>> [x for x in c.search(’phonebook’, {’first’: ’John’})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075551024,
    ’username’: ’jsmith1’}]

Suppose John Smith needs to change his phone number. This is easily accomplished by specifying just the key for the object and the changed attribute. All other attributes will be preserved.

  >>> c.put(’phonebook’, ’jsmith1’, {’phone’: 6075552048})
  True
  >>> c.get(’phonebook’, ’jsmith1’)
  {’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075552048}

Smith is a popular name. Let’s say there was “John Smith” from Rochester (area code 585):

  >>> c.put(’phonebook’, ’jsmith2’,
  ...          {’first’: ’John’, ’last’: ’Smith’, ’phone’: 5855552048})
  True
  >>> c.get(’phonebook’, ’jsmith2’)
  {’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 5855552048}

Suppose we want to locate everyone named “John Smith” from Ithaca (area code 607). We can do this with a range query in HyperDex.

     >>> [x for x in c.search(’phonebook’,
     ...  {’last’: ’Smith’, ’phone’: (6070000000, 6080000000)})]
     [{’first’: ’John’,
       ’last’: ’Smith’,
       ’phone’: 6075552048,
       ’username’: ’jsmith1’}]

Or perhaps we want to search for everyone whose name falls between “’Jack’“ and “’Joseph’“:

  >>> [x for x in c.search(’phonebook’,
  ...  {’first’: (’Jack’, ’Joseph’)})]
  [{’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 6075552048,
    ’username’: ’jsmith1’},
   {’first’: ’John’,
    ’last’: ’Smith’,
    ’phone’: 5855552048,
    ’username’: ’jsmith2’}]

3.6 Cleaning Up

When we’re done with the “phonebook” space, we can delete the space and all the items in it directly from Python:

  >>> a.rm_space(’phonebook’)
  True