Fork me on GitHub

Chapter 4
Data Types

As we saw in the previous chapter, HyperDex offers support for the basic get, put, and search operations on strings and integers. In this chapter, we explore HyperDex’s richer datatypes and atomic operations on lists, sets, and maps. We will see how providing efficient atomic operations on these rich, native datastructures greatly simplifies design for applications with complicated data layout requirements.

By the end of this chapter you’ll be familiar with all datatypes provided by HyperDex, and a number of atomic operations.

4.1 Setup

As in the previous chapter, the first step is to deploy the cluster and connect a client. First we launch and initialize the coordinator:

  hyperdex coordinator -f -l 127.0.0.1 -p 1982

Next, let’s launch a few daemon processes to store data. Execute the following commands (note that each instance binds to a different port and has a different /path/to/data):

  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2012 \
                     --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data1
  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2013 \
                     --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data2
  hyperdex daemon -f --listen=127.0.0.1 --listen-port=2014 \
                     --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data3

This brings up three different daemons ready to serve in the HyperDex cluster. Finally, we create a space which makes use of all three systems in the cluster. In this example, let’s create a space that may be suitable for storing profiles in a social network:

  >>> import hyperdex.admin
  >>> a = hyperdex.admin.Admin(’127.0.0.1’, 1982)
  >>> a.add_space(’’’
  ... space profiles
  ... key username
  ... attributes
  ...    string first,
  ...    string last,
  ...    int profile_views,
  ...    list(string) pending_requests,
  ...    set(string) hobbies,
  ...    map(string, string) unread_messages,
  ...    map(string, int) upvotes
  ... subspace first, last
  ... subspace profile_views
  ... ’’’)
  True
  >>> import hyperdex.client
  >>> c = hyperdex.client.Client(’127.0.0.1’, 1982)

Finally, let’s create a profile for John Smith that we can use in the rest of this chapter.

  >>> c.put(’profiles’, ’jsmith1’, {’first’: ’John’, ’last’: ’Smith’})
  True
  >>> c.get(’profiles’, ’jsmith1’)
  {’first’: ’John’, ’last’: ’Smith’,
   ’profile_views’: 0,
   ’pending_requests’: [],
   ’hobbies’: set([]),
   ’unread_messages’: {},
   ’upvotes’: {}}

4.2 Strings

The basic datatype in HyperDex is a byte string. If you don’t specify the type of an attribute when creating a space, it is automatically treated as an 8-bit bytestring. This means that you’ll have to encode and decode unicode strings as appropriate. For example, if John wanted to add an accent to his name on his social network page, the code could look like:

  >>> c.put(’profiles’, ’jsmith1’, {’first’: u’Jhn’.encode(’utf8’)})
  True

This encodes the string to raw bytes using UTF-8. When fetching his profile it is necessary to decode the UTF-8:

  >>> c.get(’profiles’, ’jsmith1’)[’first’]
  ’J\xc3\x83\xc2\xb3hn’
  >>> c.get(’profiles’, ’jsmith1’)[’first’].decode(’utf8’)
  u’Jhn’
  >>> c.put(’profiles’, ’jsmith1’, {’first’: ’John’, ’last’: ’Smith’})
  True

Of course, it’s always possible to change John’s name back to its unaccented form:

  >>> c.put(’profiles’, ’jsmith1’, {’first’: ’John’, ’last’: ’Smith’})
  True
  >>> c.get(’profiles’, ’jsmith1’)[’first’]
  ’John’

HyperDex knows nothing about encodings, so it is up to the application to encode or decode data appropriately.

4.3 Integers

As we’ve already seen, HyperDex supports get and put operations on integers. In addition to these basic operations, HyperDex provides atomic operations to manipulate integers using basic math operations. This is useful when implementing features such as page-view counters. Let’s add support for tracking the profile views of a page by incrementing the counter:

  >>> c.atomic_add(’profiles’, ’jsmith1’, {’profile_views’: 1})
  True
  >>> c.get(’profiles’, ’jsmith1’)
  {’first’: ’John’, ’last’: ’Smith’,
   ’profile_views’: 1,
   ’pending_requests’: [],
   ’hobbies’: set([]),
   ’unread_messages’: {},
   ’upvotes’: {}}

Note that this change required just one request to HyperDex. The server atomically examines the current value, and changes it by the amount specified. In this case, the profile_views attribute is incremented by one.

HyperDex supports a full range of basic operations including:

4.4 Floats

HyperDex also supports double-precision floating point types. Like integers, floats support range searches and atomic operations.

4.5 Lists

Let’s add support for friend requests using HyperDex lists the basis of the feature. For this we’ll use the pending_requests attribute in the profiles space.

Imagine that shortly after joining, John Smith receives a friend request from his friend Brian Jones. Behind the scenes, this could be implemented with a simple list operation, pushing the friend request onto John’s pending_requests:

  >>> c.list_rpush(’profiles’, ’jsmith1’, {’pending_requests’: ’bjones1’})
  True
  >>> c.get(’profiles’, ’jsmith1’)[’pending_requests’]
  [’bjones1’]

The operation list_rpush is guaranteed to be performed atomically, and will be applied consistently with respect to all other operations on the same list.

4.6 Sets

Our social networking app captures the notion of hobbies. A set of strings is a natural representation for a user’s hobbies, as each hobby is represented just once and may be added or removed.

Let’s add some hobbies to John’s profile:

  >>> hobbies = set([’hockey’, ’basket weaving’, ’hacking’,
  ...                ’air guitar rocking’])
  >>> c.set_union(’profiles’, ’jsmith1’, {’hobbies’: hobbies})
  True
  >>> c.set_add(’profiles’, ’jsmith1’, {’hobbies’: ’gaming’})
  True
  >>> c.get(’profiles’, ’jsmith1’)[’hobbies’]
  set([’hacking’, ’air guitar rocking’, ’hockey’, ’gaming’, ’basket weaving’])

If John Smith decides that his life’s dream is to just write code, he may decide to join a group on the social network filled with like-minded individuals. We can use HyperDex’s intersect primitive to narrow down his interests:

  >>> c.set_intersect(’profiles’, ’jsmith1’,
  ...                 {’hobbies’: set([’hacking’, ’programming’])})
  True
  >>> c.get(’profiles’, ’jsmith1’)[’hobbies’]
  set([’hacking’])

Notice how John’s hobbies become the intersection of his previous hobbies and the ones named in the operation.

Overall, HyperDex supports simple set assignment (using the put interface), adding and removing elements with set_add and set_remove, taking the union of a set with set_union and storing the intersection of a set with set_intersect.

4.7 Maps

Lastly, our social networking system needs a means for allowing users to exchange messages. Let’s demonstrate how we can accomplish this with the unread_messages attribute. In this contrived example, we’re going to use an object attribute as a map (aka dictionary) to map from a user name to a string that contains the message from that user, as follows:

  >>> c.map_add(’profiles’, ’jsmith1’,
  ...           {’unread_messages’ : {’bjones1’ : ’Hi John’}})
  True
  >>> c.map_add(’profiles’, ’jsmith1’,
  ...           {’unread_messages’ : {’timmy’ : ’Lunch?’}})
  True
  >>> c.get(’profiles’, ’jsmith1’)[’unread_messages’]
  {’timmy’: ’Lunch?’, ’bjones1’: ’Hi John’}

HyperDex enables map contents to be modified in-place within the map. For example, if Brian sent another message to John, we could separate the messages with ”—” and just append the new message:

  >>> c.map_string_append(’profiles’, ’jsmith1’,
  ...                      {’unread_messages’ : {’bjones1’ :  | Want to hang out?’}})
  True
  >>> c.get(’profiles’, ’jsmith1’)[’unread_messages’]
  {’timmy’: ’Lunch?’, ’bjones1’: ’Hi John | Want to hang out?’}

Note that maps may have strings or integers as values, and every atomic operation available for strings and integers is also available in map form, operating on the values of the map.

For the sake of illustrating maps involving integers, let’s imagine that we will use a map to keep track of the plus-one’s and like/dislike’s on John’s status updates.

First, let’s create some counters that will keep the net count of up and down votes corresponding to John’s link posts, with ids ”http://url1.com” and ”http://url2.com”.

  >>> url1 = "http://url1.com"
  >>> url2 = "http://url2.com"
  >>> c.map_add(’profiles’, ’jsmith1’,
  ...           {’upvotes’ : {url1 : 1, url2: 1}})
  True

So John’s posts start out with a counter set to 1, as shown above.

Imagine that two other users, Jane and Elaine, upvote John’s first link post, we would implement it like this:

  >>> c.map_atomic_add(’profiles’, ’jsmith1’, {’upvotes’ : {url1: 1}})
  True
  >>> c.map_atomic_add(’profiles’, ’jsmith1’, {’upvotes’ : {url1: 1}})
  True

Charlie, sworn enemy of John, can downvote both of John’s urls like this:

  >>> c.map_atomic_add(’profiles’, ’jsmith1’, {’upvotes’ : {url1: -1, url2: -1}})
  True

This shows that any map operation can operate atomically on a group of map attributes at the same time. This is fully transactional; all such operations will be ordered in exactly the same way on all replicas, and there is no opportunity for divergence, even through failures.

Checking where we stand:

  >>> c.get(’profiles’, ’jsmith1’)[’upvotes’]
  {’http://url1.com’: 2, ’http://url2.com’: 0}

All of the preceding operations could have been issued concurrently – the results will be the same because they commute with each other and are executed atomically.

4.8 Timestamps

HyperDex has a special set of data types for storing timestamps. While one could store timestamps as a simple integer, perhaps seconds since the unix epoch, such a strategy would not work well if the timestamps were close relative to the range of an integer. HyperDex’s timestamp types use a special hash function designed for timestamps to spread hash values evenly throughout the hash space, while retaining a customizable degree of locality in the data.

  >>> a = hyperdex.admin.Admin(’127.0.0.1’, 1982)
  >>> a.add_space(’space events_by_second key timestamp(second) when attribute body’)
  True

In this space, the key-space will be broken into one second intervals, and timestamps that fall within the same interval will be hashed to nearby to each other, while timestamps that map to different intervals will be hashed further apart in the key space.

The HyperDex timestamp type appears to our application as a normal Python datetime.datetime object, and will be converted to and from this type by the Python API. Here’s some example events that we can insert into our space:

  >>> import datetime
  >>> c.put(’events_by_second’, datetime.datetime(2015, 1, 7, 21, 40, 49),
  ...       {’body’: ’the first event’})
  ...
  True
  >>> c.put(’events_by_second’, datetime.datetime(2015, 1, 7, 21, 40, 49, 1000),
  ...       {’body’: ’this event happens 1ms after the first’})
  ...
  True
  >>> c.put(’events_by_second’, datetime.datetime(2014, 1, 7, 21, 40, 49),
  ...       {’body’: ’this event happens a year earlier than the first’})
  ...
  True
  >>> c.put(’events_by_second’, datetime.datetime(2015, 1, 7, 21, 40, 51),
  ...       {’body’: ’yet another event in our system, 2s after the first’})
  ...
  True

In this sample data, our first two events happen exactly 1ms apart, but in the same second, so they will hash very closely in the key space—likely to the exact same server. The other events will be far apart from the first, and will likely reside on different servers in a cluster with more than a few servers.

Once our data is loaded, we can easily search using standard HyperDex comparison predicates like these:

  >>> from hyperdex.client import * # for the predicates below
  >>> # Search for everything in 2014
  >>> [x for x in c.search(’events_by_second’,
  ...  {’when’: [GreaterEqual(datetime.datetime(2014, 1, 1)),
  ...            LessThan(datetime.datetime(2015, 1, 1))]})]
  ...
  [{’when’: datetime.datetime(2014, 1, 7, 21, 40, 49),
    ’body’: ’this event happens a year earlier than the first’}]
  >>> # Search for everything in 2015
  >>> [x for x in c.sorted_search(’events_by_second’,
  ...  {’when’: [GreaterEqual(datetime.datetime(2015, 1, 1)),
  ...            LessThan(datetime.datetime(2016, 1, 1))]},
  ...   ’when’, 10, ’min’)]
  ...
  [{’when’: datetime.datetime(2015, 1, 7, 21, 40, 49),
    ’body’: ’the first event’},
   {’when’: datetime.datetime(2015, 1, 7, 21, 40, 49, 1000),
    ’body’: ’this event happens 1ms after the first’},
   {’when’: datetime.datetime(2015, 1, 7, 21, 40, 51),
    ’body’: ’yet another event in our system, 2s after the first’}]

4.8.1 Timestamp Intervals

When we created our space, we declared that we wanted the timestamp to be of type timestamp(second). HyperDex supports six different timestamp types that spread the data across servers in different ways. The six types are:

Each of these types will roughly map timestamps that fall within the same time interval (e.g., second, month, etc.) nearby to each other in the key space. When deciding which timestamp type to use for our data, we should consider the data we’ll store, and the queries we’ll perform. In general, we wish to pick a timestamp type large enough such that range queries will fall within one unit of the interval. For example, if we’ll be searching for objects that fall within a 24-hour window, it doesn’t make sense to pick the second, minute, or hour types because timestamps from throughout the 24-hour window will be spread throughout the cluster. The day, week, and month timestamps make much more sense for this query workload.

Picking too large of a timestamp interval, however, can cause problems when inserting timestamps in real time. For example, if we were to always pick the timestamp(month) type, and insert one event every millisecond, the approximately 2.5 billion events inserted would map to the same server, and all writes would be directed to a different server each 28-day interval. If, instead, we picked the timestamp(second) type, writes would be directed to a different server each second, evenly consuming disk space across the cluster.