Picking a NoSQL data store is a difficult task. To help with this process, we provide the results of a comparison between HyperDex and other popular NoSQL systems.
We use the Yahoo! Cloud Serving Benchmark (YCSB) for these comparisons. This is a widely used, long-established, open-source benchmark for NoSQL systems. It was developed by an independent party, Yahoo, and published in 2010; it has over 70 citations and has been used extensively for benchmarking. The HyperDex team had no part in the development of this benchmark. It is the de facto standard and, to our knowledge, the only credible open-source benchmark of its kind.
We benchmark HyperDex against MongoDB and Cassandra, two widely-known NoSQL systems. While our benchmarking setup is described in detail on the setup page, the key point is that, overall, the comparison platform provides a fair, apples-to-apples comparison. Specifically:
We use up-to-date versions of Cassandra, MongoDB, and HyperDex. We spent substantial time setting up Cassandra to avoid its known problems with automatic partitioning, and got help from 10gen in setting up MongoDB.
All systems are configured to replicate each object on two nodes. Cassandra and MongoDB offer weaker fault-tolerance guarantees than HyperDex, because they each have a window of vulnerability where a single failure may cause complete data loss. HyperDex has no such window of vulnerability.
Systems are configured with their default, out-of-the-box consistency settings. This places HyperDex, with its strong consistency guarantees, at a disadvantage compared to the weak eventual consistency offered by MongoDB and Cassandra.
In the following sections, we report the average throughput achieved by each NoSQL system on each of the six YCSB workloads. Each workload is patterned after a different style of database usage that Yahoo developers encountered in the large-scale systems they have built.
The first benchmark (YCSB Workload A) is patterned after a session store that records recent actions in a user session. This benchmark consists of 50% read and 50% update operations on keys chosen from a Zipf distribution.
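The operation mix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not YCSB's actual generator: the function names, the `user<N>` key format, and the exponent of 0.99 are all chosen here for the example, and the sketch makes the lowest-numbered keys the most popular, which a real benchmark need not do.

```python
import random
from itertools import accumulate


def make_zipf_sampler(num_keys, exponent=0.99, rng=None):
    """Return a function that samples key indices with Zipf-like popularity.

    The key with rank r (1-based) is drawn with probability proportional
    to 1 / r**exponent. The exponent is an assumption of this sketch.
    """
    rng = rng or random.Random()
    cum_weights = list(
        accumulate(1.0 / (rank ** exponent) for rank in range(1, num_keys + 1))
    )
    indices = range(num_keys)

    def sample():
        return rng.choices(indices, cum_weights=cum_weights, k=1)[0]

    return sample


def workload_a(num_ops, num_keys, read_fraction=0.5, seed=0):
    """Generate (operation, key) pairs: 50% reads, 50% updates by default."""
    rng = random.Random(seed)
    sample_key = make_zipf_sampler(num_keys, rng=rng)
    ops = []
    for _ in range(num_ops):
        kind = "read" if rng.random() < read_fraction else "update"
        ops.append((kind, f"user{sample_key()}"))
    return ops
```

Calling `workload_a(10000, 1000)` yields a list of roughly half reads and half updates in which a small set of keys receives most of the traffic, which is the property that makes this workload resemble a session store.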
The graph on the right shows the average throughput achieved by each system on this benchmark. HyperDex achieves a throughput that is 1.5 times higher than that of Cassandra, the closest competitor. Note that Cassandra is well known for its heavily optimized write path, and that this benchmark uses a ConsistencyLevel of ONE for Cassandra, which requires an acknowledgment from only a single replica. A node failure in Cassandra that occurs after the acknowledgment but before the data has been transferred to other replicas can lead to data loss. In contrast, no HyperDex operation is considered complete until it is fully fault-tolerant, and yet HyperDex still outperforms the write-optimized Cassandra on the overall benchmark.
The third benchmark (YCSB Workload C) is patterned after applications that access a large pre-computed data repository. The usage scenario approximates a user profile cache where profiles are generated offline using a system such as Hadoop, and stored in the NoSQL store as a cache. Consequently, this benchmark consists of 100% read operations.
HyperDex performs reads quickly and, consequently, outperforms Cassandra and MongoDB by a factor of 3-4 on this benchmark. Such a 100% read workload might arise when a website implements auto-complete for input boxes.
The fourth benchmark (YCSB Workload D) is patterned after a social network where users post status updates to be seen by other users. This benchmark consists of 95% read operations and 5% insert operations, on keys chosen from a temporally weighted distribution. This benchmark differs from the Photo Tags benchmark (YCSB Workload B) in two critical ways: new objects are created on the fly (insert vs. update) and newly inserted keys are preferentially chosen for retrieval (temporally weighted vs. Zipf).
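A temporally weighted key choice can be illustrated with a minimal Python sketch. It assumes integer keys assigned in insertion order and picks each read by sampling a Zipf-like distance back from the newest key; the function name, the exponent, and this particular weighting are assumptions made for the example and are not taken from YCSB's implementation.

```python
import random
from itertools import accumulate


def workload_d(num_ops, initial_keys, read_fraction=0.95, exponent=0.99, seed=0):
    """Generate (operation, key) pairs where reads favor recent inserts.

    Keys 0..initial_keys-1 exist at the start (initial_keys must be >= 1).
    Each insert creates the next integer key.
    """
    rng = random.Random(seed)
    next_key = initial_keys
    # Cumulative weights over "distance back from the newest key".
    # Distance 0 (the newest key) carries the largest weight.
    max_keys = initial_keys + num_ops
    cum_weights = list(
        accumulate(1.0 / (rank ** exponent) for rank in range(1, max_keys + 1))
    )
    ops = []
    for _ in range(num_ops):
        if rng.random() < read_fraction:
            # Only distances that land on an existing key are eligible.
            distance = rng.choices(
                range(next_key), cum_weights=cum_weights[:next_key], k=1
            )[0]
            ops.append(("read", next_key - 1 - distance))
        else:
            ops.append(("insert", next_key))
            next_key += 1
    return ops
```

Because the distance is measured from the newest key at the moment of each read, the popular set shifts as inserts arrive, unlike a fixed Zipf distribution where the same keys stay hot for the whole run.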
On this benchmark, HyperDex outperforms the other systems by a factor of 2 to 3. Note that the results of this benchmark are very similar to the results for the Photo Tags benchmark; this indicates that none of the systems have a bias for or against retrieving recently inserted objects. HyperDex's internal architecture makes read latencies very low, whether the reads are for old or recently created objects.
The fifth benchmark (YCSB Workload E) is patterned after an online forum with threaded conversations. It is based around a SCAN operation that retrieves posts in a given thread, clustered by thread ID. 95% of operations in this workload are scan operations that perform a range search. For HyperDex, we separate the thread ID from the key and insert it as its own attribute of the object, forcing HyperDex to operate on secondary attributes where the other two systems operate on the primary key.
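The difference between the two access paths can be shown with a small in-memory Python sketch. It does not use any real database client and says nothing about how HyperDex, Cassandra, or MongoDB execute these queries internally; the function names, the `(thread_id, post_id)` key layout, and the `thread_id` attribute name are hypothetical and chosen for illustration.

```python
import bisect


def scan_by_primary_key(sorted_keys, start_key, count):
    """Range scan on the primary key: up to `count` keys >= start_key.

    `sorted_keys` must already be sorted, so posts of one thread are adjacent.
    """
    start = bisect.bisect_left(sorted_keys, start_key)
    return sorted_keys[start:start + count]


def search_by_attribute(objects, attr, low, high):
    """Range search on a secondary attribute: low <= obj[attr] <= high."""
    return [obj for obj in objects if low <= obj[attr] <= high]


# Primary-key layout: the thread ID is the leading part of the key.
primary_keys = sorted((thread, post) for thread in range(3) for post in range(2))

# Secondary-attribute layout: the thread ID is stored as its own attribute.
objects = [
    {"post_id": f"{thread}-{post}", "thread_id": thread}
    for thread in range(3)
    for post in range(2)
]

posts_via_key = scan_by_primary_key(primary_keys, (1, 0), 2)
posts_via_attribute = search_by_attribute(objects, "thread_id", 1, 1)
```

Both calls return the two posts of thread 1. The first relies on the ordering of the primary key, while the second has to match on an attribute that is not part of the key, which is the setup the HyperDex side of this benchmark uses.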
HyperDex is nearly twice as fast as Cassandra, even though it is performing a retrieval by a non-primary key. MongoDB was unable to finish the benchmark in the allotted time, averaging 6.4 operations per second for an hour. A previous version (v2.0) of MongoDB was able to finish this benchmark in time, and underperformed both Cassandra and HyperDex.
The final benchmark (YCSB Workload F) is patterned after a user database. Operations are 50% read, 50% read-modify-write with objects chosen from a Zipf distribution. The benchmark is designed to execute a read and a write for each read-modify-write request and does so back-to-back from the same client thread for each key.
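The read-modify-write pattern described above can be sketched against a hypothetical in-memory store. `DictStore` and the function names below are stand-ins invented for this example rather than any system's client API, and keys are chosen uniformly for brevity where the benchmark uses a Zipf distribution.

```python
import random


class DictStore:
    """Hypothetical in-memory stand-in for a key-value store client."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def get(self, key):
        self.reads += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.writes += 1
        self.data[key] = value


def read_modify_write(store, key, modify):
    """Issue a read, then a write of the modified value, back-to-back."""
    current = store.get(key)
    store.put(key, modify(current))


def run_workload_f(store, num_ops, keys, rmw_fraction=0.5, seed=0):
    """Run the 50/50 mix and return how many read-modify-writes were issued."""
    rng = random.Random(seed)
    rmw_count = 0
    for _ in range(num_ops):
        key = rng.choice(keys)  # uniform for brevity; the benchmark uses Zipf
        if rng.random() < rmw_fraction:
            read_modify_write(store, key, lambda value: value + 1)
            rmw_count += 1
        else:
            store.get(key)
    return rmw_count
```

Each read-modify-write here costs the store one read and one write, so a run of N operations produces N reads and roughly N/2 writes. The two steps are separate operations, so nothing in this sketch prevents another client from writing the same key between them.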
HyperDex is four times faster than MongoDB and provides nearly double the throughput that Cassandra provides.
Get HyperDex at the HyperDex Downloads Page.
To get started with HyperDex, check out the HyperDex Quickstart Tutorial.