HyperLevelDB is the data storage engine that powers HyperDex. We forked HyperLevelDB from Google's LevelDB and adapted it to more closely meet the needs of HyperDex.
HyperLevelDB improves on LevelDB in two key ways:
All numbers reported here were taken from the LevelDB
tool provided with both LevelDB and HyperLevelDB. For all reported experiments,
*LevelDB is configured with a 64MB write buffer, and executes the "fillrandom"
benchmark. Each experiment was run on two platforms:
|Older System||Newer System|
|CPU||Intel Xeon 2.5 Ghz E5420||Intel Core i5 750|
|RAM||16 GB||4 GB|
|Disk||500 GB SATA 3.0 Gbps HDD||Intel 520 Series 120 GB SSD|
|OS||Debian 7 64-bit||Debian 7 64-bit|
Internally, LevelDB and HyperLevelDB both allow concurrent writers to safely insert data into the database by providing internal synchronization. LevelDB uses very coarse-grained synchronization which forces all writes to proceed in an ordered, first-come-first-served fashion, effectively reducing throughput to that of a single thread. HyperLevelDB increases concurrency by allowing multiple threads to agree on the order of their respective writes, and then independently apply the writes in a manner consistent with the agreed-upon order.
The figure to the right shows the effect of HyperLevelDB's increased parallelism on the Older System Configuration using the "fillrand" benchmark. In the single-threaded case, HyperLevelDB achieves nearly the same throughput as LevelDB. The performance difference is largely due to increased overhead associated with the fine-grained locking. The fine-grained locking pays off, however, as HyperLevelDB enables the writer threads to achieve a throughput that is approximately one and a half times that of LevelDB when there are two or more threads.
The difference in the fine-grained locking is more visible with the Newer System Configuration, shown at the left. In this configuration, we see that the HyperLevelDB writers achieve higher aggregate throughput with multiple threads than with a single thread. For the same workload, LevelDB's writers see total throughput under concurrent writers drop to one third of the throughput achieved by a single thread in isolation.
Both LevelDB and HyperLevelDB periodically rearrange the data on disk to merge newly inserted items. This process, called "compaction" is expensive as it rewrites existing data, possibly more than once over the course of a database's lifetime, slowing down other operations on the database. HyperLevelDB changes LevelDB's compaction algorithm to not block writers on the background compaction threads, and to improve the efficiency of the compaction itself.
The default LevelDB algorithm blocks writers for one millisecond each time the amount of uncompacted data exceeds a predefined threshold, thus limiting further writes to at most one-thousand writes per second. This effectively stops writes until compaction. By limiting writes in this manner, LevelDB ensures that the cost of a read will not grow without bound -- effectively trading write throughput for read throughput. HyperLevelDB removes this artificial delay, allowing the application (in our case, HyperDex) to independently decide to delay writes, using information available outside the scope of LevelDB.
LevelDB arranges data in multiple levels, where objects within a level are sorted by key, and each successively higher level holds ten times more data than the previous level. The compaction process moves data upwards between levels, maintaining the order invariant by picking from adjacent levels, files that totally overlap, and performing a merge on those files. The level to compact is chosen by setting an upper bound on the size of data within the level and picking the level which exceeds this bound by the greatest amount. Within each level, the data to compact is chosen by progressively moving through the key-space, returning to the beginning only reaching the end of the key space. Effectively, this process keeps the lowest levels as full as possible, and evenly compacts the data within each level.
The downside to LevelDB's approach to compaction, which HyperLevelDB corrects, is that it introduces a high overhead for compaction. The overhead of a compaction between two levels is the ratio of data moved from the lower level to data rewritten in the upper level. Intuitively, an efficient compaction moves more data between levels than it rewrites at the upper level. LevelDB's compaction algorithm is not efficient, and in the "fillrand" benchmark will, on average, rewrite 3MB of data in the upper level for every 1MB of data in the lower level. HyperLevelDB avoids this waste by selecting the compaction with the smallest overhead. By reducing overhead, HyperLevelDB reduces the amount of data written, enabling the system to perform additional compaction, or write operations.
The figure on the left shows the throughput of LevelDB and HyperLevelDB on the Older System Configuration "fillrand" benchmark as the number of inserted entries grows. LevelDB's effective throughput drops to 7% of its highest throughput, while HyperLevelDB remains above 62% of its highest achieved throughput. The difference is just as significant on the Newer System Configuration, shown at left, where HyperLevelDB retains 75% of its throughput while LevelDB drops to below 13%.
LevelDB is a great storage engine for local storage applications --- so great in fact that we adopted it as the de-facto storage engine for HyperDex. With these changes, LevelDB continues to be one of the best-suited backends for HyperDex.