Introducing BenchPress: Distributed Load Testing for NoSQL Databases

Recently a client of ours posed an interesting question: they wanted to store many tens of thousands of objects per second, with each object needing several KiB of space, and they wanted to know which storage systems were capable of handling that much load. They were already using a SQL-based DBMS, but it was struggling to keep up with the load on high-end hardware. With the hardware at its limits and ever-increasing load looming, non-relational storage systems seemed like a good fit. The question was, what system could best handle their workload? To guide the search, we wanted a benchmarking tool that could efficiently and easily generate test workloads for a variety of different storage systems.

The target workload exceeded what can be pushed across a gigabit Ethernet interface, so the tool needed to be able to coordinate load generation across many nodes. We also wanted to be able to drive testing completely programmatically so that we could easily compare a variety of different workloads. Since we don’t have racks full of idle hardware lying around, we also wanted something that would be easy to spin up on EC2. And, of course, an easy setup procedure would be nice, especially when we want to get external teams up to speed quickly.

Current tool landscape

The Grinder is one option for a benchmarking tool. It distributes work to multiple workers and is mostly written in Java (user-defined workload scripts are written in Jython). Its GUI-based control structure is great for quickly putting together fixed test scenarios, but it is awkward for the sort of vendor-neutral, programmatically driven testing we wanted to do. Using user-defined scripts to generate load is a flexible approach, but it also requires a fair amount of user effort to build support for each type of storage system.

YCSB is another choice. It supports many different databases and requires less setup work than The Grinder, but it doesn’t handle distributed workloads (beyond “run multiple clients at once and merge the results later”). It can be controlled via shell commands and properties files, but configuring the various clients is more labor intensive than we’d like, especially since that configuration must be redone manually on each node.

Though YCSB was a closer match to our priorities than The Grinder, we thought there was room for improvement for the sort of testing we wanted to do. So we decided to build a simple tool called BenchPress with a distributed-by-default approach, programmatic job control, and a focus on minimizing the time from downloading the tool to generating useful measurements. It’s not quite ready to be released to the world yet (though it will be soon!), but we’re nonetheless eager to get feedback on the approach we’re taking and on what we can do to make the tool useful for other people’s performance investigations.

System structure

The goals we set out to achieve were these:

  1. Generate hundreds of MiB/s of load to a storage system.
  2. Work across a variety of storage systems.
  3. Capture basic performance statistics.
  4. Customize the generated data (what keys, values, etc. are used).
  5. Minimize the learning and setup required before useful data can be gathered.
  6. Build the simplest thing that could possibly work.

There are a few initial conclusions we can draw from these goals.

  • We need to distribute load across many worker nodes since we can’t assume anything faster than 1Gbit Ethernet.
  • Since there will be many worker nodes, we need to coordinate them. The simplest way to manage a cluster of nodes is to have a single centralized controller node. As long as we’re careful to limit the responsibilities of the controller, it isn’t a bottleneck, and high availability isn’t really a concern for this task.
  • The controller would need to know which workers exist in the system, and though we could configure the controller with a list of workers, or vice versa, a service discovery approach would simplify setup.
  • We should avoid requiring any per-node setup beyond having a Java runtime, decompressing the BenchPress distribution, and running a startup shell script.
  • We’d like to allow common workloads to be defined declaratively so that users don’t have to write scripts or custom workload implementations.
  • The system should make performance data from all workers easily available to the user.

Given the above guidelines, we implemented the following structure.

  • Workers register their existence with ZooKeeper via Curator so that the controller can dynamically find workers at runtime. This means that the only setup that needs to be done to get BenchPress nodes up and running is to configure the workers and controller with the location of a ZooKeeper cluster. A single standalone ZooKeeper server is fine for this limited use, or you can just have the controller run an embedded ZooKeeper server.
  • Workers and the controller communicate via a simple REST API. There’s nothing performance critical happening between workers and the controller, so ease of use is a bigger priority than wire efficiency, and JSON over HTTP is as easy as it gets.
  • Workload job definitions are submitted to an endpoint in the controller’s REST API. A user can just use curl and a text editor, but programmatic job submission is also easy.
  • Workload job definitions contain all the configuration information needed to connect to the target storage system (e.g. for HBase, things like the ZooKeeper quorum). This means that no further setup needs to occur on each worker node to connect to HBase or MongoDB or any other system. It all just works out of the box.

Workload job flow

POSTing the following sample JSON to the controller’s API will start the job. In this case, the job will write 1,000,000 objects into MongoDB.

    {
        "task": {
            "type": "MONGODB",
            "config": {
                "hostname": "",
                "port": 27017,
                "dbName": "foo",
                "collectionName": "bar"
            },
            "op": "WRITE",
            "threads": 4,
            "quanta": 1000000,
            "batchSize": 1000,
            "keyGen": {
                "type": "WORKER_ID_THREAD_ID_COUNTER"
            },
            "valueGen": {
                "type": "ZERO_BYTE_ARRAY",
                "config": {
                    "size": 10000
                }
            },
            "progressReportInterval": 10000
        }
    }

Of note is that the user does not provide any implementation of how to write to MongoDB; they provide only (hopefully self-explanatory) configuration directives. This means that performing the same test against an HBase cluster requires only small changes to the JSON (HBase needs a ZooKeeper quorum string, for instance). Though this declarative approach constrains the user to the types of workload for which configuration options exist, we think the user-friendliness is worth the tradeoff, given the ease of implementing new storage systems and configuration options. Of course, custom job types are also possible.
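As an illustration, retargeting the sample job at HBase might mean swapping only the `type` and connection `config` while the rest of the task stays the same. The field names in the `config` block below are illustrative guesses, not BenchPress’s actual HBase schema:

```json
{
    "task": {
        "type": "HBASE",
        "config": {
            "zookeeperQuorum": "zk1.example.com,zk2.example.com,zk3.example.com",
            "table": "foo",
            "columnFamily": "bar"
        },
        "op": "WRITE",
        "threads": 4,
        "quanta": 1000000,
        "batchSize": 1000,
        "keyGen": { "type": "WORKER_ID_THREAD_ID_COUNTER" },
        "valueGen": { "type": "ZERO_BYTE_ARRAY", "config": { "size": 10000 } },
        "progressReportInterval": 10000
    }
}
```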

The controller divides the workload into partitions, one per worker. Each worker sends performance data back to the controller as it proceeds through its partition of the work; the controller aggregates all of this data and makes it available through the REST API.
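The divide-by-worker step can be sketched as a simple even split of the job’s `quanta`. This is an illustration of the idea, not BenchPress’s actual scheduling code:

```python
def partition_quanta(total_quanta, num_workers):
    """Split a job's total operation count into one partition per worker.

    Partition sizes differ by at most one, so no worker is left with a
    disproportionate share of the work.
    """
    base, remainder = divmod(total_quanta, num_workers)
    # Give the first `remainder` workers one extra quantum each.
    return [base + (1 if i < remainder else 0) for i in range(num_workers)]
```

For the sample job above, `partition_quanta(1000000, 3)` yields `[333334, 333333, 333333]`.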

Future plans

We’ve got lots of ideas that we’d like to implement. A few of them are:

  • Support more storage systems (DynamoDB, Riak, …). Adding support for more systems is easy, so why not enable users to do more out of the box?
  • More workload configuration options. Some real-life workloads use lots of columns, for instance, so it would be good to be able to concisely configure a workload to generate large numbers of arbitrary columns.
  • Automated cluster setup via tools like Whirr. Though node autodiscovery is a good start, it would be even better to be able to start up a fully functional cluster with a Whirr recipe.
  • More performance analysis options. CSV export, permanent performance data storage, a graph-laden web UI, and streaming live job progress to external systems would all be useful.

If you have input on what features would be useful to you, let us know in the comments or via email. BenchPress is on GitHub for your browsing pleasure.

Posted by Marshall Pierce

Marshall specializes in highly tuned and immensely scalable web and mobile applications. Experienced in front-end web and iOS development, he constantly pushes the boundaries of the latest browsers and mobile platforms. He splits his time with back-end development, where he is considered a domain expert in Java concurrency, distributed systems, systems design, and network security. Prior to co-founding Palomino Labs, Marshall was director of software development at Ness Computing where he led their initial launch. Before Ness, Marshall was a senior software developer at, where he built the best-in-class integration with

About Palomino Labs

Palomino Labs unlocks the potential of software to change people and industries. Our team of experienced software developers, designers, and product strategists can help turn any idea into reality.

See the Palomino Labs website for more information, or send us an email and let's start talking about how we can work together.