Bucketbench: Comparing Container Runtime Performance

I’ve had a few opportunities in 2017 to give talks about my bucketbench container runtime performance project, but I thought it was time to write up a blog post that will be a more permanent home for basic information and backstory on how it came to exist.

First of all, for the 25 minute overview via video, I’ve embedded the YouTube version of my OSCON Open Container Day talk from May below. You can also access both the video and slides at that talk’s slideshare location.

In 2016, I was approached by the IBM team that created the OpenWhisk serverless platform (now an Apache incubation project). They wanted to dig into some performance issues they were seeing with the Docker engine-based function executor they had built. After some discussion, it was clear we would need a way to compare various lifecycle events across a number of runtime components. Having heard about runc and containerd, they wanted to investigate whether using the full Docker engine was the best approach for the OpenWhisk function engine.

At the time, this team had already developed scripts to do some performance comparisons, but they were brittle and hard to customize or change. At that point I decided it would be an interesting effort to create a Golang-based framework for running a series of container lifecycle commands against a configurable set of container engines, and so bucketbench was born.


Currently bucketbench has driver implementations for Docker, runc, and containerd. Specifically for containerd there are two drivers: one which operates against the 0.2.x branch (used by the current Docker engine releases) using the ctr client, and another which uses the gRPC client library targeting containerd 1.0, which will be releasing soon.

Each driver implements the following lifecycle commands: create, run, stop, delete, pause, and resume. Any container engine which can implement these simple commands can be added as a driver implementation in bucketbench.  At this point my focus has been on the layer stack underneath Docker, but other engines like rkt, lxc/lxd, or others could easily be added.
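To make the idea of a driver concrete, here is a minimal sketch of what an interface covering those six lifecycle commands could look like in Go. The interface name, method signatures, and the in-memory fakeDriver are all assumptions made for illustration; they are not copied from the bucketbench source, whose actual driver API may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Driver is a hypothetical interface covering the six lifecycle commands
// named in this post. The real bucketbench interface may differ.
type Driver interface {
	Create(name, image string) error
	Run(name, image, cmd string, detached bool) error
	Stop(name string) error
	Delete(name string) error
	Pause(name string) error
	Resume(name string) error
}

// fakeDriver is an in-memory stand-in used only to exercise the interface.
// It tracks one state string per container name.
type fakeDriver struct{ state map[string]string }

func newFakeDriver() *fakeDriver { return &fakeDriver{state: map[string]string{}} }

// transition moves a container from one state to another, or fails.
func (d *fakeDriver) transition(name, from, to string) error {
	if got := d.state[name]; got != from {
		return fmt.Errorf("container %q is in state %q, want %q", name, got, from)
	}
	d.state[name] = to
	return nil
}

func (d *fakeDriver) Create(name, image string) error {
	if _, exists := d.state[name]; exists {
		return fmt.Errorf("container %q already exists", name)
	}
	d.state[name] = "created"
	return nil
}

// Run creates the container if needed and then marks it running.
func (d *fakeDriver) Run(name, image, cmd string, detached bool) error {
	if _, exists := d.state[name]; !exists {
		if err := d.Create(name, image); err != nil {
			return err
		}
	}
	return d.transition(name, "created", "running")
}

func (d *fakeDriver) Stop(name string) error   { return d.transition(name, "running", "stopped") }
func (d *fakeDriver) Pause(name string) error  { return d.transition(name, "running", "paused") }
func (d *fakeDriver) Resume(name string) error { return d.transition(name, "paused", "running") }

func (d *fakeDriver) Delete(name string) error {
	switch d.state[name] {
	case "":
		return fmt.Errorf("container %q not found", name)
	case "running", "paused":
		return errors.New("can't delete running container")
	}
	delete(d.state, name)
	return nil
}

func main() {
	var d Driver = newFakeDriver()
	fmt.Println("run:", d.Run("c1", "alpine:latest", "date", true))
	fmt.Println("delete while running:", d.Delete("c1"))
	fmt.Println("stop:", d.Stop("c1"))
	fmt.Println("delete:", d.Delete("c1"))
}
```

The point of the sketch is only that benchmark code written against such an interface never needs to know which engine sits behind it.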

Due to the nature of each of the currently implemented drivers and how they handle image or filesystem inputs, the only complexity has been sharing the concept of “what to run” within the driver abstraction. At this point, only runc and the legacy ctr driver do not handle the common registry/Docker or OCI image concept, so for those two drivers the “image” is a filesystem path to an OCI bundle containing an exploded root filesystem and a config.json OCI runtime specification. With that minor inconvenience in the abstraction handled, the benchmark code can use the same driver API against any of the drivers without any special casing for a specific driver.
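Since a wrong bundle path is an easy mistake to make with the runc and legacy ctr drivers, a small pre-flight check can help. The sketch below is not part of bucketbench; checkBundle is a hypothetical helper that assumes the usual OCI runtime bundle layout, where config.json names the root filesystem directory in its root.path field.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// bundleConfig picks out only the root.path field of an OCI config.json.
type bundleConfig struct {
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
}

// checkBundle reports whether dir looks like an OCI bundle: a config.json
// that parses as JSON, plus the root filesystem directory it points to.
// This is a hypothetical helper, not a bucketbench function.
func checkBundle(dir string) error {
	data, err := os.ReadFile(filepath.Join(dir, "config.json"))
	if err != nil {
		return fmt.Errorf("reading config.json: %w", err)
	}
	var cfg bundleConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return fmt.Errorf("parsing config.json: %w", err)
	}
	if cfg.Root.Path == "" {
		return errors.New("config.json has no root.path entry")
	}
	// A relative root.path is resolved against the bundle directory.
	rootfs := cfg.Root.Path
	if !filepath.IsAbs(rootfs) {
		rootfs = filepath.Join(dir, rootfs)
	}
	info, err := os.Stat(rootfs)
	if err != nil {
		return fmt.Errorf("checking root filesystem: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("%s is not a directory", rootfs)
	}
	return nil
}

func main() {
	// Bundle path taken from the example in this post; substitute your own.
	if err := checkBundle("/home/estesp/containers/alpine"); err != nil {
		fmt.Println("not a usable bundle:", err)
		return
	}
	fmt.Println("bundle looks usable")
}
```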


One of the enhancements I have made in recent weeks is to move away from the benchmark code itself being an abstraction which requires a Golang implementation to add new benchmark runs over a set of lifecycle commands. Instead, the user can now write a simple YAML definition file listing the drivers requested, the number of threads and containers to create in each, and a few core details (image, command, benchmark title) and be able to run that series and get results immediately without writing any Go code. The below snippet shows an example benchmark against Docker and runc using the alpine image.

name: Basic
image: alpine:latest
command: date
rootfs: /home/estesp/containers/alpine
detached: true
drivers:
  - type: Docker
    threads: 5
    iterations: 15
  - type: Runc
    threads: 5
    iterations: 50
commands:
  - run
  - stop
  - remove

Given this simple format, you can combine driver, thread, and iteration (which equates to the number of containers taken through the command progression within each thread) configurations and get basic output on the number of iterations/second, as well as detailed performance data on each lifecycle command (min, max, avg, etc.) for each run.

Have a custom binary (or UNIX socket path, in the case of the containerd driver) that you want to run against? An optional binary: entry can be specified in the driver section of the YAML to specify an exact binary or UNIX socket path to use for that particular driver client.
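For readers who think in Go types, the benchmark definition can be pictured roughly as the structs below. This is only a sketch: the type names, field names, and the Validate checks are assumptions for illustration and are not taken from the bucketbench source. The example values are the ones from the YAML in this post.

```go
package main

import (
	"errors"
	"fmt"
)

// DriverConfig is a hypothetical mirror of one entry in the driver section.
type DriverConfig struct {
	Type       string // e.g. "Docker" or "Runc"
	Threads    int    // maximum number of concurrent threads
	Iterations int    // containers taken through the command list per thread
	Binary     string // optional client binary or UNIX socket path
}

// Benchmark is a hypothetical mirror of the top-level YAML fields.
type Benchmark struct {
	Name     string
	Image    string
	Command  string
	RootFs   string
	Detached bool
	Drivers  []DriverConfig
	Commands []string
}

// Validate applies a few basic sanity checks chosen for this sketch.
func (b Benchmark) Validate() error {
	if b.Name == "" {
		return errors.New("benchmark needs a name")
	}
	if len(b.Drivers) == 0 {
		return errors.New("benchmark needs at least one driver")
	}
	if len(b.Commands) == 0 {
		return errors.New("benchmark needs at least one command")
	}
	for _, d := range b.Drivers {
		if d.Threads < 1 || d.Iterations < 1 {
			return fmt.Errorf("driver %s: threads and iterations must be at least 1", d.Type)
		}
	}
	return nil
}

// exampleBenchmark returns the definition used as the example in this post.
func exampleBenchmark() Benchmark {
	return Benchmark{
		Name:     "Basic",
		Image:    "alpine:latest",
		Command:  "date",
		RootFs:   "/home/estesp/containers/alpine",
		Detached: true,
		Drivers: []DriverConfig{
			{Type: "Docker", Threads: 5, Iterations: 15},
			{Type: "Runc", Threads: 5, Iterations: 50},
		},
		Commands: []string{"run", "stop", "remove"},
	}
}

func main() {
	b := exampleBenchmark()
	fmt.Println("valid:", b.Validate() == nil)
	for _, d := range b.Drivers {
		fmt.Printf("%s: up to %d threads, %d iterations each\n", d.Type, d.Threads, d.Iterations)
	}
}
```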

When more than one thread is requested, bucketbench currently performs n runs, with each run increasing the number of concurrent threads until n is reached. With this style of operation, you can see in the results how, for example, start performance is affected as concurrency increases. You may start to see significant outliers (e.g. a much higher max in the per-operation output along with a larger standard deviation) as the concurrency increases. The average time per lifecycle operation will also tend to rise as concurrency goes up.
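The ramp-up described above can be sketched in a few lines of Go using goroutines and a sync.WaitGroup. The function name runRamp and the lifecycle callback are hypothetical and stand in for whatever a real driver would do; this is an illustration of the pattern, not the bucketbench implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runRamp performs maxThreads runs. Run k uses k concurrent goroutines, and
// each goroutine calls lifecycle once per iteration. It returns the
// wall-clock duration of each run, indexed by thread count minus one.
func runRamp(maxThreads, iterations int, lifecycle func(thread, iter int)) []time.Duration {
	durations := make([]time.Duration, 0, maxThreads)
	for threads := 1; threads <= maxThreads; threads++ {
		var wg sync.WaitGroup
		start := time.Now()
		for t := 0; t < threads; t++ {
			wg.Add(1)
			go func(thread int) {
				defer wg.Done()
				for i := 0; i < iterations; i++ {
					lifecycle(thread, i)
				}
			}(t)
		}
		wg.Wait() // all goroutines of this run finish before the next run starts
		durations = append(durations, time.Since(start))
	}
	return durations
}

func main() {
	// The sleep is a placeholder for a real run/stop/delete sequence.
	work := func(thread, iter int) { time.Sleep(2 * time.Millisecond) }
	for i, d := range runRamp(3, 15, work) {
		fmt.Printf("%d thread(s): %v\n", i+1, d)
	}
}
```

One consequence of this pattern is that a driver configured with n threads and i iterations takes i × (1 + 2 + ... + n) containers through the command list in total, so the 3-thread, 15-iteration call above performs 90 lifecycle passes.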


The output of bucketbench is fairly self-explanatory. For each benchmark run per driver, a row of output is displayed showing the number of executions of the entire lifecycle command list per second against that driver.  This information is displayed with columns representing the number of concurrent threads. Obviously with more concurrency you would expect this number to rise as more operations are run per time slice, even as single operation times are increasing due to concurrency affecting IO, CPU and/or memory pressure.
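As a rough illustration of how such a rate can be derived, the sketch below divides the total number of completed passes through the command list by the elapsed wall-clock time of the run. The function name and the elapsed times are assumptions chosen for the example, not values measured by bucketbench.

```go
package main

import (
	"fmt"
	"time"
)

// iterationsPerSecond returns how many complete passes through the command
// list finished per second in one run, given the number of concurrent
// threads, the iterations per thread, and the wall-clock time of the run.
func iterationsPerSecond(threads, iterations int, elapsed time.Duration) float64 {
	if elapsed <= 0 {
		return 0
	}
	return float64(threads*iterations) / elapsed.Seconds()
}

func main() {
	// Hypothetical elapsed times for runs of 15 iterations per thread.
	fmt.Printf("1 thread:  %.2f\n", iterationsPerSecond(1, 15, 9500*time.Millisecond))
	fmt.Printf("2 threads: %.2f\n", iterationsPerSecond(2, 15, 13100*time.Millisecond))
}
```

This reading is consistent with the sample Docker output later in this post: with one thread, the run, stop, and delete averages add up to 628.80 per iteration, which would be about 1.59 iterations per second if the timings are in milliseconds, close to the 1.58 reported. The millisecond unit is my assumption; the output itself does not label it.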

As an aside, you may notice a Limit entry in the output that was copied over from the original scripts that ran for the OpenWhisk performance data gathering. The expectation was that the upper bound or “limit” of container start performance would be the creation/execution of a single process on Linux. So the Limit output shows, from 1 to 10 threads, your system's capability to exec as many processes as possible per second. This may or may not be that useful to a particular scenario, other than maybe validating that one system has similar overall process execution performance to another. This Limit benchmark can easily be skipped by providing the -s flag to bucketbench.

In addition to this simple “iterations per second” output, the second section of detailed output shows the performance for each command listed in the benchmark against the number of threads. The following output shows the performance of run, stop, and delete against the Docker driver for each run of 1, 2 and finally 3 concurrent threads. For each command you get the minimum, maximum, average, median, and standard deviation. Hopefully a future PR can provide other output formats like CSV to make it easy to take this data directly into chart form for comparisons. Note that if any lifecycle command ends in error (e.g. “can’t delete running container”) an error count is also reported in the detailed command output. A significant increase in errors as concurrency increases may be cause to look deeper for any potential problems in the container engine with concurrency, data races, etc.


                      Iter/Thd    1 thrd   2 thrds   3 thrds 
  DockerBasic:Docker        15      1.58      2.29      2.90 


  DockerBasic:Docker:1       Min       Max       Avg    Median    Stddev    Errors
                   run    343.00    399.00    367.27    366.00     14.13         0
                  stop    217.00    257.00    234.80    234.00      9.61         0
                delete     24.00     33.00     26.73     26.00      2.67         0
  DockerBasic:Docker:2       Min       Max       Avg    Median    Stddev    Errors
                   run    394.00    605.00    515.00    515.50     48.10         0
                  stop    252.00    406.00    315.57    309.50     31.42         0
                delete     25.00     54.00     34.30     34.00      7.33         0
  DockerBasic:Docker:3       Min       Max       Avg    Median    Stddev    Errors
                   run    467.00    810.00    618.56    623.00     68.82         0
                  stop    250.00    477.00    359.27    361.00     55.15         1
                delete     26.00     68.00     37.18     35.00      9.88         0
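For anyone post-processing raw timings themselves, the per-command columns above can be computed with a few lines of Go. This is a sketch under assumptions: the Stats type and summarize function are hypothetical names, and the standard deviation here is the population form; I have not confirmed which form bucketbench reports.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sort"
)

// Stats holds the summary columns shown in the detailed output.
type Stats struct {
	Min, Max, Avg, Median, Stddev float64
}

// summarize computes min, max, average, median, and population standard
// deviation over a set of timing samples.
func summarize(samples []float64) (Stats, error) {
	if len(samples) == 0 {
		return Stats{}, errors.New("no samples")
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	n := len(sorted)

	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	avg := sum / float64(n)

	sqDiff := 0.0
	for _, v := range sorted {
		sqDiff += (v - avg) * (v - avg)
	}

	// For an even count, the median is the mean of the two middle values.
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}

	return Stats{
		Min:    sorted[0],
		Max:    sorted[n-1],
		Avg:    avg,
		Median: median,
		Stddev: math.Sqrt(sqDiff / float64(n)),
	}, nil
}

func main() {
	// Made-up timings, not taken from a real benchmark run.
	s, err := summarize([]float64{2, 4, 4, 4, 5, 5, 7, 9})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%10.2f%10.2f%10.2f%10.2f%10.2f\n", s.Min, s.Max, s.Avg, s.Median, s.Stddev)
}
```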

Other Info/Caveats

Currently bucketbench is still fairly young in development terms. There are weaknesses to the model, as well as current simplifications for ease of initial development. For example, the benchmark parsing code does not check for reasonableness of the command list/ordering. It is up to the benchmark creator to make sure that the lifecycle commands make sense (e.g. no delete before create/start).
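A check of that kind could be a small state machine over the command list. The sketch below is purely illustrative and is not in bucketbench today; validateCommands is a hypothetical name, and the allowed transitions (for example, treating run as valid on a container that does not exist yet, and refusing to delete a running container) are my assumptions about sensible ordering.

```go
package main

import "fmt"

// validateCommands walks a command list through a simple container state
// machine and returns an error at the first command that does not make
// sense in the current state. Hypothetical helper, not bucketbench code.
func validateCommands(cmds []string) error {
	state := "none" // none, created, running, paused, stopped
	for i, c := range cmds {
		var next string
		var ok bool
		switch c {
		case "create":
			next, ok = "created", state == "none"
		case "run":
			next, ok = "running", state == "none" || state == "created"
		case "pause":
			next, ok = "paused", state == "running"
		case "resume":
			next, ok = "running", state == "paused"
		case "stop":
			next, ok = "stopped", state == "running"
		case "delete", "remove":
			next, ok = "none", state == "created" || state == "stopped"
		default:
			return fmt.Errorf("command %d: unknown command %q", i+1, c)
		}
		if !ok {
			return fmt.Errorf("command %d: %q is not valid in state %q", i+1, c, state)
		}
		state = next
	}
	return nil
}

func main() {
	fmt.Println(validateCommands([]string{"run", "stop", "remove"}))
	fmt.Println(validateCommands([]string{"delete", "run"}))
}
```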

The ctr legacy driver has not been tested thoroughly in any recent Docker release as much of the focus has turned to the gRPC interface for containerd 1.0.

One of the further steps that I haven’t had time to figure out is how to integrate bucketbench with other tracing or debug tooling to provide snapshots of useful lower-layer performance data that is not overwhelming given the number of containers that are run/executed at higher orders of concurrency. You may notice a trace flag that is not currently used/passed through in the driver code that I had been playing with in the runc environment. Any form of trace probes or intermittent capture of timings during the bucketbench runs might produce really useful data to investigate after the fact to better understand slowdowns at higher concurrency rates or other interesting performance challenges.

I would love feedback, PRs, and any other comments from those who are interested in digging into container runtime performance testing and evaluation!

Check out the bucketbench repository
