Storage Drivers in Docker: A Deep Dive
If you’ve played with Docker at all, you’ve probably at some point intersected with the concept of a storage driver. Or at some point you heard the term graphdriver and wondered what in the world that was. You may have heard people tossing around terms like aufs and devicemapper and wondered what it all meant. Recently I helped edit The New Stack’s 4th ebook in the container ecosystem series on storage, networking, and security. The chapter titled “Methods for Dealing with Container Storage” starts with a few pages on the general topic of the storage drivers in Docker but moves quickly to the hot topic of persistent storage. To add to that brief introduction, this blog post is my attempt to help sort out these concepts for you and provide some background on this key part of the Docker engine.
An important clarification as a starting point: plenty of other resources exist to help you understand persistent storage, the volume API, and plugins like Flocker from ClusterHQ, EMC’s rexray/libstorage project, Portworx, or other attempts to deal with persistent storage in the Docker ecosystem. This post focuses only on the current options for local container image storage, which the engine uses to assemble image layers into the root filesystem when you start a container.
There have been some helpful attempts to cover this topic in the past. Red Hat published an overview of graph driver implementations a few years ago. In 2015, Jérôme Petazzoni created a presentation on the history and implementation of the various graphdrivers available in Docker. Early this year, Jess Frazelle wrote her “brutally honest” guide to graphdrivers, which gave a good, brief overview of all the options to date. A couple things have happened since then: a new driver was introduced in Docker 1.12 named “overlay2” which has significant improvements over the original overlay implementation and also, new capabilities around quota support were recently added to some of the graphdriver options.
Given this background, I thought it would be useful to dig a bit deeper and look at the following helpful topics:
- Why do graphdrivers exist in Docker and what is their role?
- Why do we need more than one option?
- If I have all these options, how do I decide which one to use?
What is a graphdriver?
To begin to understand the name graphdriver, we have to first understand that a local instance of a Docker engine has a cache of Docker image layers. The layered image model is one of the unique features of the Docker engine that allows for shared filesystem content between one or more running container processes. This cache of layers is built up as explicit docker pull commands are executed, as well as docker build. Implicitly, layers may also be added to this cache as docker run commands are executed which require layers from a registry which don’t exist locally. To manage these layers at runtime, a driver is required which supports a specific set of capabilities (abstracted to the interface found here) to mount these layers into a consolidated root filesystem for the mount namespace of the container. Given the “image graph” of layer content represents the relationships between various layers, the driver to handle these layers is called a “graphdriver.”
Two very important concepts come into play to handle this layer graph at runtime. One is the concept of a union filesystem, best defined here at its Wikipedia entry. A union filesystem implementation handles this merging of filesystem content into a single mount point. Unless you are satisfied with a read-only root filesystem, union filesystem implementations are usually paired with a copy-on-write (CoW) implementation, such that changes to any entry within the underlying filesystem layers are “copied up” to a scratch/work/upper layer of the filesystem. This writable layer can then be viewed as a “diff” applied to the lower read-only layers, which are potentially shared as an underlay to many other running container processes. This is an important point: one of the major benefits of using a layered filesystem approach with Docker is that 1000 copies of an ubuntu:latest container running a bash shell share a single image underlying all 1000 copies. There are not 1000 copies of the filesystem (caveat: see vfs below), and almost as important, with the aufs and overlay implementations, the memory pages for read/execute segments of shared libraries are also shared between all running containers, significantly reducing memory use for common libraries like “libc”. This is a huge benefit of the layered approach and is one of the reasons Docker’s graphdriver system is such an important part of the engine’s implementation.
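To make the union and copy-on-write ideas above concrete, here is a minimal toy model in Python. It is only a sketch: layers are plain dictionaries, and the UnionMount class and WHITEOUT marker are names invented for this illustration rather than anything from Docker or the kernel. A real union filesystem copies a file up before modifying it in place, while this model simply records the new contents in the upper layer.

```python
# Toy model of a union filesystem with copy-on-write (not real kernel code).
# Each layer is a dict mapping path -> file contents.

WHITEOUT = object()  # marker recording that a path was deleted in the upper layer


class UnionMount:
    def __init__(self, lower_layers):
        # lower_layers: read-only layers, ordered bottom (base image) to top
        self.lowers = lower_layers
        self.upper = {}  # private writable layer for this one container

    def read(self, path):
        # The upper layer wins; otherwise search the lower layers top-down.
        if path in self.upper:
            entry = self.upper[path]
            if entry is WHITEOUT:
                raise FileNotFoundError(path)
            return entry
        for layer in reversed(self.lowers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Changes never touch the shared lower layers; they land in the upper layer.
        self.upper[path] = data

    def delete(self, path):
        self.read(path)  # raises if the path is not currently visible
        self.upper[path] = WHITEOUT


# 1000 "containers" share one base layer object instead of 1000 copies of it.
base = {"/bin/bash": "bash-binary", "/etc/hostname": "localhost"}
containers = [UnionMount([base]) for _ in range(1000)]
containers[0].write("/etc/hostname", "web-1")
```

Only the first container sees the new hostname. The base dictionary is untouched and still shared by every other container, which is the property that keeps disk use low in the layered model.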
So, now it should be clear what a graphdriver is and why the Docker engine implements this feature. Let’s next look at why we have so many options for a graphdriver in Docker.
So what are all these graphdrivers?!
Looking at the most recent Docker engine release, 1.12, you will find the following graphdriver options: vfs, aufs, overlay, overlay2, btrfs, zfs, devicemapper, and windows. Breaking this list of graphdrivers into a few specific categories will help us define each of them further as we go along.
The special snowflake: vfs
First, let’s get the one special graphdriver out of the way: vfs is the “naive” implementation of the interface that does not use a union filesystem or CoW techniques at all, but rather copies all the layers in order into a static subdirectory and mounts the end result as the container root filesystem. It is not meant for real (production) use but is very valuable for simple validation and testing of other parts of the Docker engine. It also comes in handy for Docker-in-Docker scenarios which can become hairy when nesting graphdrivers. Side note: the Dockerfile used by engine developers to build Docker itself uses vfs for the “inside” graphdriver.
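The "copy everything" approach described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the vfs driver's actual code: build_rootfs and demo are hypothetical helpers, the layers are temporary directories created just for the example, and file deletions between layers are ignored.

```python
import os
import shutil
import tempfile


def build_rootfs(layer_dirs, rootfs_dir):
    """Copy every layer, bottom to top, into one plain directory."""
    os.makedirs(rootfs_dir, exist_ok=True)
    for layer in layer_dirs:
        # Later layers overwrite files from earlier ones; nothing is shared.
        shutil.copytree(layer, rootfs_dir, dirs_exist_ok=True)
    return rootfs_dir


def demo():
    """Build two tiny layers in a temp directory and flatten them."""
    work = tempfile.mkdtemp()
    base = os.path.join(work, "base")
    app = os.path.join(work, "app")
    os.makedirs(os.path.join(base, "etc"))
    os.makedirs(os.path.join(app, "etc"))
    with open(os.path.join(base, "etc", "motd"), "w") as f:
        f.write("base")
    with open(os.path.join(base, "etc", "hosts"), "w") as f:
        f.write("127.0.0.1 localhost")
    with open(os.path.join(app, "etc", "motd"), "w") as f:
        f.write("app")
    rootfs = build_rootfs([base, app], os.path.join(work, "rootfs"))
    with open(os.path.join(rootfs, "etc", "motd")) as f:
        motd = f.read()
    hosts_exists = os.path.exists(os.path.join(rootfs, "etc", "hosts"))
    shutil.rmtree(work)
    return motd, hosts_exists
```

Every container built this way gets its own full copy of every file, which is why the "1000 containers, one image" benefit does not apply to vfs.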
Interestingly, only a few of the existing graphdrivers are truly union filesystems with CoW semantics: the two versions of overlay, and the original aufs driver that has existed since the early days of Docker. Remember that a union filesystem is really just a file-based interface to interleave a set of directories into a single view, so rather than being a “real” filesystem like ext4 or xfs, it simply offers these capabilities on top of an existing filesystem. In some cases there are restrictions on the underlying filesystem, and Docker checks both the requested union filesystem and the underlying filesystem magic to make sure they are compatible.
Specific filesystem implementations
The remaining options are all based on filesystem implementations which can provide the required capabilities through built-in filesystem features, like snapshots. These include the devicemapper, zfs, and btrfs drivers. In each of these cases you will actually have to have a disk created and formatted with the filesystem (or a loop-mounted file-as-disk for quick testing) to use these options as storage backends for the Docker engine.
What operations does a graphdriver have to perform?
First we should briefly explain what operations a graphdriver must perform. This information is codified in the interface definitions for Driver defined in the daemon codebase. Also worth noting is a ProtoDriver wrapper implementation called NaiveDiffDriver. For filesystems with no native handling for calculating layer differences/changes, this wrapper can be used in concert with the driver implementation to offer these “diff” calculation features using the archive package. Outside of the difference/changes methods, the most important capabilities of a graphdriver are found in the Remove functions. To help understand the API of the graphdriver we should talk briefly about the consumer of this API, implemented in Docker as the layerStore. This layer store is utilized by the distribution (registry) client code to add/remove layers as images are downloaded or imported using the Docker client/API commands by the end user. As we know, images can be composed of multiple layers, and these layers have a parent/child relationship. The graphdriver is driven by the layer store code to store these layers and the relationships according to what makes the most sense for that filesystem’s implementation of union+CoW-like layering. To handle the creation and un-tarring of these layer images, the graphdriver Create API is used as well as the ApplyDiff API to untar the layer contents into the created location via the graphdriver. Obviously the inverse is used when an image is deleted from your local cache: the layer store will ask the graphdriver to Remove the layer’s contents from the system.
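The "diff" calculation that a wrapper like NaiveDiffDriver provides can be pictured as a straightforward comparison of two filesystem snapshots. The sketch below is a simplification under stated assumptions: snapshots are dictionaries of path to contents, the changes function is a hypothetical helper, and Docker's archive package works on real directory trees and file metadata instead. The A/C/D letters follow the added/changed/deleted kinds that docker diff prints.

```python
def changes(parent, child):
    """Compare two snapshots (dicts of path -> contents) and list what changed."""
    result = []
    for path in sorted(child):
        if path not in parent:
            result.append(("A", path))  # added in the child
        elif child[path] != parent[path]:
            result.append(("C", path))  # changed relative to the parent
    for path in sorted(parent):
        if path not in child:
            result.append(("D", path))  # deleted in the child
    return result


parent = {"/etc/hosts": "v1", "/bin/sh": "sh", "/tmp/old": "x"}
child = {"/etc/hosts": "v2", "/bin/sh": "sh", "/usr/bin/app": "app"}
diff = changes(parent, child)
```

A driver whose backing filesystem already tracks per-layer changes can answer this question natively. One that cannot has to fall back on a walk-and-compare like this.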
Given the above, the graphdriver can now contain a local cache of various layers, with interrelationships correlating to named, downloaded images. At container runtime, these must be assembled into a runnable root filesystem before the container process is started. The Get method of the graphdriver is called on a specific identifier, and at this point, the filesystem implementation of the graphdriver has to walk the parent linkages and use specific technology that filesystem offers to “stack” the layers into a single mount point, creating that writeable “upper” or top layer to handle filesystem changes made by the container. The Put method notifies the graphdriver that the mounted resources are no longer necessary, allowing the driver in most cases to unmount the layer(s) involved.
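The lifecycle just described (Create, ApplyDiff, Get, Put, Remove) can be sketched as a small in-memory class. This is a toy under stated assumptions, not the Go interface from the daemon codebase: ToyGraphDriver is a hypothetical name, layers are dictionaries, "mounting" just returns a merged dictionary instead of a mount path with a writable top layer, and refusing to remove a layer that still has children is a choice made for this example.

```python
class ToyGraphDriver:
    """In-memory stand-in for a graphdriver; method names follow the text above."""

    def __init__(self):
        self.parents = {}   # layer id -> parent id (None for a base layer)
        self.contents = {}  # layer id -> dict of path -> data
        self.mounted = set()

    def create(self, layer_id, parent=None):
        if parent is not None and parent not in self.parents:
            raise KeyError(f"unknown parent {parent}")
        self.parents[layer_id] = parent
        self.contents[layer_id] = {}

    def apply_diff(self, layer_id, diff):
        # Stand-in for untarring a layer archive into the created location.
        self.contents[layer_id].update(diff)

    def get(self, layer_id):
        # Walk the parent links, then stack the layers from the base upward.
        chain = []
        current = layer_id
        while current is not None:
            chain.append(current)
            current = self.parents[current]
        merged = {}
        for lid in reversed(chain):
            merged.update(self.contents[lid])
        self.mounted.add(layer_id)
        return merged

    def put(self, layer_id):
        # The caller no longer needs the mounted view.
        self.mounted.discard(layer_id)

    def remove(self, layer_id):
        if layer_id in self.parents.values():
            raise ValueError(f"{layer_id} still has child layers")
        del self.parents[layer_id]
        del self.contents[layer_id]


driver = ToyGraphDriver()
driver.create("base")
driver.apply_diff("base", {"/bin/sh": "sh", "/etc/motd": "hello"})
driver.create("app", parent="base")
driver.apply_diff("app", {"/etc/motd": "app", "/srv/app": "code"})
rootfs = driver.get("app")
driver.put("app")
```

The interesting part is get: each real driver replaces the dictionary merge with whatever its filesystem offers, such as a union mount, a snapshot, or a thin-provisioned block device.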
Overview of Current Graphdrivers
With that background on what a graphdriver is meant to do in life, let’s do a quick overview of the options you have today in the Docker 1.12 engine. An important note for those who will try, or have already tried, multiple graphdrivers with a Docker engine installation: because each graphdriver stores its layers in its own implementation-specific way, any images pulled or built on one graphdriver will not be available when restarting the engine with a different graphdriver. This has been known to confuse users before, but fear not; switching back to the original graphdriver will reveal that all your images and containers still exist but were “hidden” from view when you tried the engine with a different graphdriver.
aufs
- History: The aufs driver has existed in Docker since the beginning of time! No really, it predates the use of the moniker graphdriver, and if you look at the project history prior to that commit, there is only an aufs implementation directly in the engine. See the devicemapper section below for more on the history of the “graphdriver” creation.
- Implementation: Aufs initially stood for “another union filesystem” given it was an attempt to rewrite the UnionFS implementation that existed at the time. It is, as you would expect, a traditional overlay allowing a stack of directories to be merged into a single mountpoint view of the layered content, using a feature that aufs calls “branches”. The Docker graphdriver combines the parent information into an ordered list provided via mount options to aufs so that aufs does the hard work of assembling those layers into a merged “union” view. Much more information on the specifics of the implementation is available on the aufs manpage.
- The Good: The longest existing and possibly the most tested graphdriver backend for Docker. It is reasonably performant and stable for a wide range of use cases. Even though it is only available on Ubuntu and Debian kernels (as noted below), there has been significant use of these two distros with Docker, giving the aufs driver lots of airtime in a broad set of environments. It enables shared memory pages for different containers loading the same shared libraries from the same layer (because they are the same inode on disk).
- The Bad: Aufs has never been accepted by the upstream Linux kernel community. It is a long-lived carried patchset that Ubuntu and Debian have been integrating into their kernels for many years, and the original author has given up trying to get it upstream. Similar maybe to the IPv4 vs. IPv6 debate, the worry is that aufs goes away someday during a kernel update where the aufs patch proves too onerous to migrate and disappears. But, like IPv6, that “promise” of having to migrate off aufs keeps getting moved out year after year! It also has had some challenging bugs. One of the most onerous (albeit to some degree a security feature) was a long-standing issue related to changing ownership of copied-up files in higher layers that dogged and confused many users over the years. That bug was finally fixed with the aufs feature dirperm1 added in PR #11799 in early 2015. Of course, it required that the kernel with aufs support had the newer dirperm1 capability, but today that would be true on any modern version of Ubuntu or Debian.
- Summary: If you are using Ubuntu or Debian, then obviously this graphdriver is going to be the default and will most likely meet the majority of your needs. The expectation is that one day it will be supplanted by the overlay implementation, but given challenges with overlay filesystem bugs and maturity in upstream kernels, that has yet to be realized. There is no quota support for aufs.
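The "ordered list provided via mount options" mentioned in the aufs implementation notes can be sketched as a small string-building helper. Treat this as an assumption-laden illustration: aufs_mount_options is a hypothetical function, the paths are made up, and the br: branch syntax should be checked against the aufs manpage for your kernel. The options Docker itself generates may carry additional flags.

```python
def aufs_mount_options(rw_dir, ro_dirs):
    """Build an aufs-style branch list: writable branch first, then read-only ones."""
    branches = [f"{rw_dir}=rw"] + [f"{d}=ro" for d in ro_dirs]
    return "br:" + ":".join(branches)


opts = aufs_mount_options(
    "/tmp/demo/rw",                       # hypothetical writable top layer
    ["/tmp/demo/app", "/tmp/demo/base"],  # hypothetical parent layers, topmost first
)
# On an aufs-enabled kernel, a root shell could then try something like:
#   mount -t aufs -o <opts> none /tmp/demo/merged
```

The ordering is the whole point: the graphdriver walks the parent chain and hands aufs the branches from the top of the stack down to the base layer.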
overlay
- History: The graphdriver for “OverlayFS” (the original upstream kernel name) was added by Alex Larsson of Red Hat in commit 453552c8384929d8ae04dcf1c6954435c0111da0 in August 2014.
- Implementation: Overlay is a union filesystem with a simpler concept than aufs’s branches model. Overlay implements its union filesystem via three concepts: a “lower-dir,” an “upper-dir,” and a “merged” directory for the combined view of the filesystem. Given there is only one “lower-dir,” extra work has to be done to either make lower-dir recursive (it itself being the union of another overlay), or, Docker’s implementation choice, to hardlink all the lower layer content into lower-dir. It is because of this potentially explosive inode use (for a large number of layers and hardlinks) that overlay has had some challenges in adoption. Overlay2 solves this by requiring a more recent kernel (4.0 and above) that provides a more elegant way to handle multiple “lower” layers.
- The Good: Overlay holds a lot of promise as the single focus for a complete union filesystem supported and merged into the mainline Linux kernel. Similar to aufs, it also enables shared memory between disparate containers using the same on-disk shared libraries. Overlay also has a lot of upstream Linux kernel focus on continued development (see overlay2 for an example) based on modern use cases like Docker.
- The Bad: The hardlink implementation has caused overlay to have inode exhaustion problems, inhibiting widespread use. Inode exhaustion is not the only problem: several bugs around user namespaces, SELinux support, and overall maturity have kept overlay from directly replacing aufs as the default graphdriver in Docker. Many of these issues are now solved, especially in recent kernel releases, and overlay has become much more viable. Overlay2 has now arrived and corrects the inode exhaustion issue, and should be the focus from Docker 1.12 forward for continued development of the overlay driver. For backwards compatibility reasons, however, the overlay driver will remain in the Docker engine to support existing uses.
- Summary: Adding the overlay graphdriver was a big step forward to have an upstream-integrated union filesystem that had the backing of the Linux kernel filesystem community given the lack of broad distro support for aufs. Overlay has matured a lot in the last 18-24 months, and with the inclusion of overlay2, some of its more troubling weaknesses have been solved. Look for overlay (or more likely overlay2) to be a replacement as the default graphdriver in the future. For best possible experience with overlay, note that the upstream kernel community fixed many issues in the overlay implementation in the 4.4.x kernel series; selecting a later instance of that series will provide the best possible conditions for overlay performance and stability*.
overlay2
- History: Derek McGowan added the overlay2 graphdriver to Docker in PR #22126, merged in June 2016 in time for the Docker 1.12 release. As the PR title noted, the main reason for a replacement to the original overlay was to add “multiple lower directory support,” solving the inode exhaustion problem from the original driver.
- Implementation: The overlay section above already describes the overlay framework in the Linux kernel. The PR linked above expresses the design changes, based on a newer feature of overlay in Linux kernel 4.0 and above allowing multiple lower directories.
- The Good: Overlay2 resolves the inode exhaustion problem as well as a few other bugs that were inherent to the old design of the original driver. Overlay2 retains all the benefits already noted for overlay, including the shared memory for libraries loaded from the same exact layer(s) across containers on the same engine.
- The Bad: About the only charge we could level against overlay2 is that it is a young codebase. Many early issues have quickly been solved through early testing, but Docker 1.12 is the first release offering overlay2, and we can assume other bugs may be found as use grows.
- Summary: The good news is that combining the upstream benefits of a modern, supported union filesystem in the Linux kernel with a performant Docker graphdriver that has few limitations should be the best path forward for a future default in the engine, one with broad support across many distributions of Linux.
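The difference between one lower directory and several shows up directly in the overlay mount options. The helper below is a sketch with hypothetical paths and a hypothetical function name. It uses the kernel's lowerdir, upperdir, and workdir options, where workdir is an extra scratch directory the kernel requires alongside the upper directory (not covered in the text above), and multiple lower directories are separated by colons with the topmost layer listed first.

```python
def overlay_mount_options(lower_dirs, upper_dir, work_dir):
    """Build overlay mount options; lower_dirs is ordered topmost layer first."""
    if not lower_dirs:
        raise ValueError("at least one lower directory is required")
    return "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lower_dirs), upper_dir, work_dir
    )


# Original overlay model: a single lower directory holding all parent content.
single = overlay_mount_options(["/l/base"], "/c/upper", "/c/work")

# Multiple lower directories (kernel 4.0 and above): one entry per parent layer.
multi = overlay_mount_options(["/l/app", "/l/libs", "/l/base"], "/c/upper", "/c/work")
# A root shell could then try: mount -t overlay overlay -o <options> /c/merged
```

With only one lower directory available, every parent layer's content has to be made visible in that single directory, which is where the hardlink-heavy design and its inode cost came from. Listing each parent separately removes that need.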
btrfs
- History: An implementation to use btrfs as the filesystem managing /var/lib/docker was added by Alex Larsson of Red Hat in commit e51af36a85126aca6bf6da5291eaf960fd82aa56 in late November 2013.
- Implementation: Btrfs has native features named subvolumes and snapshots. These two capabilities combine to provide the stacking and CoW-like features utilized by the graphdriver implementation. Of course, a disk formatted as a btrfs filesystem is required as the graphdriver root (by default, /var/lib/docker).
- The Good: Btrfs was seen as the future of Linux filesystems and received a lot of attention when it was first introduced in the mainline Linux kernel years ago (2007-2009 era). The filesystem is today solid and well-supported as one of many filesystem options within the upstream Linux kernel.
- The Bad: Btrfs hasn’t really been a mainstream choice for Linux distributions, so it is unlikely you have a disk formatted with btrfs on your system already. Because of limited use by default in Linux distributions, it has not received as much attention and testing as other graphdrivers.
- Summary: If you are using btrfs, then obviously this graphdriver may suit your needs. It has had a handful of bugs over the years, and for a while had no SELinux support, although that has been corrected. Also, quota support was added directly to the Docker daemon for btrfs in PR #19651 by Zhu Guihua, which was included in the Docker 1.12 release.
devicemapper
- History: Devicemapper came to life very early on as a simple wrapper of C code interacting with libdevmapper; Alex Larsson made this commit, 739af0a17f6a5a9956bbc9fd1e81e4d40bff8167, in early September 2013. A few months later this code was refactored to have the “graphdriver” moniker we now know; Solomon Hykes merged this in early November with the comment: Integrate devmapper and aufs into the common “graphdriver” framework.
- Implementation: The devicemapper graphdriver uses one of the many features of the Linux devicemapper code called “thin provisioning” or “thinp” for short. It is quite unlike the union filesystems described earlier in that devicemapper works on block devices. These block devices are thinly provisioned to give the same lightweight behavior as the union filesystem approach, but the most important thing to understand from an end-user perspective is that they are not file based. As you might expect, this affects capabilities such as easy differencing between layers, and it means there is no shared memory between containers for the same shared library segments.
- The Good: Devicemapper has received its share of disdain over the years, but it provided a very important capability for Red Hat distributions (Fedora, RHEL, Project Atomic-variants of the same) to have a graphdriver. Given it was also block device based instead of file-based, it has inherent capabilities like quota support not available (easily) within other implementations.
- The Bad: There is no way to get default “out of the box” performance with devicemapper. There are setup and configuration instructions that must be followed to get a reasonably performant configuration, and most importantly, you should not run “loopback” mode anywhere you expect serious use of the Docker engine. Some of the features rely on specific versions of libdevmapper (like deferred removal, which is absolutely necessary to reduce what appear to be engine “hangs” on slow removal in a heavily loaded system running devicemapper), and it requires above-average skill to validate all these settings on a system. Also, devicemapper will not work at all when the Docker engine binary is statically compiled, due to a requirement on udev sync support, which cannot be statically compiled into the engine.
- Summary: For Red Hat distributions, devicemapper has been the “go to” graphdriver and has received lots of support and improvements from the Red Hat team over the years. It has had its share of quality ups and downs and bugs as well, and without significant care in setup/configuration, can have very poor performance and quality compared to other options. Given overlay and overlay2 are supported on Fedora and RHEL and have SELinux support in recent kernels, unless there is a strong need for devicemapper in a specific Red Hat context, I assume users will migrate to the overlay options as they mature.
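Block-level copy-on-write, as used by thin provisioning, can be modeled in a few lines. This is a conceptual toy only: ThinPool and its methods are hypothetical names and have nothing to do with the libdevmapper API. It does show why snapshots are cheap and why the pool knows nothing about files.

```python
class ThinPool:
    """Toy thin pool: devices map logical block numbers to shared physical blocks."""

    def __init__(self):
        self.blocks = []    # physical block storage
        self.devices = {}   # device name -> {logical block number: physical index}

    def create(self, name):
        self.devices[name] = {}  # a new thin device uses no space until written

    def snapshot(self, origin, name):
        # A snapshot copies only the mapping table, never the data blocks.
        self.devices[name] = dict(self.devices[origin])

    def write(self, name, block_no, data):
        # Always allocate a fresh physical block so other devices keep the old one.
        self.blocks.append(data)
        self.devices[name][block_no] = len(self.blocks) - 1

    def read(self, name, block_no):
        index = self.devices[name].get(block_no)
        return None if index is None else self.blocks[index]


pool = ThinPool()
pool.create("base")
pool.write("base", 0, b"superblock")
pool.write("base", 1, b"libc")
pool.snapshot("base", "container-1")
pool.write("container-1", 1, b"patched")
```

The pool only ever sees numbered blocks. Working out which files differ between two devices means mounting both and comparing them, and two containers reading the same library go through different block devices, so the kernel cannot share the cached pages the way it can for a single inode.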
zfs
- History: The zfs graphdriver was implemented by Arthur Gautier and Jörg Thalheim in PR #9411, merged into the Docker engine in May 2015 and available to users since the Docker 1.7 release. The implementation relies on a third-party Go library, go-zfs, to handle the zfs command line interactions.
- Implementation: Similar to the btrfs and devicemapper drivers, a ZFS-formatted block device mounted in the graphdriver path (by default, /var/lib/docker) is required to use the zfs driver. The utilities for zfs (usually a package named zfs-utils on most distros) are also required as the zfs Go library will be calling out to these tools to perform the required work. ZFS has the capability to create snapshots (similar to btrfs), and clones of snapshots are then used as a way to share layers (which become a snapshot in the ZFS implementation). Again, given ZFS is not a file-based implementation, the memory sharing capabilities of aufs and overlay are not available here in ZFS.
- The Good: ZFS has been gaining popularity and is already in use as the filesystem for containers in Ubuntu’s LXC/LXD offering as of Ubuntu 16.04. ZFS has been around for a long time, created by Sun originally and used in Solaris and many BSD variants, and its Linux port seems stable and reasonably performant for container filesystem needs. The zfs graphdriver also received quota support in time for Docker 1.12 via PR #21946, bringing ZFS in line with btrfs and devicemapper regarding quota support.
- The Bad: Other than the lack of file-based (inode) sharing for shared libraries, it’s hard to say what the downsides are to ZFS as compared to the other block device based offerings. In comparison, ZFS seems to be gaining in popularity. For Linux distributions or UNIX variants where ZFS is fully supported and/or is already being used, the zfs graphdriver could be a very good choice.
- Summary: ZFS support is a valuable addition to the graphdriver stable in the Docker engine. For users of ZFS and distros where ZFS is playing a more major role, the ability for Docker to support the filesystem directly is a good benefit for those communities. Time will tell if there is an uptick in interest and use of this graphdriver as compared to overlay on default distro filesystems like ext4 and xfs.
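The snapshot-then-clone pattern described in the zfs implementation notes maps onto two zfs command line invocations. The sketch below only builds the argument lists and does not run them. The function name, the dataset names, and the snapshot name are hypothetical, and the exact sequence the Docker driver issues through go-zfs may differ.

```python
def zfs_child_layer_commands(parent_dataset, child_dataset, snapshot_name="layer"):
    """Return the zfs CLI invocations that derive a child dataset from a parent."""
    snapshot = f"{parent_dataset}@{snapshot_name}"
    return [
        ["zfs", "snapshot", snapshot],              # freeze the parent's contents
        ["zfs", "clone", snapshot, child_dataset],  # writable clone sharing its blocks
    ]


commands = zfs_child_layer_commands("tank/docker/base", "tank/docker/app")
```

Each list could be handed to subprocess.run on a host with the zfs utilities installed and a suitable pool. The clone starts out sharing all of its blocks with the snapshot and diverges only as the child layer is written.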
The Next Level of Detail
To really dig into the details of how each of these filesystems operates as a graphdriver would require thousands more words! More importantly, the Docker community has documented a good bit of this detail in the official storage driver documentation. Feel free to hop over there if your curiosity has been piqued by any of these drivers. Here are direct links to each of the articles for the graphdrivers: aufs, devicemapper, overlay, zfs, and btrfs.
If you’ve read carefully you might notice that I mentioned a “windows” graphdriver at the beginning but haven’t mentioned it since. Clearly the “windows” graphdriver is used as the storage driver on the recent port of the Docker engine to Windows Server 2016, announced at MS Ignite in Atlanta this week. I don’t have a lot of details personally about the implementation of the windows graphdriver, but hopefully we can have a future blog post or point to a Microsoft team post about how the graphdriver on Windows operates in the near future.
* Akihiro Suda has assembled a comprehensive list of issues for the various graphdrivers here that is an extremely useful resource if you are finding issues with any of the graphdriver options.**
** Thanks to Arnaud Porterie for the reference to Akihiro Suda’s graphdriver issue list.