User namespaces have arrived in Docker!
02/08/2016 UPDATE: Check out my latest blog post on security and user namespaces now that Docker 1.10 is officially released.
01/13/2016 UPDATE: User namespaces have migrated out of experimental for inclusion in the Docker 1.10 release slated for February 2016. Minor changes have been made, but the following post is still effectively correct. The most significant change is that cached content created without user namespaces (e.g. image layers) is no longer migrated to /var/lib/docker/0.0 but is left in the root of the graph directory, allowing for a smooth upgrade or downgrade to and from 1.10 without impact.
Well, after a number of speed bumps along the way and a more than trivial number of discussions on the questions of how to implement and how to expose to users, I am happy to announce that our initial approach to user namespaces in Docker containers is now available. As announced in the tweet below from Arnaud, for the Docker 1.9 release timeframe, the ability to enable user namespaces will be available only in the experimental build while we understand more completely the edges of the functionality and how it interacts with other key features in the Docker ecosystem.
Phase 1 @Docker support for user namespaces is now in experimental branch! Congratulations @estesp! https://t.co/a25LizygRV
— Arnaud Porterie (@icecrime) October 10, 2015
Of course, your first question after all this excitement may be “So what are user namespaces?” User namespaces are simply another containment namespace available in the Linux kernel (and the youngest of them all), similar to the key namespaces (mount, uts, pid, and network) that are used to create these things we call containers. If namespaces in general are a new topic, Michael Crosby’s “Creating Containers” post is a worthy read. For more specific details on the user namespace functionality in Linux, the “Namespaces in Operation, part 5” entry by Michael Kerrisk at LWN.net is a great read. Once you’ve brushed up on namespaces in general and learned a bit more about user namespaces, you’ll probably understand that one of their most important features is that they allow containers to have a different view of the uid and gid ranges than the host system. Specifically, a process (and in our case, the process(es) inside our container) can be provided a set of mappings from the host uid and gid space, such that when the process thinks it is running as uid 0 (commonly known as “root”), it may actually be running as uid 1000, or 10000, or even 34934322. It all depends on the mappings we provide when we create the process inside a user namespace. Of course, it should be clear that from a security perspective this is a great feature as it allows our containers to continue running with root privileges, but without actually having any root privilege on the host.
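To make the mapping idea concrete, here is a minimal sketch of the translation it implies. The kernel describes each mapping as three numbers (first ID inside the namespace, first ID outside it, and the length of the range), which is the layout of /proc/[pid]/uid_map. The `IdMap` type, the `to_host_id` helper, and the example range starting at 10000 are hypothetical names and values chosen for illustration; they are not part of Docker or the kernel.

```python
from typing import NamedTuple, Optional, Sequence


class IdMap(NamedTuple):
    """One mapping entry: first ID inside the namespace, first ID on the host, range length."""
    inside_start: int
    outside_start: int
    length: int


def to_host_id(mappings: Sequence[IdMap], inside_id: int) -> Optional[int]:
    """Translate an ID as seen inside the namespace to the corresponding host ID.

    Returns None when the ID is not covered by any mapping entry.
    """
    for m in mappings:
        if m.inside_start <= inside_id < m.inside_start + m.length:
            return m.outside_start + (inside_id - m.inside_start)
    return None


# Hypothetical mapping: container IDs 0..65535 map to host IDs 10000..75535.
uid_map = [IdMap(inside_start=0, outside_start=10000, length=65536)]

print(to_host_id(uid_map, 0))      # 10000: "root" in the container is uid 10000 on the host
print(to_host_id(uid_map, 33))     # 10033
print(to_host_id(uid_map, 70000))  # None: outside the mapped range
```

With a mapping like this, a process that believes it is uid 0 holds only the rights of an unprivileged host uid once it touches anything outside its namespace.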
In Docker’s case, we have provided a new daemon startup flag allowing the administrator to provide the name of a user, and optionally a group (or that user or group’s numeric ID), as the remapping user. Because the creators of the Linux user namespace functionality already thought through how to provide what they called subordinate ranges of user and group numeric IDs to a specific user using the /etc/subuid and /etc/subgid files, Docker looks for that remapping user’s ranges in the subordinate range files. Those ranges are then used to create the maps during process creation when applications are started either via docker run or docker exec on this daemon instance. More complete documentation on exactly how Docker uses these ranges is available in the user namespace experimental documentation.
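The lookup described above can be sketched roughly as follows. Entries in the subordinate range files take the form name:start:count. The sketch assumes that the ranges found for the remapping user are laid end to end starting at container ID 0; that is one plausible reading of the behavior described here, not a copy of Docker's implementation. The user names, the sample file content, and both helper functions are hypothetical.

```python
from typing import List, Tuple


def subordinate_ranges(content: str, user: str) -> List[Tuple[int, int]]:
    """Return (start, count) pairs for `user` from subuid/subgid-style text."""
    ranges = []
    for line in content.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, start, count = parts
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges


def build_id_map(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Lay the host ranges end to end from container ID 0.

    Each result entry is (inside_start, outside_start, length).
    """
    id_map = []
    inside = 0
    for start, count in ranges:
        id_map.append((inside, start, count))
        inside += count
    return id_map


# Hypothetical file content; real entries depend on how the host is configured.
sample = "remapuser:10000:65536\nalice:100000:65536\n"

print(build_id_map(subordinate_ranges(sample, "remapuser")))  # [(0, 10000, 65536)]
```

The same parsing applies to the group file, so a daemon would run it once per file to obtain the uid map and the gid map.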
Of course, the next step that users (and container cloud operators specifically) will want is the ability to specify mappings at the per-container level rather than per-daemon. In the case of a public cloud operator this would allow for the added security of each tenant having their own uid and gid ranges which have no overlap with other tenants, and of course we want to provide that ability. However, the critical missing functionality at this point is the ability to mount filesystems on Linux with a “shift offset” for the ownership information of each filesystem entry. Without this capability, Docker would have to “unshare” the filesystem components that today are shared using the various copy-on-write backend filesystem drivers available in Docker. This would mean losing the disk space savings that Docker provides today, where filesystem layers (for example, the layers of the ubuntu:14.04 image) are shared between containers. Without a mount-time uid/gid shift function, Docker would need to copy each root filesystem and then chown the entire tree of files manually using this shift offset, so that the ownership matches the specific mappings desired for the container process which will act on the filesystem. There is work going on in the Linux community to enable this behavior at the filesystem level, but it is unclear what the timeframe for this will be, and therefore it is hard to predict when “Phase 2” of user namespaces can be implemented properly.
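To show why the manual approach is unattractive, here is a rough sketch of what “copy, then chown with a shift offset” would involve for a single root filesystem. It visits every entry in the tree, so the cost grows with the number of files, and it would have to be repeated for every distinct mapping. The function names, the offsets, and the temporary directory standing in for a root filesystem are all hypothetical; this is not what Docker does.

```python
import os
import tempfile
from typing import Iterator, Tuple


def shifted_ownership(root: str, uid_offset: int, gid_offset: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (path, new_uid, new_gid) for root and every entry beneath it."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)  # lstat inspects symlinks themselves, not their targets
        yield path, st.st_uid + uid_offset, st.st_gid + gid_offset


def shift_tree(root: str, uid_offset: int, gid_offset: int, dry_run: bool = True) -> int:
    """Apply the shifted ownership to the whole tree; return the number of entries visited."""
    touched = 0
    for path, uid, gid in shifted_ownership(root, uid_offset, gid_offset):
        if not dry_run:
            os.lchown(path, uid, gid)  # changing ownership requires privilege
        touched += 1
    return touched


# Stand-in for a copied root filesystem: one directory and one file.
with tempfile.TemporaryDirectory() as rootfs:
    os.mkdir(os.path.join(rootfs, "etc"))
    with open(os.path.join(rootfs, "etc", "hostname"), "w") as f:
        f.write("demo\n")
    demo_count = shift_tree(rootfs, 10000, 10000, dry_run=True)

print(demo_count)  # 3 entries: the root, etc, and etc/hostname
```

The dry-run default keeps the example safe to run as an ordinary user; a real shift would also need the copy step first, which is where the shared-layer disk savings would be lost.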
For those who are already playing with user namespace functionality, please provide feedback on this currently-experimental feature in Docker so we can continue shaping it for future stable releases. If this all sounds interesting, but you don’t know where to start, read through the documentation on user namespaces, and then grab an experimental build from experimental.docker.com and try it out!
One important note: due to the need to segregate content in the Docker daemon’s local cache of layer data by the mappings provided, once you use an experimental build with user namespaces, the root of your graph directory (/var/lib/docker by default) will have one additional level of indirection which correlates to the remapped root uid and gid. For example, if the remapping user I provide to the --userns-remap flag has subordinate user and group ranges that begin with ID 10000, then the root of the graph directory for all images and containers running with that remap setting will reside in /var/lib/docker/10000.10000. If you use the experimental build but don’t provide user namespace remapping, your current content will be migrated to /var/lib/docker/0.0 to differentiate it from remapped layer content. If you want or need to return to a Docker build which is not enabled for user namespaces, the simplest method is to stop the current experimental daemon, move the content in /var/lib/docker/0.0 back to /var/lib/docker, and then restart Docker with a daemon binary that does not have user namespaces enabled.
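The directory naming described above can be expressed as a tiny helper. This only illustrates the layout described in this post for the experimental build (and, per the update at the top, the 0.0 migration no longer happens as of 1.10); the `graph_root` function is a hypothetical name, not something taken from the Docker code base.

```python
import posixpath


def graph_root(base: str, root_uid: int, root_gid: int) -> str:
    """Return the per-mapping graph directory: <base>/<remapped root uid>.<remapped root gid>."""
    return posixpath.join(base, f"{root_uid}.{root_gid}")


print(graph_root("/var/lib/docker", 10000, 10000))  # /var/lib/docker/10000.10000
print(graph_root("/var/lib/docker", 0, 0))          # /var/lib/docker/0.0
```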
Looking forward to hearing from users who have been eager for the day user namespace support was available in Docker. It’s here!
Why is this “a great feature as it allows our containers to continue running with root privileges, but without actually having any root privilege on the host”? Why do we favor `root` as the default user in containers at all? Isn’t this a tremendous security issue?
If it stays like this, the user namespace feature only reduces the risk for the host system (i.e. the Docker cloud providers). What about the risks application developers run because they use `root` instead of unprivileged users inside their containers? Shouldn’t we educate everyone to stop using `root` unless absolutely necessary?
The simple answer: some processes will require root privilege to perform their necessary actions (e.g. listen on a privileged port). If that’s the case, then user namespace-provided isolation sandboxes this “special root privilege” inside the user namespace without the container having real root privilege on the host.
While that’s the case, it is very true (and recommended in the Docker documentation) that Dockerfile authors *should* use the `USER` instruction to switch to a non-root user. However, even after many years of recommending that course of action, very few images actually follow that advice, and so, again, user namespaces pragmatically provide a solution for this case.
Of course it’s true that there are debates about the inherent security of the user namespaces implementation in the kernel, and we would have to go kernel version by version to look at some of those edge cases and issues. I don’t think the debate will end anytime soon, so anyone using this feature should do so with knowledge of, and research into, the potential issues. However, with a very recent kernel, I’m not sure there are increased risks in a majority of cases from using reduced root privilege as an application developer. User namespaces will also provide the added benefit in the future of non-overlapping ranges per tenant, allowing cloud providers to fully cordon off filesystem access across multiple tenants, even if they use root or id 400 or id 4000 as the user that owns the containerized process.