User namespaces have arrived in Docker!

by estesp · Published October 13, 2015 · Updated February 8, 2016

02/08/2016 UPDATE: Check out my latest blog post on security and user namespaces now that Docker 1.10 is officially released.

01/13/2016 UPDATE: User namespaces have migrated out of experimental for inclusion in the Docker 1.10 release slated for February 2016. Minor changes have been made, but the following post is still effectively correct. The most significant change is that cached content created without user namespaces (e.g. image layers) is no longer migrated to /var/lib/docker/0.0 but left at the top level of the graph root directory, allowing for smooth upgrade/downgrade to and from 1.10 without impact.

Well, after a number of speed bumps along the way and more than a few discussions about how to implement the feature and how to expose it to users, I am happy to announce that our initial approach to user namespaces in Docker containers is now available. As announced in the tweet below from Arnaud, in the Docker 1.9 release timeframe the ability to enable user namespaces will be available only in the experimental build, while we come to understand more completely the edges of the functionality and how it interacts with other key features in the Docker ecosystem.

Phase 1 @Docker support for user namespaces is now in experimental branch! Congratulations @estesp! https://t.co/a25LizygRV

— Arnaud Porterie (@icecrime) October 10, 2015

Of course, your first question after all this excitement may be “So what are user namespaces?” User namespaces are simply another containment namespace available in the Linux kernel (and the youngest of them all), similar to the key namespaces (mount, uts, pid, and network) which are used to create these things we call containers. If namespaces in general are a new topic, Michael Crosby’s “Creating Containers” post is a worthy read. For more specific details on the user namespace functionality in Linux, the “Namespaces in Operation, part 5” entry by Michael Kerrisk at LWN.net is a great read.

Once you’ve brushed up on namespaces in general, and learned a bit more about user namespaces, you’ll probably understand that one of the most important features of user namespaces is that they allow containers to have a different view of the uid and gid ranges than the host system. Specifically, a process (and in our case, the process(es) inside our container) can be provided a set of mappings from the host uid and gid space, such that when the process thinks it is running as uid 0 (commonly known as “root”), it may actually be running as uid 1000, or 10000, or even 34934322 on the host. It all depends on the mappings we provide when we create the process inside a user namespace. From a security perspective this is a great feature: our containers can continue running with root privileges inside the container without actually having any root privilege on the host.
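To make the mapping idea concrete, here is a minimal sketch of how an ID seen inside a user namespace translates to a host ID. It models each mapping as the triple the kernel exposes in /proc/[pid]/uid_map (first ID inside the namespace, first ID on the host, length of the range). The names `IdMap` and `to_host_id` and the sample range starting at 10000 are illustrative assumptions, not anything Docker defines.

```python
from typing import NamedTuple, Optional, Sequence


class IdMap(NamedTuple):
    container_id: int  # first ID as seen inside the user namespace
    host_id: int       # first ID as seen on the host
    length: int        # number of consecutive IDs covered by this entry


def to_host_id(container_id: int, maps: Sequence[IdMap]) -> Optional[int]:
    """Translate an in-namespace uid/gid to the host uid/gid, or None if unmapped."""
    for m in maps:
        if m.container_id <= container_id < m.container_id + m.length:
            return m.host_id + (container_id - m.container_id)
    return None


# Hypothetical mapping: container IDs 0..65535 map to host IDs 10000..75535.
maps = [IdMap(container_id=0, host_id=10000, length=65536)]

print(to_host_id(0, maps))      # 10000 ("root" in the container, unprivileged on the host)
print(to_host_id(1000, maps))   # 11000
print(to_host_id(70000, maps))  # None (outside the mapped range)
```

With this mapping, a process that believes it is uid 0 is really host uid 10000, which is the property described above.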

In Docker’s case, we have provided a new daemon startup flag, --userns-remap, allowing the administrator to provide the name of a user, and optionally a group (or that user or group’s numeric ID), as the remapping user. The creators of the Linux user namespace functionality already thought through how to assign what they called subordinate ranges of user and group numeric IDs to a specific user, using the /etc/subuid and /etc/subgid files, so Docker looks for that remapping user’s ranges in the subordinate range files. Those ranges are then used to create the maps during process creation when applications are started either via docker run or docker exec on this daemon instance. More complete documentation on exactly how Docker uses these ranges is available in the user namespace experimental documentation.
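As a rough illustration of the lookup described above, the sketch below parses text in the /etc/subuid format (one `name:start:count` entry per line) and returns the ranges for one user. It is not Docker's implementation; the function name, the sample file contents, and the user name `remapuser` are assumptions made up for the example.

```python
from typing import List, Tuple


def subordinate_ranges(contents: str, user: str) -> List[Tuple[int, int]]:
    """Return (start, count) pairs assigned to `user` in subuid/subgid-style text."""
    ranges = []
    for line in contents.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, start, count = parts
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges


# Hypothetical file contents; real systems will have different users and ranges.
SAMPLE_SUBUID = """\
alice:100000:65536
remapuser:165536:65536
"""

print(subordinate_ranges(SAMPLE_SUBUID, "remapuser"))  # [(165536, 65536)]
```

The first ID of the returned range is what container uid 0 would map to on the host under this scheme; the same parsing applies to /etc/subgid for group IDs.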

Of course, the next step that users (and container cloud operators specifically) will want is the ability to specify mappings at the per-container level rather than per-daemon. In the case of a public cloud operator this would allow for the added security of each tenant having their own uid and gid ranges which have no overlap with other tenants, and of course we want to provide that ability. However, the critical missing functionality at this point is the ability to mount filesystems on Linux with a “shift offset” for the ownership information of each filesystem entry. Without this capability, Docker would have to “unshare” the filesystem components that today are shared using the various copy-on-write backend filesystem drivers available in Docker. This would mean losing the disk space savings that Docker provides today, where filesystem layers (for example the layers of the ubuntu:14.04 image) are shared between containers. Without a mount-with-uid-gid-shift function, Docker would need to copy each root filesystem and then manually chown the entire tree of files by the shift offset so that ownership matches the specific mappings desired for the container process which will act on the filesystem. There is work going on in the Linux community to enable this behavior at the filesystem level, but the timeframe is unclear, and therefore it is hard to predict when “Phase 2” of user namespaces can be implemented properly.
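To show why the manual approach is costly, here is a dry-run sketch of the “copy, then chown by a shift offset” step: it walks a directory tree and computes the owner each entry would need under a given offset, without changing anything. The function name and the offset of 10000 are illustrative assumptions; actually applying the plan would mean calling `os.lchown` on every entry as a privileged user, once per copied root filesystem.

```python
import os
import tempfile
from typing import Iterator, Tuple


def shifted_ownership(root: str, uid_shift: int, gid_shift: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (path, new_uid, new_gid) for root and every entry beneath it."""

    def entry(path: str) -> Tuple[str, int, int]:
        st = os.lstat(path)  # lstat describes a symlink itself, not its target
        return path, st.st_uid + uid_shift, st.st_gid + gid_shift

    yield entry(root)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield entry(os.path.join(dirpath, name))


if __name__ == "__main__":
    # Build a tiny stand-in for a container root filesystem and print the plan.
    with tempfile.TemporaryDirectory() as rootfs:
        os.mkdir(os.path.join(rootfs, "etc"))
        open(os.path.join(rootfs, "etc", "hostname"), "w").close()
        for path, uid, gid in shifted_ownership(rootfs, 10000, 10000):
            print(path, uid, gid)
```

Every file in every image layer would need this treatment for each distinct mapping, which is exactly the duplicated disk usage and work that a kernel-level shifted mount would avoid.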

For those who are already playing with user namespace functionality, please provide feedback on this currently-experimental feature in Docker so we can continue shaping it for future stable releases. If this all sounds interesting, but you don’t know where to start, read through the documentation on user namespaces, and then grab an experimental build from experimental.docker.com and try it out!

One important note: because the Docker daemon’s local cache of layer data must be segregated by the mappings provided, once you use an experimental build with user namespaces, the root of your graph directory (/var/lib/docker by default) will have one additional level of indirection which corresponds to the remapped root uid and gid. For example, if the remapping user I provide to the --userns-remap flag has subordinate user and group ranges that begin with ID 10000, then the root of the graph directory for all images and containers running with that remap setting will be /var/lib/docker/10000.10000. If you use the experimental build but don’t provide user namespace remapping, your current content will be migrated to /var/lib/docker/0.0 to differentiate it from remapped layer content. If you want or need to return to a Docker build without user namespace support, the simplest method is to stop the current experimental daemon, move the content in /var/lib/docker/0.0 back to /var/lib/docker, and then restart Docker using the daemon binary that does not have user namespaces enabled.
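The naming scheme in the note above can be sketched in a few lines. This only mirrors the `<uid>.<gid>` directory naming described in this post for the experimental build; the function name is a made-up helper, not part of Docker.

```python
import posixpath


def remapped_graph_root(graph_root: str, root_uid: int, root_gid: int) -> str:
    """Return the per-mapping graph directory: <graph_root>/<root_uid>.<root_gid>."""
    return posixpath.join(graph_root, f"{root_uid}.{root_gid}")


# Remapping user whose subordinate uid and gid ranges both start at 10000.
print(remapped_graph_root("/var/lib/docker", 10000, 10000))  # /var/lib/docker/10000.10000

# Experimental build started without --userns-remap (no remapping).
print(remapped_graph_root("/var/lib/docker", 0, 0))  # /var/lib/docker/0.0
```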

Looking forward to hearing from users who have been eager for the day user namespace support was available in Docker. It’s here!

Tags: Linux security user namespaces

Peter says:

July 3, 2017 at 8:38 am

Why is this “a great feature as it allows our containers to continue running with root privileges, but without actually having any root privilege on the host”? Why are we favoring the use of `root` as a user by default in containers at all? Isn’t this a tremendous security issue?

If it stays like this, the user namespace feature only reduces the risk for the host system (i.e. the Docker cloud providers). What about the risks application developers run because they use `root` instead of unprivileged users inside their containers? Shouldn’t we educate everyone to stop using `root` unless absolutely necessary?

- estesp says:
  
  July 6, 2017 at 3:40 pm
  
  The simple answer: some processes will require root privilege to perform their necessary actions (e.g. listen on a privileged port). If that’s the case, then user namespace-provided isolation sandboxes this “special root privilege” inside the user namespace without the container having real root privilege on the host.
  
  While that’s the case, it is very true (and recommended in the Docker documentation) that Dockerfile authors *should* use the `USER` instruction to switch to a non-root user. However, even after many years of recommending that course of action, very few images actually follow that advice, and so, again, user namespaces pragmatically provide a solution for this case.
  
  Of course it’s true that there are debates about the inherent security of the user namespaces implementation in the kernel, and we would have to go kernel version by kernel version to look at some of those edge cases and issues. I don’t think the debate will end anytime soon, so anyone using this feature should do so with knowledge and research of the potential issues. However, with a very recent kernel, I’m not sure there are increased risks in a majority of cases from using reduced root privilege as an application developer. In the future, user namespaces will also provide the added benefit of non-overlapping ranges per tenant, allowing cloud providers to fully cordon off filesystem access across multiple tenants, even if the containerized process runs as root or id 400 or id 4000.
  
  Reply

Docker: Are we there yet? | Moshe'z

February 1, 2016

[…] User namespaces — slated to land in February 2016, so pretty close. […]
Docker gets security enhancements – CSC Blogs

February 5, 2016

[…] This one has been a long time desired by many, and it improves the ability to create specific access policies through using multiple namespaces within a Docker host. Here is a demo. User Namespaces have been available for awhile in experimental, here is a good post that describes them. […]
Docker 1.10: Security and User Namespaces - Integrated Code

February 5, 2016

[…] on since last spring is officially part of the Docker runtime! I’ve already written a long blog post about user namespaces in Docker when it became part of the 1.9-era experimental builds a few months […]
Docker Security – part 2(Docker Engine) | Sreenivas Makam's Blog

March 6, 2016

[…] User namespaces in Docker […]
Case study: securing containers and payloads on the Menagerie system | developerWorks Open

March 15, 2016

[…] 1.10 is running under a separate user namespace. There is an excellent explanation on this in this blog – essentially the containers are executed where internal UIDs are mapped to an external UID range […]
Introduction to Docker Security — GracefulSecurity

May 8, 2016

[…] Remapping uids so that the container uid 0 does not match the hosts uid 0; […]
User Namespaces: 2017 Status Update and Additional Resources – Integrated Code

February 24, 2017

[…] My original blog post on the topic from October 2016 when user namespace support went into experimental around the Docker 1.9 release. Some design changes were made by the time Docker 1.10 released the capability outside of experimental, but for better or worse it is still the most read blog post on my site! […]
What is /var/run/docker.sock on Docker - FoxuTech

January 2, 2018

[…] Not too distant releases of Docker will probably alleviate some of the risk involved in sharing /var/run/docker.sock with containers. One very promising solution uses namespaces and is in the Docker 1.9.0 experimental build. If you want to know more, here’s a great read on Docker namespaces. […]
Docker security – CTQ

January 11, 2019

[…] Refer to the daemon command in the command line reference for more information on this feature. Additional information on the implementation of User Namespaces in Docker can be found in this blog post. […]
Docker security tools and stuff to read – Aironman techblog

September 10, 2021

[…] https://integratedcode.us/2015/10/13/user-namespaces-have-arrived-in-docker/ […]
