As the dust is settling after Dirty COW was made public on the 19th of October, it is time to reflect on why it happened and what can be done to mitigate the effects of similar events in the future.
Dirty COW stands for “Dirty Copy On Write”, a reference to copy-on-write, the Linux kernel mechanism that handles writes to user pages. It is a fancy tag for CVE-2016-5195. I don’t like calling exploits by name because it turns the security process into a media event, giving disproportionate attention to the vulnerabilities with the catchiest names. However, in this case, the attention is well deserved as the vulnerability is critical. CVE-2016-5195 is a Linux kernel vulnerability that enables unprivileged local users to “gain write access to otherwise read-only memory mappings and thus increase their privileges on the system”. In other words, it allows local users to become root.
Vulnerabilities in Open Source projects are often discovered and reported by security researchers analyzing the codebase. Unfortunately, that is not what happened in this case. Phil Oester was carefully examining the HTTP traffic to one of his machines, which seemed to be compromised, when he recognized a binary among the data sent to the server. He managed to extract the code and cautiously ran it in a safe test environment. It cannot have taken him long to realize that he had an exploit in his hands, one that could work on up-to-date Linux distros. The exploit was entirely reproducible and relied on a bug present in the Linux kernel since September 2007. Black hats might have been using it for years before Phil reported the vulnerability. This scenario is the worst case, where no amount of pre-disclosure can limit exposure, as the bug is already being exploited in the wild.
= Inner workings of an exploit =
The vulnerability is a race condition in Linux’s copy on write implementation. Readers uninterested in deep technical details should skip to the next section.
The original exploit mmaps a setuid binary as read-only, then writes to the mapped pages in memory via /proc/self/mem while, on another thread, it repeatedly issues madvise(MADV_DONTNEED), which tells the kernel that those pages are not going to be used. MADV_DONTNEED is supposed to be just a performance hint, which is implemented by dropping the process’s mappings to free resources. If the process starts to use the memory range again, the kernel is supposed to create new mappings. But that is not what happens here. In this case, due to a bug, the Linux kernel drops the process’s private mappings of the setuid binary while the process is also attempting to write to them, resulting in the original set of pages being modified, which should be neither allowed nor possible.
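The intended, harmless behavior of copy-on-write and MADV_DONTNEED can be observed without any exploit. The following is a minimal Python sketch, assuming Linux and Python 3.8 or later. The function name cow_and_dontneed and the scratch file are illustrative only and come from neither exploit. It maps a temporary file privately, writes to the mapping, checks that the file on disk is untouched, and then asks the kernel to discard the private copy.

```python
import mmap
import os
import sys
import tempfile


def cow_and_dontneed():
    """Illustrative helper (hypothetical name): returns three 4-byte samples.

    1. the mapping right after a write (the private copy),
    2. the file on disk at that moment (expected to be untouched),
    3. the mapping after MADV_DONTNEED (Linux only, otherwise None).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * mmap.PAGESIZE)
        # ACCESS_COPY asks for a private, copy-on-write mapping of the file.
        m = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_COPY)
        try:
            m[0:4] = b"COPY"  # triggers copy-on-write; the file is not modified
            mapped_after_write = m[0:4]
            with open(path, "rb") as f:
                file_after_write = f.read(4)

            mapped_after_dontneed = None
            if sys.platform.startswith("linux") and hasattr(m, "madvise"):
                # Drop the private copy; the next access re-reads the file.
                m.madvise(mmap.MADV_DONTNEED)
                mapped_after_dontneed = m[0:4]
        finally:
            m.close()
        return mapped_after_write, file_after_write, mapped_after_dontneed
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(cow_and_dontneed())
```

On a correctly behaving kernel the file never changes: the write lands on a private copy, and MADV_DONTNEED simply throws that copy away. Dirty COW is the case where a write racing with that discard ends up on the original page instead.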
After the vulnerability was made public, a new exploit was published by @scumjr_ which uses ptrace and vDSO instead of /proc/self/mem and a setuid binary. The new exploit works from within any Linux container. ptrace is a system call that can be used to “trace” another program. It is principally a debugging and performance measurement tool. ptrace allows the caller to do many things, including copying data to a given memory address of the target process. vDSO stands for virtual Dynamically linked Shared Object: it is a Linux kernel mechanism to reduce the execution time of a small set of syscalls, typically timekeeping calls. Functions such as gettimeofday are called very frequently by userspace programs and libraries, including glibc. To avoid the overhead of calling into the kernel for every gettimeofday call, Linux offers an implementation of it in userspace in a read-only memory area called vDSO. Programs just need to call a routine at the right address, and they get the current time, without making any system calls. Very convenient.
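To see where the vDSO lives in a process, one can read /proc/self/maps and look for the entry labeled [vdso]. The sketch below is a small Python illustration, assuming a Linux system with /proc mounted. The helper name find_vdso is hypothetical. It returns the start address, end address, and permission string of the mapping. On typical systems the permissions are readable and executable but not writable.

```python
def find_vdso(maps_text):
    """Return (start, end, perms) of the [vdso] mapping, or None if absent.

    maps_text is the content of a /proc/<pid>/maps file.
    """
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            # The first field is "start-end" in hexadecimal.
            start, end = (int(x, 16) for x in fields[0].split("-"))
            return start, end, fields[1]
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            print(find_vdso(f.read()))
    except OSError:
        print("no /proc/self/maps here (not Linux?)")
```

The missing write permission on this mapping is what the vulnerability bypasses.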
@scumjr_’s exploit works as follows: it opens a server socket, then spawns two threads. One thread keeps calling madvise(MADV_DONTNEED), same as the other exploit, while another thread concurrently attempts to modify the vDSO area using ptrace. It tries to introduce code to open a socket and execute arbitrary commands. Then it waits for connections. Thanks to the Linux kernel vulnerability, the exploit succeeds in modifying the vDSO area, which really should not be possible. As vDSO is shared by all programs inside or outside the container, soon something running as root on the host will try to get the time, but instead it will execute the code planted there by the exploit. The unaware program will end up making a socket connection to the exploit and running commands outside the container on behalf of the attacker. Brilliant.
= The exploit in action =
The following video shows the exploit running from within a Docker container:
At the end of the video we see “failed to win race condition” because the exploit is not always capable of recognizing whether it has been successful, but in this case, it clearly was. The file “this_is_the_host” is only present on the host filesystem and should not be accessible from the container. In fact, after “0xdeadbeef” is called, all commands are executed on the host, demonstrating container breakout.
The exploit relies on a single Linux kernel instance being shared across all containers and the host, which is why containers are a poor isolation technique. Attackers only need one bug in Linux to take over the whole system. This attack would not be possible if each container had its own kernel instance. In fact, this class of problems can be entirely avoided by using lightweight virtual machines and PV Calls to run Docker applications.
= PV Calls =
In the PV Calls model, each container runs as a separate virtual machine, each with its own independent Linux instance. Traditionally, virtualization has always implied a non-negligible IO overhead. The idea behind PV Calls is to move virtualization up the stack, by virtualizing syscalls instead of hardware devices. Syscalls are higher level and made for software, thus easier to virtualize, leading to higher performance. Specifically, if we use PV Calls to forward networking syscalls to the host, we skip the guest network stack entirely. Benchmarks show up to 4x the network bandwidth of the traditional Xen networking PV drivers. See this article (https://blog.xenproject.org/2016/08/30/pv-calls-a-new-paravirtualized-protocol-for-posix-syscalls/) for more information.
As each Docker application runs as a separate virtual machine, exploits such as the one described in this article do not work. A kernel privilege escalation vulnerability only allows attackers to take control of their own virtual machine. Other container applications, running on separate VMs, are unaffected. The following video shows the behavior of the same exploit when running with PV Calls:
The exploit cannot escape the container. “this_is_the_host” is still inaccessible. Commands executed after “0xdeadbeef” are run within the same unprivileged container. I even had to write a script that calls “date” every second so that something would execute the code planted by “0xdeadbeef”, connecting to the open socket. Without the script, as there are no other processes within the container, nothing would end up calling the new code and connecting to the exploit. “0xdeadbeef” would be left waiting for connections. Safe but anticlimactic.
= What next? =
Another bug like the one that paved the way to Dirty COW will happen again. It will receive a different CVE number and a new fancy name, but it will be just as dangerous. If we look at the frequency of Linux kernel privilege escalation vulnerabilities in the past, it will likely happen within the next six months. But there is no need for panic or late-night calls. We have the tools and the knowledge to minimize its effects; we just have to use them.