
A first look at docker.io

In my previous post about virtualenv, I took a look at a way of making python environments a little bit more generic so that they could be moved around and redeployed with ease. I mentioned docker.io as a new tool that uses a general concept of “containers” to do similar things, but more broadly. I’ve dug a bit into docker, and these are my initial thoughts. Unfortunately, it seems relatively Fedora un-friendly right now.

The first thing to examine is what, exactly, a “container” is. In essence, it’s just a file system: there’s pretty much nothing special about it. I was slightly surprised by this; given the claims on the website I assumed there was something slightly more clever going on, but the only “special sauce” is the use of aufs to layer one file system upon another. So from the point of view of storage alone, there really isn’t much difference between a container and a basic virtual machine.
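
As a very rough illustration of what that layering amounts to (the paths here are made up, and this needs an aufs-enabled kernel), the mechanism is little more than a union mount:

# mount a writable layer over a read-only base; the union is what the container sees
mkdir -p /srv/base /srv/overlay /mnt/union
mount -t aufs -o br=/srv/overlay=rw:/srv/base=ro none /mnt/union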

From the point of view of the runtime, there isn’t an awful lot of difference between a virtual machine and a container either. docker sells itself as a lightweight alternative to virtual machines, but of course there is no standard definition of a “virtual machine”. At one end of the spectrum are the minimal hardware OSen that can be used to assign different host resources, including CPU cores, to virtual machines, and those types of VM are effectively not much different to real hardware – the configuration is set on the fly, but basically it’s real metal. On the other end of the spectrum you have solutions like Xen, which make little to no use of the hardware to provide virtualisation, and instead rely on the underlying OS to provide the resources that they dish out. docker is just slightly further along the spectrum than Xen: instead of using a special guest kernel, you use the host kernel. Instead of paravirtualisation ops, you use a combination of cgroups and lxc containers. Without the direct virtualisation of hardware devices, you don’t need the various special drivers to get performance, but there are also fewer security guarantees.
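
To give a flavour of those building blocks (a sketch only, using the cgroup v1 layout as mounted on current systems – the exact paths may differ), the resource limits are just cgroup knobs rather than emulated hardware, and lxc adds the namespace isolation around processes placed in them:

# create a memory cgroup, cap it at 256MB, and move the current shell into it
mkdir /sys/fs/cgroup/memory/demo
echo $((256*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/tasks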

A couple of benefits of docker are commonly touted, and I’m not totally sold on all of them. One specific claim is that containers are “hardware independent”, which is only true in a quite weak way. There is no specific hardware independence in containers that I can see – if anything the opposite, since docker.io only runs on x86_64 hardware. If your container relies on having access to the NX bit, then it seems to me you’re relying on the underlying hardware having such a feature – docker doesn’t solve that problem.
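
For example (nothing docker-specific here), whether NX is available at all is purely a property of the host:

# the container sees exactly the CPU features the host kernel exposes
grep -qw nx /proc/cpuinfo && echo "host CPU has NX"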

The default container file system is set up to be copy-on-write, which makes it relatively cheap diskspace-wise. Once you have a base operating system file system, the different containers running on top of it are probably going to be pretty thin layers. This is where the general Fedora un-friendliness starts, though: in order to achieve this “layering” of file systems, docker uses aufs (“Another Union File System”), and right now this is not a part of the standard kernel. It looks unlikely to get into the kernel either, as it hooks into the VFS layer in some unseemly ways, but it’s possible some other file system with similar functionality could be used in the future. Requiring a patched kernel is a pretty big turn-off for me, though.
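
It is at least easy to check whether a given kernel will play ball – on a stock Fedora kernel, both of these come up empty-handed:

# aufs has to be compiled in or available as a module
grep aufs /proc/filesystems
modprobe aufs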

I’m also really unsure about the whole idea of stacking file systems. Effectively, this creates a new class of dependency between containers, one which the tools seem relatively powerless to sort out. Using a base Ubuntu image and then stacking a few different classes of daemon over it seems reasonable; having more than three layers begins to seem unreasonable. I had assumed that docker would “flatten out” images using some hardlinking magic or something, but that doesn’t appear to be the case. So if you update that underlying container, you potentially break the containers that use it as a base – it does seem to be possible to refer to images by a specific ID, but the dockerfile FROM directive doesn’t appear to be able to take those.
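
Both halves of that can be seen from the command line (a sketch – the placeholder image ID is illustrative):

# every image has an ID as well as a repository name/tag
docker images
# running by ID pins you to an exact layer...
docker run <image-id> echo hello
# ...whereas a dockerfile FROM line only appears to take a name, which
# silently floats as the underlying base image gets updated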

The net result of using dockerfiles appears to be to take various pieces of system configuration out of the realm of SCM and into the build system. As a result, it’s a slightly odd half-way house between a Kickstart file and (say) a puppet manifest: it’s effectively being used to build an OS image like a Kickstart, but it’s got these hierarchical properties that stratify functionality into separate filesystem layers that look an awful lot like packages. Fundamentally, if all your container does is take a base and install a package, the filesystem is literally going to be that package, unpacked, and in a different file format.
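
A minimal sketch makes the point (the image and package names are just examples, and the exact CLI flags depend on your docker version):

cat > Dockerfile <<'EOF'
# take a base and install a single package - the resulting layer is
# essentially that package's files in a different container format
FROM ubuntu
RUN apt-get update && apt-get install -y nginx
EOF
docker build .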

The thing that particularly worries me about this stacking is memory usage – particularly since docker is supposed to be a lightweight alternative. I will preface this with the very plain words that I haven’t spent the time to measure this and am talking entirely theoretically. It would be nice to see some specific numbers, and if I get the time in the next week I will have a go at creating them.

Most operating systems spend a fair amount of time trying to be quite aggressive about memory usage, and one of the nice things about dynamic shared libraries is that they get loaded into process executable memory as a read-only mapping: that is, each shared library will only be loaded once and the contents shared across processes that use it.
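
This is easy to see for any process – the grep below is inspecting its own mappings:

# libc is mapped read-only/executable from a single inode, shared by every process using it
grep 'r-xp.*libc' /proc/self/maps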

There is a fundamental difference between using a slice of an existing file system – e.g., setting up a read-only bind mount – and using a new file system, like an aufs mount. My understanding of the latter approach is that it’s effectively generating new inodes, which would mean that libraries loaded through such a file system would not benefit from that memory mapping process.
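
That understanding is unverified, but it can at least be poked at with stat, reusing the made-up paths from earlier: through a bind mount the device and inode numbers are unchanged, whereas through the aufs mount they should not be (the library name is invented):

mount --bind /srv/base /mnt/bind
# compare device:inode for the same file seen three ways
stat -c '%d:%i %n' /srv/base/lib/libexample.so /mnt/bind/lib/libexample.so /mnt/union/lib/libexample.so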

My expectation, then, is that running a variety of different containers is going to be more memory intensive than a standard system. If the base containers are relatively light, then the amount of copying will be somewhat limited – the usual libraries like libc and friends – but noticeable. If the base container is quite fat, but has many minor variations, then I expect the memory usage to be much heavier than the equivalent.

This is a similar problem to the “real” virtual machine world, and there are solutions. For virtual machines, the kernel same-page merging subsystem (KSM) does an admirable job of figuring out which sections of a VM’s memory are shared between instances, and evicting copies from RAM. At a cost of doing more compute work, it does a better job than the dynamic loader: identical data pages can be shared too, not just binaries. This can make virtual machines very cheap to run (although, if suddenly the memory stops being shareable, memory requirements can blow up very quickly indeed!). I’m not sure this same machinery is applicable to docker containers, though, since KSM relies on advisory flagging of pages by applications – and there is no application in the docker system which owns all those pages in the same way (for example) qemu would do.
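
For reference, KSM is driven through sysfs, but it only scans memory that an application has flagged with madvise(MADV_MERGEABLE) – which is exactly the hook qemu provides and a plain container process doesn’t:

# turn the KSM scanner on and see how much it has managed to merge
echo 1 > /sys/kernel/mm/ksm/run
grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing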

So, enough with the critical analysis. For all that, I’m still quite interested in the container approach that docker is taking. I think some of the choices – especially the idea about layering – are poor, and it would be really nice to see them implement systemd’s idea of containers (or at least, some of those ideas – a lot of them should be quite uncontroversial). For now, though, I think I will keep watching rather than doing much actively: systemd’s approach is a better fit for me, I like the additional features like container socket activation, and I like that I don’t need a patched kernel to run it. It would be amazing to merge the two systems, or at least make them subset-compatible, and I might look into tools for doing that. Layering file systems, for example, is only really of interest if you care a lot about disk space, and disk space is pretty cheap. Converting layered containers into systemd’able containers should be straightforward, and potentially interesting.
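
As a rough sketch of what that conversion might look like (paths invented, and assuming a layered image is already mounted somewhere, as in the earlier union example), you only need to flatten the layers into a plain directory tree and point systemd-nspawn at it:

# flatten the union into an ordinary directory - no aufs needed from here on
rsync -a /mnt/union/ /srv/containers/myapp/
# run it as a systemd container on an unpatched kernel
systemd-nspawn -D /srv/containers/myapp /bin/bash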

packaging a virtualenv: really not relocatable

My irregular readers will notice I haven’t blogged in ages. For the most part, I’ve been putting that effort into writing a book – more about this next week – hopefully back to normal service now though.

Recently I’ve been trying to bring an app running on a somewhat-old Python stack slightly more up-to-date. When this app was developed, the state of the art in terms of best practice was to use operating system packaging – RPM, in this case – as the means by which the application and its various attendant libraries would be deployed. This is a relatively rare mode of deployment even though it works fantastically well, because many developers are not keen on maintaining the packaging-level skills required to look after such a system. From what I read, Mozilla’s systems administrators deploy their applications this way.

For various reasons, I needed to bring up an updated stack pretty quickly, and spending the time updating the various package specifications wasn’t really an option. It didn’t need to be production rock-solid, but it needed to be deployable on our current infrastructure. The approach that I took was to build a packaged virtualenv Python environment: I’ve read online about other people who have tried this with relative success, although there are not many particularly explicit guides. So, I thought I would share my experiences.

The TL;DR version of this is that it was actually a relatively successful experiment: relying on pip to grab the various dependencies of the application meant that I could reliably build a strongly-versioned environment, and packaging the entire environment as a single unit reduced the amount of devops noodling. There is a significant downside: it’s a pretty severe misuse of virtualenv, and it requires a reasonably good understanding of the operating system to get past the various issues.

Developing the package

As I have a Fedora background, I’m not really happy slapping together packages in hacky ways. One of the things I’m definitely not happy doing is building stuff as root: it hides errors, and there’s pretty much no good reason to do anything as root these days.

In order to build a virtualenv you have to specify the directory in which it gets built, and without additional hacks that’s not going to be the directory to which it installs. So, the “no root build” thing immediately implies making the virtualenv relocatable.
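
The problem is easy to demonstrate (the build path here is illustrative):

# build as an unprivileged user, in a throwaway location...
virtualenv /home/build/pyenv-whatever
# ...and the absolute build path is baked straight into every script's shebang
head -1 /home/build/pyenv-whatever/bin/pip   # prints: #!/home/build/pyenv-whatever/bin/python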

The web page for virtualenv currently has this sage warning:

“The --relocatable option currently has a number of issues, and is not guaranteed to work in all circumstances. It is possible that the option will be deprecated in a future version of virtualenv.”

Wise words indeed. There are a tonne of problems with moving a virtualenv. File paths encoded directly into files are an obvious one, and virtualenv makes a valiant attempt at fixing up things like executable shebangs. It doesn’t catch everything, though, so some stuff has to be rewritten manually (by which I mean, as part of the RPM build process – obviously not by hand).

Worse still, it actively mangles files. Consider one of pillow’s binaries, whose opening lines become:

#!/usr/bin/env python2.7

import os; activate_this=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'activate_this.py'); execfile(activate_this, dict(__file__=activate_this)); del os, activate_this

from __future__ import print_function

Unfortunately this is just syntactically invalid python – future imports have to come first. Again, it’s fixable, but it’s manual clean-up work post-facto.

What to do about native libraries

Attempting to use python libraries with native portions, be they bindings or otherwise, is also an interesting problem. To begin with, you have to assume a couple of things: that native code will end up in the package, and that not all of it will have been cleanly built. The obvious example of both is that the system python binary is copied in.

This causes problems all over the shop. RPM will complain, for example, that the checksums of the binaries don’t match what it was expecting: this is because it reads the checksum from the binary directly rather than calculating it at package time, and prelink actually alters the binary contents (this happens after the RPM content is installed, but RPM ignores those changes for the purposes of its package verification).

Another example of native content not playing well with being packaged is that binaries will quite often have an rpath encoded into them. This is used when installing into non-standard locations, so that libraries can be easily found without having to add each custom location into the link loader search path. However, RPM rightly objects to them. It’s possible to override RPM’s checks, but that’s pretty naive. Keeping rpaths means bizarre bugs turn up when the paths actually exist (e.g., installing the environment package on the development machine building the package – which is quite plausible, given the environment package may end up being a build-time dependency of another).

Thankfully, binaries can usually be adjusted after the fact for both these things; it’s possible to remove the rpaths encoded into a binary, and to undo the changes prelink makes.
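
Concretely, the two fix-ups look like this (the same commands end up in the %install section of the spec file further down; the paths are illustrative):

# put the binary back to its pre-prelink state so the digests match again
prelink -u ./opt/pyenv/whatever/bin/python2.7
# inspect, and then strip, the rpath from a locally built binary such as uwsgi
readelf -d ./opt/pyenv/whatever/bin/uwsgi | grep -i rpath
chrpath -d ./opt/pyenv/whatever/bin/uwsgi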

In the end, I actually made a slightly hacky choice here too: I decided that the virtualenv would allow system packages. This was the old default, but is no longer, because it stops the built environments from being essentially self-contained. This allowed me to build certain parts of the python stack as regular RPMs (for example, the MySQL connector library) and have them be available within the virtualenv. This is only possible if there is going to be one version of python available on the system (unless you build a separate stack on a separate path – always possible), and it takes away many of the binary nasties, since the binary compilation process is then under the control of RPM (which tends to set different compiler flags and other things).

The obvious downside to doing that is that system packages are already fulfilled when you come to build the virtualenv, meaning that the virtualenv would not be complete. If that’s the intention, that’s OK, but it’s not always what’s wanted. I resorted to another hack: building the virtualenv without system packages, and then removing the no-global-site-packages flag file manually. This means you have to feed pip a subset of the real requirements list, leaving out those things that would be installed globally, but that seemed to work out reasonably well for me.
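
In concrete terms the two options look something like this (the requirements file name is invented; the flag file is where virtualenv keeps its isolation marker):

# (a) let the virtualenv see the system site-packages from the start
virtualenv --system-site-packages /opt/pyenv/whatever
# (b) build isolated, install only the venv-specific requirements,
#     then drop the isolation marker afterwards
virtualenv /opt/pyenv/whatever
/opt/pyenv/whatever/bin/pip install -r requirements/venv-only.txt
rm /opt/pyenv/whatever/lib/python2.7/no-global-site-packages.txt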

The rough scripts that I used, then, were these. First, the spec file for the environment itself:

%define        ENVNAME  whatever
Source:        $RPM_SOURCE_DIR/pyenv-%{ENVNAME}.tgz
BuildRoot:     %{_tmppath}/%{buildprefix}-buildroot
Provides:      %{name}
Requires:      /usr/bin/python2.7
BuildRequires: chrpath prelink

%description
A packaged virtualenv.

%prep
%setup -q -n %{name}

%build
rm -rf $RPM_BUILD_ROOT
mkdir -p $RPM_BUILD_ROOT%{prefix}
mv $RPM_BUILD_DIR/%{name}/* $RPM_BUILD_ROOT%{prefix}

# remove some things
rm -f $RPM_BUILD_ROOT/%{prefix}/*.spec

%install
# undo prelinking
find $RPM_BUILD_ROOT/opt/pyenv/%{ENVNAME}/bin/ -type f -perm /u+x,g+x -exec /usr/sbin/prelink -u {} \;
# remove rpath from build
chrpath -d $RPM_BUILD_ROOT/opt/pyenv/%{ENVNAME}/bin/uwsgi
# re-point the lib64 symlink - not needed on newer virtualenv
rm $RPM_BUILD_ROOT/opt/pyenv/%{ENVNAME}/lib64
ln -sf /opt/pyenv/%{ENVNAME}/lib $RPM_BUILD_ROOT/opt/pyenv/%{ENVNAME}/lib64

%clean
rm -rf $RPM_BUILD_ROOT

%files
%defattr(-,root,root)
%{prefix}opt/pyenv/%{ENVNAME}

(Standard fields like Name and Version are missing – using the default spec skeleton fills in the missing bits.) It’s not totally obvious from this, but I actually ended up building the virtualenv first and using that effectively as the source package:

# build the environment and install the pinned requirements into it
virtualenv --distribute $(VENV_PATH)
. $(VENV_PATH)/bin/activate && pip install -r requirements/production.txt
# fix up shebangs and activation scripts as far as virtualenv can manage
virtualenv --relocatable $(VENV_PATH)
# drop compiled bytecode and the isolation marker, and strip the build path
find $(VENV_PATH) -name \*py[co] -exec rm {} \;
find $(VENV_PATH) -name no-global-site-packages.txt -exec rm {} \;
sed -i "s|`readlink -f $(VENV_ROOT)`||g" $(VENV_PATH)/bin/*
# bundle the result up as the "source" for the RPM build
cp ./conf/pyenv-$(VENV_NAME).spec $(VENV_ROOT)
tar -C ./build/ -cz pyenv-$(VENV_NAME) > $(VENV_ROOT).tgz
rm -rf $(VENV_ROOT)

Improving on this idea

There’s a lot to like about this kind of system. I’ve ended up at a point where I have a somewhat bare-bones system python packaged, with a few extras, and then some almost-complete virtualenv environments alongside to provide the bulk of the dependencies. The various system and web applications are packaged depending on both the environment and the run-time. The environments tend not to change particularly quickly, so although they’re large RPMs they’re rebuilt infrequently. I consider it a better solution than, say, using chef/puppet or some other scripted system to create an environment on production servers, largely because it means all the development tools stay on the build systems, and you can rely on the package system to ensure the thing has been properly deployed.

However, it’s still a long, long way from being perfect. There are a few too many hacks in the process for me to be really happy with it, although most of those are largely unavoidable one way or another.

I also don’t like building the environment as a tarball first. An improvement would be to move pretty much everything into the RPM specfile, and literally just have the application to be deployed (or, more specifically, its requirements list) as the source code. I investigated this briefly and to be honest, the RPM environment doesn’t play wonderfully with the stuff virtualenv does, but again these are probably all surmountable problems. It would then impose the more standard CFLAGS et al from the RPM environment, but I don’t know that it would end up removing too many of the other hacks.

The future

I’m not going to make any claims about this being a “one true way” or some such – it clearly isn’t, and for me, the native RPM approach is still measurably better. Yes, it is slightly more maintenance, but for the most part that’s just the cost of doing things right.

What is interesting is that this kind of approach seems to be the way a number of other systems are going. virtualenv has been so successful that it’s now effectively a standard piece of python, and rightly so – it’s an incredible tool. Notably, pyvenv (the new standard-library tool) does not have the relocatable option available.

I’m slightly excited about the docker.io “container engine” system as well. I haven’t actually tried this yet, so won’t speak about it in too concrete terms, but my understanding is that a container is basically a filesystem that can be overlaid onto a system image in a jailed environment (BSD readers should note I’m using “jail” in the general sense of the word – sorry!). It should be noted that systemd has very similar capability in nspawn too, albeit less specialised. Building a container as opposed to an RPM is slightly less satisfying: being able to quickly rebuild small select portions of a system is great for agile development, and having to spin large chunks of data to deploy into development is less ideal, but it may well be that the benefits outweigh the costs.