Habitus: the right way to build containers

So, after my previous slightly ranty post, I’ve been trying out a few different tools and approaches to building containers, attempting to find something which is closer to my idea of what good looks like. One tool stands out from the rest: Habitus.

(Not to be confused with Habitat, an annoyingly similar project with an annoyingly similar name. I have no idea which came first; suffice to say, I had heard of Habitat before and discounted it as irrelevant to my use cases – and therefore almost overlooked Habitus during my research.)

Habitus provides just-enough-Make to bring some sanity to the Docker build process, with the following killer features:

  1. ability to order container builds by expressing a dependency from one to another
  2. first-class support for artefacts created during the container build process, which can be extracted and used as input for later builds
  3. management API to provide build-time secrets into the containers

It’s not all sunbeams and kittens, but functionally this is a truly excellent start. In order to separate out my build and runtime containers, I had previously resorted to using make, doing something like the following to extract a build artefact in a two-step process (this is – obviously – a node.js app):

node_modules.tgz: Dockerfile.build
    docker build -t myapp/build:latest -f Dockerfile.build  .
    docker run myapp/build tar -cz -C /srv/app node_modules > node_modules.tgz

app: Dockerfile node_modules.tgz
    docker build -t myapp/app -f Dockerfile  .

So I would run make app, and this would kick off a build process creating some output – like the node_modules installation – which would be captured as an artefact for use in the main build, by manually running tar inside the container and redirecting its output. Remember, the point of this is that my npm install process creates some native libraries and things, so my build container has all the build-essential toolchain and whatnot that I really don’t want in my production app container.

In Habitus-speak, the build.yml file – which looks a bit like compose.yml if you squint – looks like this instead:

build:
  version: 2016-03-14
  steps:
    builder:
      name: myapp/builder
      dockerfile: Dockerfile.build
      artifacts:
        - /srv/app/node_modules.tgz
    deployment:
      name: myapp/app
      dockerfile: Dockerfile
      depends_on:
        - builder
      cleanup:
        commands:
          - apt-get clean autoclean
          - apt-get autoremove -y
          - rm -rf /var/lib/{apt,dpkg,cache,log}/

At first glance it’s not totally obvious why this is better – it’s more verbose – but this is just a really simple example. The Habitus approach scales much better, the build file is more self-documenting, and you don’t have to play lots of different tricks (or teach fellow devs Make syntax…). You can have multiple artefacts; as mentioned, you can inject secrets (not shown in the build.yml above – there’s a rough sketch after the list below); and in fact Habitus will also do a couple of other things for you:

  • image squashing kind of comes for free. That, combined with the cleaner build system, saved me a couple of hundred MB up-front in container size (admittedly, this was a trivial container with obvious problems – the point is, I didn’t have to hand-optimize it)
  • you can add clean-up steps to containers. See my myapp/app container above – you can throw in a bunch of container-trimming commands, and with the image squashing in play this means you get actual size reduction. Much better than attempting to stuff as much logic into single RUN invocations as possible.
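As for secrets, the rough shape – a hedged sketch from memory of the Habitus docs of the time, so treat the exact keys, default port and API path as assumptions and check the current documentation – is that you declare them per step in build.yml and fetch them over Habitus’ local API during the build, so they never persist in an image layer:

# sketch only - key names and the secrets API path may not match current Habitus docs
build:
  version: 2016-03-14
  steps:
    builder:
      name: myapp/builder
      dockerfile: Dockerfile.build
      secrets:
        ssh_key:
          type: file
          value: _env(HOME)/.ssh/id_rsa

# and inside Dockerfile.build, the secret is pulled at build time, used, then removed:
#   ARG host
#   RUN wget -O /root/.ssh/id_rsa http://$host:8080/v1/secrets/file/ssh_key \
#    && chmod 0600 /root/.ssh/id_rsa \
#    && npm install \
#    && rm /root/.ssh/id_rsa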

After all that good stuff, what are the downsides? Well, there are a few. The first is that this isn’t built into Docker – you need to go grab Habitus separately. Thankfully, it’s a stand-alone Go binary, so this is no hard task. Personally I think it’s crazy that this isn’t part of the base but there we go.

Building in Habitus is generally slower. This is due to a couple of things: obviously, some of what it is doing is more sophisticated, so that takes time, but it also encourages you to split things out into separate container steps, and each one of those adds overhead. It’s not drastically slower, though, and (dependencies allowing) it will build things in parallel.

Lastly, the documentation actually isn’t all that great. If you’re comfortable piecing things together it’s ok, but there’s not an awful lot of support. For example, running the thing in the first instance is surprisingly complicated – it assumes that you’re running docker-machine or similar, and that you have certain things in the environment. It will throw away “unused” containers by default too, and some of the other defaults are a bit suspect. Working on a basic development system, the correct invocation for me is:

sudo habitus --use-tls=false --host=unix:///var/run/docker.sock --binding=127.0.0.1 --noprune-rmi

This will vary from installation to installation, though, and the various messages that Habitus comes out with don’t point the finger in the right direction all the time.

Also, if you make mistakes in your build.yml, the messages can be quite cryptic too. I accidentally removed some artefact files in a cleanup step, for example, and got a pretty bizarre message for my trouble:

2016/11/30 20:39:27 ▶ Starting container a4b9dd5eeb2ac86ce978040a6b7ec94e94ac69dc208 to fetch artifact permissions 
stat: cannot stat '/srv/app/node_modules': No such file or directory
2016/11/30 20:39:42 ▶ Failed to fetch artifact permissions for /srv/app/node_modules: strconv.ParseInt: parsing "": invalid syntax

So it fails with “invalid syntax”, but the real error is on the line above. Luckily, this stuff is not too difficult to grapple with if you’ve used tools like this before (let’s face it, the state of the art in this area is not particularly strong), but I would worry that some users will find the initial learning curve a bit steep until they get into a groove of building their containers.

I’m utterly convinced this is the right way to build containers at this point. It doesn’t solve everything – the Dockerfiles are still there rocking out like ’80s sh scripts – but the overall infrastructure Habitus provides is great.

 

The ongoing poverty of the Docker environment

I spent a few hours this weekend attempting to re-acquaint myself with the Docker system and best practices by diving in and updating a few applications I run. I wrote up an article not long after Docker’s release, saying that it looked pretty poor, and unfortunately things haven’t really changed – this doesn’t stop me using it, but it’s a shame that the ecosystem apparently has learnt nothing from those that came before.

There are, of course, certain things you can do to make your life easier: choosing Debian Jessie as a base apparently isn’t one of them. If you’re planning on launching node.js applications you have an entire world of pain to look forward to; my advice is to use ‘nodesource/jessie’ as a base and pretend all the stuff underneath doesn’t exist – attempting either to use the default Debian node or to manage the nodesource installation yourself just isn’t worth the hassle.

But anyway. The main concern is that the build process is so poor. You pretty much have a single environment which is going to serve as both build and runtime by default, and looking around at the various public containers you can see most of them are bloated with the flotsam and jetsam of the build process. Building still requires root by default too, and the vast majority of example Dockerfiles out there slap binaries together in a manner entirely reminiscent of wattle-and-daub plasterwork.

A great example of this is the build container or toolbox pattern. Most people don’t want an environment to serve for both build and runtime, so they separate the two and ensure that one container is used to create (build) the artefacts required at runtime, keeping the dependency chains separate and reducing the size of the output being pushed into production.

However, Docker gives you literally no tools to manage this cleanly. There is no build pipeline, so you have to create it yourself. How are you going to transfer the build artefacts from one container to another? You can create a shared volume between the containers, but that makes deployment unnecessarily complicated. You can also use docker cp to pull artefacts out of a container created from the built image. My personal choice is to extract them using docker run, like this:

docker run somenodeapp/build tar -c -C /srv/app node_modules > node_modules.tar

This works pretty great, except that you cannot integrate it into the actual Docker build process: you cannot express a dependency from the build container to the app container, and you cannot pull in the build artefacts. So, I end up with some external tool, like make. You don’t get the full benefit of dependency caching, though, because Docker only has a concept of caching layers. Also, with multiple builds happening, your working directory grows with the various build artefacts, and every time you start a new build it sends a bunch more data to the Docker daemon (bizarrely, even now, it’s commonplace to build apps as root and forward the data to the main daemon to produce an image – although it seems bubblewrap is getting close to a proper solution again).
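For completeness, here’s a minimal sketch of the two Dockerfiles in that arrangement – the names, paths and server.js entry point are hypothetical, and nodesource/jessie is used as the base per the advice above – relying on the fact that ADD unpacks a local tar archive into the image:

# Dockerfile.build - the heavyweight build image: the toolchain needed to
# compile native node modules lives here, and only here
FROM nodesource/jessie
RUN apt-get update && apt-get install -y build-essential python
WORKDIR /srv/app
COPY package.json /srv/app/
RUN npm install

# Dockerfile - the runtime image: ADD unpacks the tarball extracted above, so
# none of the build toolchain ends up in the production container
FROM nodesource/jessie
WORKDIR /srv/app
COPY package.json /srv/app/
COPY . /srv/app/
ADD node_modules.tar /srv/app/
CMD ["node", "server.js"]

In practice you also want a .dockerignore to keep the tarball itself (and the rest of the build context) out of the runtime image.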

The Dockerfile build process itself is also super-simplistic. I suspect this is why people get started quickly with it – although I don’t think having a shallow learning curve in general precludes a process from being technically good. It’s essentially throwing together a packaged Linux system using a simple shell script, like the rc files of old. It’s like dpkg and rpm were never invented; and all the problems you would expect in terms of rolling your own are there: the functional atoms (copy/add a file, etc.) are basic.

A lot of Docker devotees claim that Docker removes the need for system configuration management. This is sadly untrue: it’s still there, and you have to do it using the basic tools that Dockerfile gives you. You’ll often see a lot of sed and awk in Dockerfiles, or plain overwriting of system files, because none of the finer-grained tools like Ansible/Augeas and friends are there by default (although I note you can now build Docker containers with Ansible – which is something I intend to try, as it is a much more reasonable approach).
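To illustrate, this is the sort of thing you see everywhere – a hypothetical snippet (the mirror hostname is invented): system files edited in place or overwritten wholesale, with no templating, no idempotence and no rollback:

# ad-hoc "configuration management", Dockerfile-style
FROM debian:jessie
RUN sed -i 's|httpredir.debian.org|mirror.example.com|g' /etc/apt/sources.list \
 && echo 'Acquire::Languages "none";' > /etc/apt/apt.conf.d/99translations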

There are some similar systems out there which highlight the major difference in quality. flatpak and associated tooling is much more influenced by traditional package management and systems concerns, and it shows – it’s much better thought-through, with a full theory on how to divide up the applications and a build process to match.

I don’t personally understand why this tool has been built in this way, given all the good stuff already created and available with standard packaging tools. Linux packages are already very very close to the Docker concept of layers, and Dockerfiles are quite reminiscent of ancient versions of Kickstart. But, it misses all the stuff you’d take for granted as a Linux packager – clean build roots (the layer caching basically prevents this, but turning it off is an exercise in pain), user-mode building, build artefact and process management, etc. etc.

You might be asking at this point why I’m continuing to bother with Docker. Popularity is part of the answer – it’s a tool you have to know at this point, even though some of the time it feels like we’re back to the sysadmin stone age.

But that would be a bit trite. The real answer is that there are a couple of things that Docker has definitely got right. The main one isn’t due to Docker itself, but the concept of containers being largely stateless as a design principle is powerful: I think Linux packaging could learn from this, rather than attempting to manage configuration in the way it does (and I think flatpak is a route forward here). Removing configuration and storage from scope allows for some interesting design choices, although these are largely for naught if you don’t have control over the underlying software (actually, it can make things more complex – but if you’re writing the software yourself, it encourages much better design practices, and passing secrets consistently through the environment is somewhat reasonable, if not ideal).

The second one is that the contract or API between container(s) and the host is network ports (for the most part – you can of course do file system tricks and all sorts). This turns out to be a surprisingly good point of abstraction. I kinda disagree with the “one container, one process” dictum because it largely breaks this abstraction: for me, it should be “one container, two services”, where one of those services is a heartbeat. This is a good unit of composition, and frankly I don’t care whether a service is one process or many, threaded or not. And – to be clear – a service doesn’t mean “self-contained web application” to me; the various all-in-one containers for popular apps are common container disaster areas. My position is only a slightly nuanced version of the accepted best practice.
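To make that concrete, here’s a hypothetical compose-style sketch (image name and ports invented): one container, two services, and both of them are just ports as far as the host is concerned:

version: "2"
services:
  myapp:
    image: myapp/app
    ports:
      - "8080:8080"   # the actual service
      - "8081:8081"   # the heartbeat, served by the same container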

Now, the networking is far from perfect – years-old issues with basic functionality, like being able to address the IP of the Docker host, go unsolved without hacks – and various “solutions” have in general been deprecated. Many orchestration tools, in the meantime, have gone ahead and created their own solutions to the networking conundrums people face, yet there is still no standard, solid service discovery solution.

I personally think Docker is not long for this world. I do think, though, that a lot of the basic concepts are going to go forward – probably with smaller, more capable tools that better incorporate learning from traditional systems management – and like the world of Javascript, it’s pretty cool to have the orthodoxy turned upside down once in a while.

A short review: The Agile Team Onion

This is a quick and pithy review of Emily Webber’s free e-book, “The Agile Team Onion”. At about 20 pages of content, it’s a concise enough work itself – I personally appreciate the laser-like focus on a single subject; in this case, it’s thinking about the various factors that affect agile team make-up, sizing and interfacing with other people and teams.

A few negatives first, to get them out of the way. The big one for me is that the book is silent on the customer. This is a hobby-horse of mine, because I see it constantly: teams will have some kind of “stakeholder”, “customer champion”, or other stand-in for the real thing. I call these roles “decoy customers”: they distract us from noticing that the team doesn’t actually have real contact with a customer, and (more importantly) these people are often more chicken than pig. Sales may be important to them, for example, but they’re not fundamentally representative of the users.

In my opinion: teams that do not have constant customer interaction (or, worse, are one or more steps removed from the customer) will build more features that customers do not value than teams which do have that interaction.

Looking at the agile manifesto, we see the word “customer” appears twice (notably, in the first two principles) and “business people” only once. I’m not sure that’s deliberate, but I like that ratio. It’s sad in a sense that many manifestations of agile, such as scrum, don’t recognise the customer as a specific role in the team – I think that’s an oversight, brought about by the agile principles not really calling it out. The agile manifesto places great emphasis on “customer collaboration” and responding to change, and the best way I know to do that is to have customers actively engaged with the team in multiple ways.

Second, while the book places great emphasis on Dunbar’s number (rightly so), I think it misses other reasons to think about small team sizes. Small teams will communicate efficiently, as the book states, but they will also produce artefacts that are:

  • limited in scope in terms of their overall complexity (and therefore are likely to be more amenable to testing, service-oriented design and less likely to be too-tightly coupled)
  • abstracted from the overall design of the organisation.

That second point is subtle, but important. Conway’s law states “organizations which design systems [..] are constrained to produce designs which are copies of the communication structures of these organizations”.

Small agile teams that are fundamentally multi-disciplinary transcend strict chain-of-command organisational boundaries, and allow the end product (the software) to be designed in broader, more abstract ways. Do we break free of Conway’s law entirely? No, of course not: small teams with well-thought-out communication lines to the other teams they rely on look an awful lot like microservices, so it’s no coincidence that service-oriented architecture is all the rage. But there is a lot to be said for loose, independent coupling through well-designed APIs: I think this is effectively the software equivalent of the organisational structure that Webber argues in favour of.

So, onto the positives, and there’s a lot to like about this short book. As I said before, I think the brevity is highly appreciated, and there are plenty of points in here which should get people thinking about how their agile teams function currently.

Recognising silo thinking is absolutely crucial. I think everyone notices this in different ways according to their experience – mine is hearing within programme teams that certain features or requirements had been mandated by “the business”. Of course, I never managed to find out who that shadowy cartel was made up of – suffice to say they communicated solely through intermediaries – but it was a classic information silo in the way Webber describes.

I particularly like the accessibility of this book. I may well end up giving it to teams to help them think about their own team design – and I think it could be a useful tool to reinforce self-organisation – but actually, I’m very interested in sharing it with people outside of my agile teams. Not everyone understands the reasoning behind the team design and make-up, and there are some great, easy-to-understand points in here which communicate this very well.

There are lots of useful questions in this book, which will really aid people both in the division of pigs/chickens and in ensuring influence is measured and/or limited as necessary. There is always a massive risk of people senior on the organisation chart attempting to throw their 2p in at every opportunity given to them, and being clear about who the decision makers are, how decisions are made and communicated, and (importantly) when not to take decisions is a very powerful tool that teams can use to shield themselves.

All in all, I’ll be recommending this book to people thinking through these problems: there are a lot of resources focussed on “how to be agile”, but very few on the “who”, and although I think that’s a relatively limited problem, I also think it’s crucial to get right.

Quote

“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.” – first principle, Agile Software Manifesto (my emphasis)

Deadlines, estimates, predictions

Project management is always guaranteed to bring out some strong opinions, and a recent Twitter discussion was no different – but, while the core discussion on Twitter was great, it really deserves a much longer-form treatment. Paul Johnston wrote up his thoughts about getting people to talk about predictions instead of deadlines – and much of it is hard to argue with, but I have a bit of a different perspective.

To my mind, there tend to be approximately two types of deadlines:

  • fixed deadlines: these involve an independent event which must be met. A caterer who is supplying a wedding, for example, has a fixed deadline. In the product world, these tend to be rare – writing a mobile app for the Olympics 2016 would be a fine example, but writing a mobile app for Athletics news would not. Sales/marketing would definitely want it out before the Olympics, but that’s not the same thing.
  • soft deadlines: there is no independent event that must be met. This is the usual type of deadline; often a piece of work needs to be done for a deadline in order to co-ordinate with other work happening (press releases, other product development, rebranding exercises, etc)

The core difference between these two things is Paul’s “after deadline epoch”: with a fixed deadline, we’re really saying (in the case of software) that the artefact will have little or no value past a given date. This is like a punnet of strawberries; once it’s gone past the use-by date, it’s no good.

In the soft deadline world, this is not the case – the potential value of the artefact is largely unchanged. However, value is only one side of the equation; a deadline slipping represents incrementally increasing cost. And this is where we get to the nub of the business problem: what is our opportunity cost?

A quick aside: I had a wonderful conversation with a recently-qualified project manager a few weeks ago. We were talking about the benefits and drawbacks of a framework like PRINCE2 for running projects, and to what extent using it in real-life looks like the theory. I made one claim that shocked her, though: that I had never, ever seen a project manager shut down a project of their own volition.

Those with extensive commercial experience are probably entirely unsurprised by that statement. But, to a PRINCE2 “theorist”, this is amazing: a core tenet of PRINCE2 is the continual checking and questioning of whether a project will deliver its aims, the planned value. This is called “Continued Business Justification”, and it should happen throughout the project lifecycle – and if there is no justification, you stop the project.

People who have interviewed project managers will often hear the refrain, “Well, my PID (Project Initiation Document) is my Bible”, but when talking of the PID the interviewee rarely mentions the business case – which is practically the most important document in PRINCE2.

It’s easy enough to understand why. We’re all subject to the sunk-cost fallacy, and because it usually only becomes obvious that a project is failing towards the end of its plan (people are also great at deluding themselves), the sunk cost tends to be pretty large by that point.

What does this have to do with deadlines? Well, when a business decides to initiate some work – whether in a project format or otherwise – it generally should have some kind of business case, whether formal or not. But if you cannot articulate the cost of doing something, it’s difficult to articulate the net value, and it’s difficult to compare to other options. That’s why people often want a deadline – the project costs need to be constrained, so that different options can be evaluated and the business can take a view on what to actually invest in.

Equally, when it comes to delivery, missing a deadline usually means additional cost (which implies declining business value), but the additional cost tends not to be incremental: it’s quite often catastrophic, in fact, particularly if recognised late in the day.

So, putting all this together, I think deadlines are inevitable, a fact of life, and I think in many cases the word “prediction” is wrong: while that’s fine for the development side of the deadline (it is indeed an estimate), it’s wrong to think of the goal as a whole as a prediction.

I think the better question is, in the context of software development, “How do we do this better?”. And there’s a few suggestions I have, most of which I try to put into practice (although often there are antagonizing factors that weaken them – that’s a whole blog post in itself).

First, separate the development estimates from the deadline. A set of development estimates should never, ever be a single value – it’s not possible to say “We think we can complete this functionality in two months”. We can (or, should be able to) make statements like, “On balance, we believe it is likely this functionality can be completed in two months” – and it should be possible to put some kind of likelihood on that. 90% probability sounds pretty good, but it means in practice we still expect one project in ten to be late, and most estimates are not 90% certainty. The rest of the business needs to understand that.

Figuring out the likelihood is tough. I like the “triangular estimate” practice for individual stories, but when combining these together we actually need a little bit of statistics. I prefer to pretend they’re a Poisson distribution (and this tends to fit in practice), but equally you could turn them into pure PDFs and run Monte Carlo. I could go into a series of posts about this, and probably will, but you can also just look at some historical projects to get a feel: take the estimates for previous projects, and compare to the actuals. You’ll have a probability distribution there that you can just reapply to your current estimates to give you a feel for worst-case and likelihood.

Second, slack people, slack! Lots of project managers will build slack into a project, but don’t understand it. A common mistake is that people come up with a development estimate, and they say “Ok, let’s turn this into a deadline. But we know just taking this date (as it is) is wrong, so we’ll stick some slack on the end”. Padding, it feels so good! You did a great estimate, and then you gave yourself room to move. Problem is, it’s not actually slack – the critical path is almost certainly still too tight, so it sits like a pot of free time that tends to get used up rapidly.

The correct mechanism to use slack is to sprinkle it through the project. You actually need more up-front, because that’s where the team are settling into the project and progress is most unpredictable. There should be lots of little pieces of slack, and then – critically! – if you use up any piece of slack, that causes an immediate and automatic change to the delivery date.

This is a great control point to revisit your project justification: if I’m 10% of the way (and 30 days) through my project, and I’ve used all 5 days of slack available at this point, it doesn’t matter if I still have another 20 days of slack in the remainder of the project – the project is way off schedule. I expected to do this work in 25 days, not 30, so I’m over by 20%; scaled across a plan of roughly 250 days of work, that means I can expect to be around 50 days late.

Lastly, agile practices. I am absolutely a subscriber to agile methodology in the right place and time, especially on those projects where the end outcome is not yet really known. It’s much easier to plan out a project where the steps are reasonably well-known; true R&D projects are nothing like that.

I still maintain that the guiding principle here is one of business value. At the point at which we don’t think we’re going to achieve value, we shut the thing down: crucial here, then, is our definition of Minimum Viable Product (if that’s the process we’re using), or the conversations with users. The first point of the agile manifesto contains the (para)phrase “deliver valuable software early”, so while we might not have a deadline for the end artefact itself (delivery when it’s ready/done), we certainly should have deadlines along the way by which point we’ve delivered some measurable value. If we don’t meet those, then again, shut the thing down.

I’m a big fan of Dan Ward’s writing since being introduced to it, and I have been particularly inspired by how he talks about deadlines. The F.I.R.E. book is excellent, and if you haven’t read the book then Dan’s “Change This” manifesto on Igniting Innovation is a good precis. I will quote this little bit, but the whole thing is a must-read:

“setting a firm deadline acts as a forcing function for creativity and helps nudge teams in new directions” – Dan Ward

Last point. I don’t think deadlines are, in themselves, a brilliant metric. As a judge of whether or not business value has been delivered, they’re exceptionally poor – anyone can ship a pile of rubbish to a deadline. Where any delivery involves significant challenge or unknowns, it’s ridiculous to treat a date as bullet-proof. But all that said, the commitment of a deadline is extremely useful, and I think using deadlines as a proxy for the likely delivery of value can be powerful.

 

Brexit confirms: storytelling is dead

This is not a post about Brexit; this is about conversations. Storytelling rose in the ’80s as a key marketing tool – phenomena like the Nescafe “Gold Blend” adverts demonstrated how the ability to tell a story could convincingly engage consumers en masse. Truth be told, this was nothing new – the “soap opera” is so called because those ongoing serial dramas used to be sponsored by soap manufacturers. But the key insight of the storytellers was that creating a story around a message you wanted to communicate (rather than simply being associated with or referenced by the story) was very powerful.

Now, Nescafe coffee had only a tangential bit-part within its famed serial adverts, and broadcasting on television is a remarkably expensive way of telling a story – so in fact the technique didn’t really start to take off until the early 2000s, with the advent of the internet. Of course, big names continued to tell stories in the way they always had – Guinness being the more modern exemplar – but now smaller organisations could do it too; they felt it built relationships with their consumers.

There is a lot to be said about discerning what is storytelling and what isn’t. Critically, a story ought to have an arc – a beginning, middle and end at least – but at a deeper level ought to have a structure which creates emotional engagement. Shakespeare was a master of the five-act structure, and most blockbuster movies to this day retain a very similar make-up. Advertisements alone do not lend themselves to that level of sophistication, but people started applying storytelling in many different areas of business – although seen as a marketing tool, it quickly leaked into sales, the boardroom, investment decks and beyond.

Many people get benefit from story-thinking without necessarily having a huge amount of structure. The process of thinking editorially about your message, and trying to frame it in the form of a story, is difficult and restricting. In a similar way to writing a tweet, the added restrictions make you think carefully about what you want to say, and it turns out these restrictions actually help rather than hinder – a message has to be much more focussed. However, those restrictions (while helpful) are not the power of storytelling – more the power of subediting / thinking (which, it seems, is less common than you’d think).

People have said before me that storytelling is dying – Berkowitz’s piece on becoming storymakers rather than tellers is well-cited. It’s a very marketing-oriented perspective, and there’s lots to agree with, but I think it’s dead wrong for digital-native organisations.

Politics is an awful lot like marketing and product development; in many ways, it actually resembles the market before software-as-a-service:

  • highly transactional nature (votes instead of money)
  • very seasonal sales periods, often years between sales
  • competitive marketplace for a commodity product
  • repeat customers very valuable, but profit function dependent on making new sales on the current product line

Not just that: ongoing engagement with the “customer” (the voter) is crucial, to ensure that the party is developing policies that it believes will be voted for. Interestingly, in blind tests, the Liberal Democrat and Green policies rate very highly – so we can see that while the product is important, market positioning is critical to ensuring customers have a specific, formed belief about your product.

Within continuous delivery thinking, the digital organisation is concerned primarily with conversations to drive the brand, rather than positional or story-oriented marketing. What was particularly interesting about the Brexit debate is that this conversational engagement was writ large across the whole Leave campaign.

Things we can note about the campaign:

  • meaningful engagement on social platforms like Facebook and Twitter. Of course, campaigns have done this before (Corbyn would be another example), but while others have been successful at deploying their message, Leave were highly successful in modifying their conversations quickly
  • a stunningly short period of campaigning. Who knows why this happened: the Scottish referendum on independence was over a period of 18 months. The UK Brexit debate was complete in 4. There was no way a campaign could hammer home messages; each thing they said had to be well-chosen and timely
  • absolute control over the conversation. While Leave conversed freely with their own supporters, they meaningfully achieved air superiority in terms of the conversation in the debate. Their messages were the ones discussed; they created the national conversation. People are shocked by how “untrue” many of their statements were: but people can recall them readily. I doubt many could recall anything Remain said other than vague threats about the economy.

The speed of the conversation here was crucial. They adapted in a truly agile fashion, and were able to execute their OODA loop significantly more quickly. In the end, it was a tight contest, but it really should not have been.

Storytelling is a blunt instrument in comparison. It’s unresponsive, it’s broadcast, and it’s not digital native. Its time is up.

Containing incestuousness

Having droned on a little the other day about duplication in Stackanetes (in hindsight, I wish I had made an “it’s turtles all the way down” type jibe), I’ve been delighted to read lots of other people spouting the same opinion – nothing quite so gratifying as confirmation bias.

Massimo has it absolutely right when he describes container scheduling as an incestuous orgy (actually, he didn’t, I just did, but I think that was roughly his point). What is most obvious is that while there is a lot of duplication, there isn’t much agreement about the hierarchy of abstraction: a number of projects have started laying claim to being the lowest level above containers.

It comes back to this: deploying PaaS (such as Cloud Foundry, which I try so hard to like but which always seems to end up disappointing) is still way too hard. Even deploying IaaS is too hard – the OpenStack distros are still a complete mess. But while the higher-level abstractions are fighting it out for attention, the people writing tools at a lower level are busy making little incremental improvements and trying to subsume new functionality – witness Docker Swarm – spreading out horizontally instead of doing one thing well and creating a platform.

I don’t think it’s going to take five years to sort out, but I also don’t think the winner is playing the game yet. Someone is going to come along and make this stuff simple, and they’re going to spread like wildfire when they do it.

Stackanetes

There’s a great demo from the recent OpenStack Summit (wish I had been there) of OpenStack itself being deployed and run on top of Kubernetes.

OpenStack is famously a massive pain to get up and running, and having it in a reasonable set of containers that might be used to deploy it by default is really interesting to see. This is available on Quay as Stackanetes, which is a pretty awful name (as are Stackenetes and Stackernetes, both of which were googlewhacks earlier today) for some great work.

I’m entirely convinced that I would never actually run anything like this in production for most conceivable workloads because there’s so much duplication going on, but for those people with a specific need to make AWS-like capability available as a service within their organisation (who are you people?!) this makes a lot of sense.

I can’t help but feel there is a large amount of simplification in this space coming, though. While “Google for everyone else” is an interesting challenge, the truth is that everyone else is nothing like Google, and most challenges people face are actually relatively mundane:

  • how do I persist storage across physical hosts in a way that is relatively ACID?
  • how do I start resilient applications and distribute their configuration appropriately?
  • how do I implement my various ongoing administrative processes?

This is why I’m a big fan of projects like Deis for businesses operating up to quite substantial levels of scale: enforcing some very specific patterns on the application, as far as possible, is vastly preferable to maintaining a platform that has to encompass large amounts of functionality to support its applications. Every additional service and configuration item is another thing to go wrong, and while each can be made pretty bullet-proof, overall you have to expect the failure rate to increase (this is just a mathematical truth).

CoreOS in many ways is such a simplification: universal adoption of cloud-config, and being opinionated about systemd and etcd, for example. And while we’re not going to go all the way back to MOSIX-like visions of cluster computing, it seems clear that many existing OS-level services are actually going to become cluster-level services by default – logging being a really obvious one – and that even at scale, OpenStack-type solutions are much more complicated than you actually want.
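For anyone who hasn’t seen it, a minimal cloud-config sketch shows how far that simplification goes – the whole host is one short declarative document (the discovery token, unit and image names below are all made up):

#cloud-config
coreos:
  etcd2:
    discovery: https://discovery.etcd.io/<token>
  units:
    - name: myapp.service
      command: start
      content: |
        [Unit]
        Description=My application container
        After=docker.service

        [Service]
        ExecStart=/usr/bin/docker run --rm --name myapp myapp/app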

 

 

Some notes on Serverless design: “macro-function oriented architecture”

Over the past couple of days I’ve been engaged in a Twitter discussion about serverless. The trigger for this was Paul Johnston’s rather excellent series of posts on his experiences with serverless, wrapped up in this decent overview.

First, what is serverless? You can go over and read Paul’s explanation; my take is that there isn’t really a great definition for this yet. Amazon’s Lambda is the canonical implementation, and as the name kind of gives away, it’s very much a function-oriented environment: there are no EC2 instances to manage or anything like that, you upload some code and that code is executed on reception of an event – then you just pay for the compute time used.

This is the “compute as a utility” concept taken more or less to its ultimate extreme: the problem that Amazon (and the others of that ilk) have in terms of provisioning sufficient compute is relatively well-known, and the price of EC2 is artificially quite high compared to where they would likely want to go: there just is not enough supply. The “end of Moore’s law” is partly to blame; we’re still building software like compute power is doubling every 18 months, and it just isn’t.

Fundamentally, efficiency is increasingly the name of the game, and in particular how to get hardware running closer to capacity. There are plenty of EC2 instances around doing very little, there are plenty doing way too much (noisy neighbour syndrome), and what Amazon have figured out is that they’re in a pretty decent place to be able to schedule this workload, so long as they can break it down into the right unit.

This is where serverless comes in. I think that’s a poor name for it: the lack of server management is a principal benefit, but it’s really a side-effect. I would probably prefer macro-function oriented architecture, as a similar but distinct practice to micro-service oriented architecture. Microservices have given rise to discovery and scheduling systems, like Zookeeper and Kubernetes, and this form of thinking is probably primarily responsible for the popularity of Docker. Breaking monolithic designs into modular services, ensuring that they are loosely coupled with well-documented network-oriented APIs, is an entirely sensible practice, and in no small part responsible for the overall success Amazon have had following the famous Bezos edict.

Macrofunction and microservice architectures share many similarities; there is a hard limit on the appropriate scale of each function or service, and the limitation of both resources and capability for each feels like a restriction but is actually a benefit: with the restrictions in place, more assumptions can be made about the behaviour and requirements of the software, and from more assumptions follow more powerful deployment practices – such as Docker. Indeed, Amazon Lambda can scale your macrofunction significantly – frankly, if you design the thing right, you don’t have to worry about scaling ever again.

However, one weakness Paul has rightly spotted is that this is early days: good practice is really yet to be defined, bad practice is inevitable and difficult to avoid, and people attempting to get the benefits now are also having to figure out the pain points.

It’s worth saying that this architecture will not be for everyone – in fact, if you don’t have some kind of request/response to hook into, frankly it won’t work at all – you’re going to find it very difficult to develop a VPN or other long-lived network service in this environment.

Many of the patterns that should be applied in this scenario are familiar to the twelve-factor aficionado. Functions should be written to be stateless, with persistent data discovered and recorded in other services; configuration is passed in externally; et cetera. Interestingly, no native code is supported – I suggest this is no surprise, given Amazon’s investment in Annapurna and their ARM server line. So, interpreted languages only.

A lot of this reminds me of the under-rated and largely unknown PHP framework, Photon. While the connection is not immediately obvious – Photon’s raison d’être is more about being able to run long-lived server processes, which is diametrically different to Lambda – the fundamental requirement to treat code as event-driven, and the resulting architecture, is very similar. In fact, it was very surprising to me that it doesn’t seem to be possible to subscribe a Lambda handler to an SQS queue – it’s possible to hack this via SNS or polling, but there is no apparent direct mechanism.

It’s difficult to disagree that this is the wave of the future: needing to manage fewer resources makes a lot of sense, and being able to forget about security updates and the like is also a major win. It also seems unlikely to me that a Lambda-oriented architecture, if developed in the right way, could ever be much more expensive than a traditional one – and it ought to be a lot less in practice.