No doubt many people will have read the story about how an error in a piece of software prevented a number of women from being invited to a routine screening. The current estimate is that this could have led to as many as 270 lives being lost or curtailed, although it will be difficult to say for some time. As an ex-CTO of a healthcare business, this is the type of problem that used to keep me awake at night – a small mistake leading to tragic results. How did this happen?

The Nature of Code

Let’s take a look at the description of the problem published so far: “a computer algorithm failure was to blame, which meant women who had just turned 70 were not sent an invitation for a final breast scan as they should have been”.

For those who don’t know, the screening programme is based on eligibility criteria. Within the NHS, care pathways are developed in conjunction with evidence from NICE about which practices are clinically effective. In this case, NICE say screening is worth doing up to the age of 70, which may seem a bit arbitrary; but there are downsides to screening (the chance of a false positive, the chance of radiation-induced cancer, and so on) which lead them to conclude it isn’t effective beyond that age. Therefore, between the ages of 50 and 70, women are offered seven screens, one every three years (there’s actually a trial currently under way which extends this to ages 47-73, but we’ll come back to that shortly).
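To make the arithmetic concrete, here’s a minimal sketch (mine, not the NHS’s – the real programme invites women in batches rather than on exact birthdays) of how a three-yearly schedule between 50 and 70 gives exactly seven invitations:

```python
# Illustrative only: a three-yearly invitation schedule starting at 50,
# with eligibility ending after age 70.
FIRST_INVITE_AGE = 50
LAST_ELIGIBLE_AGE = 70
SCREENING_INTERVAL_YEARS = 3

invite_ages = list(range(FIRST_INVITE_AGE,
                         LAST_ELIGIBLE_AGE + 1,
                         SCREENING_INTERVAL_YEARS))

print(invite_ages)       # [50, 53, 56, 59, 62, 65, 68]
print(len(invite_ages))  # 7 – the seven screens mentioned above
```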

The programmers in the audience already know what I’m going to say next. There are two conditions under which someone leaves the screening programme: when they turn older than 70, or when they complete seven scans. The software had a logic error which removed patients from the programme too soon.

If I were to bet, it would be that the test for age was wrong, and that patients were removed once they were 70 rather than 71. The news also suggests it only affected final scans, so possibly there was some combination-of-logic error; but given the scans are every three years, only a small number of 70-year-olds would become eligible for a seventh scan in any case.
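Purely to illustrate the kind of mistake I’m guessing at – I have no sight of the real code, and the function names here are invented – an off-by-one in the age check would look something like this:

```python
def still_eligible(age: int, scans_completed: int) -> bool:
    # The intended rule: women are screened up to and including age 70,
    # for a maximum of seven scans.
    return age <= 70 and scans_completed < 7

def still_eligible_buggy(age: int, scans_completed: int) -> bool:
    # The hypothesised off-by-one: a strict "< 70" removes women who have
    # just turned 70 before their final invitation goes out.
    return age < 70 and scans_completed < 7

# A woman who has just turned 70 after six scans:
print(still_eligible(70, 6))        # True  – she should get a final invitation
print(still_eligible_buggy(70, 6))  # False – silently dropped from the programme
```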

For the non-programmers in the audience: this is about the simplest type of bug you can get in software, technically. It’s obvious and easy to fix, given the right conditions. However, it’s insidious and pervasive, and computer “science” still doesn’t have adequate tools to routinely prevent such errors – so we rely a lot on automated testing to detect them.
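For example, a couple of boundary tests around the intended rule – again a hypothetical sketch, reusing the invented still_eligible function from above rather than anything from the real system – would flag that kind of off-by-one the moment they were run:

```python
import unittest

def still_eligible(age: int, scans_completed: int) -> bool:
    # The intended rule from the sketch above: screened up to and
    # including age 70, for a maximum of seven scans.
    return age <= 70 and scans_completed < 7

class EligibilityBoundaryTests(unittest.TestCase):
    # Boundary tests are the routine way off-by-one errors get caught.
    def test_just_turned_70_is_still_invited(self):
        self.assertTrue(still_eligible(age=70, scans_completed=6))

    def test_71_is_no_longer_invited(self):
        self.assertFalse(still_eligible(age=71, scans_completed=6))

    def test_seventh_scan_is_the_last(self):
        self.assertFalse(still_eligible(age=68, scans_completed=7))

if __name__ == "__main__":
    unittest.main()
```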

UPDATE: It turns out that guess was wrong, and what went wrong was more insidious. The Guardian article offers more details: as part of the trial to extend the age range, the software cancelled the final scan and put patients on either the extended schedule or the normal one. Essentially, an A/B test, so that clinicians could detect any difference in outcomes. Cancelling the final scan makes a weird kind of sense – you don’t want to tell people on the extended version that they’re getting a final screen when it isn’t their last – but they apparently failed to re-book patients on the “control”, or standard, scheme. This is a more complex logic bug, but a logic bug all the same, and the rest of my commentary applies.
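Piecing that together from the reporting, my reconstruction of the failure looks something like the sketch below – entirely hypothetical, with invented function names – where the branch that should re-book the control group simply isn’t there:

```python
import random

def cancel_final_scan_at_70(patient):   # stub for illustration
    patient["final_scan_booked"] = False

def book_extended_screening(patient):   # stub for illustration
    patient["arm"], patient["final_scan_booked"] = "extended", True

def book_standard_final_scan(patient):  # stub for illustration
    patient["arm"], patient["final_scan_booked"] = "control", True

def randomise_for_age_extension_trial(patient):
    # Both arms have the existing "final scan at 70" appointment cancelled
    # first, so nobody is wrongly told that a screen is their last one.
    cancel_final_scan_at_70(patient)
    if random.random() < 0.5:
        book_extended_screening(patient)
    else:
        # Control arm: should be re-booked onto the standard programme,
        # which still ends with a scan at around 70. The reported failure
        # is that this step never happened, so these women were never
        # re-invited at all.
        pass  # book_standard_final_scan(patient) is missing
```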

Ethics in development

Virtually every computer science graduate is familiar with the fateful Therac-25. Short story: a radiation therapy machine, the Therac-25 was used to deliver radiation doses to cancer sufferers. In the mid-eighties, a handful of patients were killed through over-exposure, and software errors were the root cause.

The fault in the Therac-25 case was substantially more complex than the one here and much more difficult to reproduce, yet it affected far fewer people. As students, we study this case because it demonstrates how our field of engineering is inescapably bound up with ethical concerns – about how we design and develop software, and fundamentally about how we protect human life by taking an appropriate approach when creating it.

Many people recognise applications of software where an error could lead to a loss of life: aeroplane autopilots are a good case in point. Over time, we have improved regulation and raised safety expectations of such products to ensure they fail very rarely.

Increasingly, all sorts of equipment – including that used in healthcare – is computerised, and therefore should be treated in a similar way. In the EU, most of what is considered “reasonable” in this field is now codified in law – it’s called the Medical Devices legislation (I only mention the EU system as it’s the one I’m familiar with; most jurisdictions have an equivalent).

We won’t know how many people this actually affected until everyone has been contacted. Jeremy Hunt seems to have a predisposition to make the NHS appear worse than it is, for reasons I don’t understand, but equally it is reasonable to give the upper bound where possible – better to deliver good news later than yet more bad.

Looking at the numbers on the NICE page I linked previously, the relevant cancer detection rate is around 14.5 per 1,000 – the detection rate at age 70 is roughly double the rate at 50. I suspect the calculation behind the 270 figure is complicated by various factors – for one, there’s a smaller evidence base for partial screens. The overall statistic is one death prevented per 235 women invited to screening; however, natural mortality at age 70 weighs heavily, reducing that from what would otherwise be a headline figure of around 2,000 affected.
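As a back-of-envelope check – and assuming the figure of roughly 450,000 missed invitations that was reported at the time, which is my assumption and isn’t stated above – the arithmetic looks something like this:

```python
# Back-of-envelope only. The 450,000 missed-invitation figure is an
# assumption from contemporary reporting; the "1 death prevented per 235
# invited" statistic comes from the NICE evidence discussed above.
missed_invitations = 450_000
deaths_prevented_per_invite = 1 / 235

naive_estimate = missed_invitations * deaths_prevented_per_invite
print(round(naive_estimate))      # ~1,915 – the "headline of 2,000"

# The official upper bound of 270 implies a large discount on that naive
# figure, presumably where competing mortality at age 70+ and the weaker
# evidence for partial screens come in.
implied_discount = 270 / naive_estimate
print(f"{implied_discount:.0%}")  # roughly 14% of the naive figure
```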

Meta-medical Devices

Unfortunately, it’s highly unlikely that the software that went wrong here was a medical device. There are a bunch of different exclusions that mean an application can be exempt from the legislation. For example, using Microsoft Excel in a medical environment is permitted, but it’s general-purpose software and certainly not a medical device.

There’s a bigger get-out clause for software used purely for administrative purposes.

We don’t know what application went wrong here, or who wrote it. Conceivably, you could write such a system in Microsoft Access, and it would work fine – except for logical coding errors like the one that occurred.

Medical device legislation is complex, and the compliance aspect of it is prohibitively expensive for any sophisticated application. Therefore, most people try to avoid it, unless they absolutely need to go through it.

If the software had been a medical device, would it have gone wrong in this way? Who knows; I would guess yes, but I don’t think this regulation is the appropriate answer.

What we need is a new classification of software, one which would encompass applications used for communications and administration. The requirement should be that software must not be able to harm patients either directly or indirectly, through action or inaction. We don’t need the medical devices sledgehammer, but equally, some kind of quality process should be evident.

The problem here isn’t that there was a bug, per se. Bugs are all but inevitable, even simple ones like this. What is far less excusable is that this problem wasn’t detected for almost ten years. As we saw in the Therac-25 case, where there is potential for harm there should be independent checks to ensure that harm doesn’t occur.

This should be yet another warning to developers in our industry to think deeply about the effects their decisions have when building software. It should also be a stark reminder to business leaders that sometimes you need to put significant effort into assuring that software works correctly.