Azure has never been the #1 cloud provider - that spot continues to belong to AWS, which is the category leader. However, in most people’s minds, it has been a pretty reasonable #2, and while it is not vastly differentiated from AWS, there is enough to write home about.

However, even as a user and somewhat of a fan of the Azure technology, I’m finding it increasingly difficult to recommend.

Now, I’m not Gartner (who, as of July 2021, rated Azure as a category “Leader” with only AWS and Google as competition), so who really cares what I think? But, as a CTO who is constantly evaluating technology like this, it’s worth talking about the things Azure is currently getting wrong.

I use Azure in production with a number of clients, for a variety of purposes. The user authentication story has always been particularly strong, as has the ability to run Windows desktop and server workloads (if that’s your thing - which, for some, it still is). Azure is sometimes cheaper than AWS, sometimes offers more/better things, and in some areas (like CosmosDB) is genuinely quite a different and intriguing offering.

I’m also a huge fan of Azure devrel. I’ve posted to Twitter about a variety of things - from Code Functions to Static Websites - and inevitably, there are engaging and knowledgeable people who pop up from Azure who know what they’re talking about. Lots of Azure code is in the open, on GitHub, and it’s typically easy to peek under the covers to see what’s up.

So, lots of positive things to say. But, there are increasing negatives, and it’s worth calling them out specifically.

Security

Yeah, you knew I was going to go there. In the last few weeks, Azure has suffered some truly catastrophic security incidents:

  • CosmosDB’s Jupyter integration turned into a remote “access my database” vuln which was quickly addressed but affected a number of unlucky users. It doesn’t look like this was taken advantage of in the wild, but in principle allowed complete DB access.
  • Azure Container Instances were found to have a vulnerability that allowed users to access other customers’ information within ACI. Again, a limited number of customers involved, and no detection of abuse in the wild.
  • Azure VM management extensions were discovered to have a trivial authentication bypass vulnerability which affected all Linux compute VMs with the extensions installed. This is actually a set of problems, one of which alone is a remote code execution vuln (i.e. the worst sort) that scored 9.8 out of 10 on the CVSS 3.0 scale.

I think it’s great that Microsoft have been open and transparent about this - that gives me some confidence. And it’s probably not an accident that these vulnerabilities have turned up in the almost-month since Azure announced a new Security Research Challenge with associated bug bounties worth up to $60k USD (also a good move, well done Microsoft).

However, these vulnerabilities are dreadful. Azure is a managed cloud service, conformant to ISO 27001/27017, with SOC 1, 2, and 3 attestations in place, plus a boatload of other certificates too numerous to mention. Each of these pieces of paper shows Azure’s information security management system controls to be appropriate and effective.

No system is bug-free or secure. However, I do not expect that dropping an authentication header from a request bypasses the security on a management interface. I really hope some soul-searching is going on here, because Microsoft’s reputation for security has been much better over the last few years and is again at risk.

Continuous Delivery

I mention this because it’s a pet peeve. We’re all talking about continuous deployment, and infrastructure-as-code, and Azure has some interesting tools to enable this - personally, I use Terraform, but each to their own.

Underneath, to use Azure, you typically need to use the az command line utility. This authenticates you to Azure, and allows you management access to practically all their APIs. It’s a wonderful, functional tool, written in Python and provided as open source.

However, if you want to install this command - beware! It’s an absolute monster, weighing in at over a gigabyte in its current incarnation. This problem has been known about since a bug was raised in 2018, and when I was using it back then it was nowhere near as bad - maybe a few hundred meg at that stage.

The root cause is the Azure Python APIs, which are horrendously bloated. Microsoft’s backward compatibility is legendary, of course, but what has happened in the Python API is that each incompatible change has caused an in-API code fork to occur - exploring the repo is an Inception-like experience, with each subdirectory looking much the same as the others, all alike.

Plus, in order to support these old APIs, Microsoft has taken to packaging an entire Python runtime with the utility, to ensure it runs correctly. And all the Python bytecode cache. The only thing missing is the kitchen sink.

So, why is this a problem? Like many of you, I run most of my software builds in a container-based pipeline - typically Azure DevOps or GitLab. I like small, trim container images and fast build times. Pulling down 1GB of Python to be able to az login and then perform some basic API calls - it’s a criminal waste of time, storage, and the rest. The cost of this storage can sometimes be paid with every pipeline trigger, unless you have static workers with a consistent docker cache.

The simple fact is az-cli is virtually unusable in a Docker-like scenario, it is only getting worse, and unless something changes it will be impossible to use in a year or so. What then? Call the APIs manually ourselves? Write a custom drop-in az replacement for the few commands we need to run? Come on Azure. Do better.
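To give a feel for what "calling the APIs manually" might involve, here is a minimal sketch using only the Python standard library. It assumes a service principal (tenant ID, client ID, and client secret) already exists, uses the OAuth2 client-credentials flow against the Microsoft identity platform, and then lists resource groups through the Azure Resource Manager REST API. The function names and the `api_version` default are illustrative choices, not anything az itself uses.

```python
import json
import urllib.parse
import urllib.request

LOGIN_HOST = "https://login.microsoftonline.com"
ARM_HOST = "https://management.azure.com"


def build_token_request(tenant_id, client_id, client_secret):
    """Build the client-credentials token request for the ARM scope."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{ARM_HOST}/.default",
    }).encode()
    url = f"{LOGIN_HOST}/{tenant_id}/oauth2/v2.0/token"
    return urllib.request.Request(url, data=body, method="POST")


def build_list_resource_groups_request(subscription_id, token,
                                       api_version="2021-04-01"):
    """Build a GET request listing resource groups in a subscription.

    The api_version default is an assumption; check the ARM docs for
    the version you want to pin.
    """
    url = (f"{ARM_HOST}/subscriptions/{subscription_id}"
           f"/resourcegroups?api-version={api_version}")
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})


def send_json(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In a pipeline this would be used roughly as `token = send_json(build_token_request(...))["access_token"]` followed by `send_json(build_list_resource_groups_request(subscription_id, token))`. It is a few dozen lines with no dependencies, but it covers only a sliver of what az does, which is rather the point of the complaint.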

Cost of Compute

Every cloud provider has their expensive “thing”. Ingress is always cheap, egress always expensive. AWS has their “Managed NAT Gateway”, after all, the sit-in-the-corner money printer that never fails.

But truly, I think nothing really compares to the Azure App Service Plan. This is a fancy name for a managed VM, which will typically run some containers or other packaged workload for you in a pretty automatic fashion.

Generously, Azure have a great free plan, and they also make developer-spec models available cheaply - no complaints here. A 1GiB Plan can be had for free if you have no expectations, and if you want a shared small environment for dev/test in the UK you’re paying £7/month. Super.

OK, I’ve developed my app, ready to push the MVP to production. I just want a small production-ready plan - maybe 1.75GiB of RAM to start with, and a bit of local storage. What’s the price? £70/month. An order of magnitude greater, what?!

It gets worse. The instance above has only one core, it’s an S1. For a two core machine with 3.5GiB RAM (not a massive spec…) you’re now talking £136/month. Wow. We’ve got an annual cost of £1,600 and we haven’t even added any persistent storage or a database or anything else!

If I want a basic Linux VM and am willing to type docker run.. on my own, I can get a similar machine spec (B2ms) for £25/month. So I’m paying 20% for the compute, and 80% for the management overhead.
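For anyone who wants to check the sums, the figures above can be worked through in a few lines. The prices are the monthly GBP figures quoted in the text; nothing else is assumed.

```python
# Monthly prices in GBP, as quoted in the text above.
TWO_CORE_PLAN = 136   # App Service Plan, two cores, 3.5GiB RAM
B2MS_VM = 25          # plain Linux VM of similar spec

annual_plan_cost = TWO_CORE_PLAN * 12
compute_share = B2MS_VM / TWO_CORE_PLAN

print(f"Annual cost of the two-core plan: £{annual_plan_cost:,}")
print(f"Compute share: {compute_share:.0%}, "
      f"management premium: {1 - compute_share:.0%}")
# → Annual cost of the two-core plan: £1,632
# → Compute share: 18%, management premium: 82%
```

So the 20/80 split is, if anything, slightly generous to the App Service Plan.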

Yes, I realise it’s not an apples-to-apples comparison, but mostly I’m not going to want to pay for the App Service Plan. Similar comments can be made about the managed database, or the managed gateways.

What’s the competition doing? Well, an AWS ECS instance is basically just the cost of an EC2 instance. A t3.medium in London (2 vCPU, 4GiB RAM, EBS storage) is $0.0472/hr on demand as I write. Same spec for an Azure Linux ACI instance is $0.11364/hr. Over a month that is roughly $83 on Azure against about $34 on AWS, so I’m going to be paying close to $50/month more for the same thing.
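As a rough check on those hourly rates, here is the conversion to monthly cost. The 730-hour month (365 days x 24 hours / 12) is an assumption for illustration; actual billing depends on how many hours the instance really runs.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month

# On-demand hourly prices in USD, as quoted in the text above.
T3_MEDIUM_HOURLY = 0.0472   # AWS t3.medium, London
ACI_HOURLY = 0.11364        # Azure Linux ACI, same spec

aws_monthly = T3_MEDIUM_HOURLY * HOURS_PER_MONTH
azure_monthly = ACI_HOURLY * HOURS_PER_MONTH
difference = azure_monthly - aws_monthly

print(f"AWS:   ${aws_monthly:.2f}/month")
print(f"Azure: ${azure_monthly:.2f}/month")
print(f"Extra: ${difference:.2f}/month "
      f"({ACI_HOURLY / T3_MEDIUM_HOURLY:.1f}x the AWS price)")
```

On these numbers the Azure instance costs a little over $80 a month in total, roughly two and a half times the AWS figure.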

These little differences all add up very quickly.

Outages

Ugh, the dreaded O-word. Everyone hates downtime. And pretty much every cloud provider has downtime at some stage - it’s always DNS, right? You hope that the downtime is a single group of machines, or maybe an availability zone. The worst case scenario is some global-event system downtime that affects all services of a type.

On March 15 2021, Microsoft had a 14-hour outage that not only affected Azure, but hit Office, Teams, and Xbox Live - to name but a few services. The cause was simple: they were updating encryption keys within the authentication system, and the single key marked “don’t rotate” (i.e. leave it alone - picture the proverbial Post-It note over a lightswitch with “Don’t touch!” written on it) was unfortunately rotated.

Then again in April, DNS failed across multiple Azure regions, which led to a knock-on problem (without DNS, you usually can’t access services by name, so a DNS outage is as good as a complete outage).

Now, AWS are no paragons of virtue here. Their status board is notoriously a lovely string of green lights even during the most severe of downtimes. However, typically, their blast radius is better contained - they do not tend to experience events with such broad reach.

However, when you’re selling Azure to a client, it’s typically on the basis of security, reliability, etc. “Microsoft can run this stuff better than your staff can”, effectively. When significant downtime is encountered, it reflects poorly on Azure and undermines the claim that cloud is better - typically, there is no decent plan B (and I don’t buy the whole “multicloud” thing).

Conclusion

I’ll still continue to use Azure, but the proposition is becoming weaker. Are Microsoft attempting to add too many products to keep up with AWS, or releasing things before they’re ready? I don’t know. Azure often has a feeling of being held together with a lot more sticky tape behind the scenes than I would want to know about.

I experience a lot of random errors on Azure. Code Functions will randomly fail to trigger until redeployed, and builds pushed into Oryx will not always successfully complete. There isn’t the feeling of robustness with it that you get elsewhere. The portal is also a pain to use, but then AWS is no great shakes there either.

I would love to see Microsoft take some time out for some serious engineering on both security and availability. I would love to see Azure live up to the developer promise, and for all the use cases described in the documentation to consistently work. What are you going to do, Microsoft?