Software architecture is failing

I doubt there has ever been a time when software architecture was seen as a raging success. The “three-tier architecture” of the web has held up extremely well and is an excellent place for many people to start. The “12 Factor App” approach has encouraged developers to adopt practices that make deployment and scaling much simpler. Over the last couple of years, though, I’ve noticed developers advocating for architectures I consider extreme and limited in utility, foisting highly complex systems onto startup environments at great cost. It appears to me to be getting worse.

First, I should say I’m going to talk about a few specific patterns and technologies in this post. I’m not against any of them; I don’t consider any of them to be bad ideas. I do think that some of them are limited in their applicability, though. I don’t believe in silver bullets: architecture is only “good” in the context of the problem it tries to solve, and it’s more important to design for adaptability (I’ve called this Pokémon Architecture in the past).

Through various conversations with technical teams and business owners over the last year (none of whom I’m going to name!), I keep hearing about the same types of problems again and again. “We’re not delivering quickly enough!” “Our systems are too complex to maintain!” “The application we delivered last year is completely legacy now, but it’s too difficult to replace!”

When talking about how technical teams make decisions, I often see a complete lack of understanding of how to relate the business issues their organisation faces to the technical strategy. Regularly, I don’t see a technical strategy at all. The team may have made a decision like “We’re 100% microservices!”, but when I ask why, they cannot give a good reason that relates directly to the business.

Mistakes I’m seeing

To give a flavour of some specific problems I’ve seen:

  • CQRS. I think this is the one I’m noticing most at the moment, because it’s everywhere right now. I saw an article on Twitter the other day about “How to develop your MVP with CQRS”. For those who don’t know, CQRS is essentially an RPC pattern that enables you to deal with highly complex and volatile data stores – most good introductory material on it will say “This is appropriate for large/complex systems”. (There’s a minimal sketch of the pattern after this list.)
    A lot of dewy-eyed write-ups have proclaimed it a good solution for dealing with transactions across microservices (it’s not, really), but most of the push I see for it comes from developers who don’t like ORMs.
    This latter category is particularly pernicious – the end result is a huge increase in the amount of code being written, a decrease in the DRYness of the code, and a system that is much more difficult to reason about, with no obvious benefits to offset these problems.
    I’ve seen a few systems now using a CQRS approach where a standard CRUD approach works fine. “Why?”, I ask. They burble about “responsibility” or “different domains”, but I haven’t yet had the “we tried both approaches and this one works better because <business reason>” response.
  • Event Sourcing. Related to the above, I guess, as they are often used in combination. Event Sourcing says you keep an immutable log of events and use that log to create an eventually-consistent view of your application – rather than saving state in an RDBMS or something. (Again, see the sketch after this list.)
    This is a classic “we didn’t consider the business requirements” type of technical choice. I’ve seen two different start-ups now that hold personal data about customers in their “immutable log”. “How are you planning to handle GDPR requirements and removal of data?” – turns out the answer is often “Er – we haven’t thought about that.” Cue a sad face when I tell them that if they don’t modify their immutable log, they’re automatically out of compliance.
  • Local storage. A tech team decided that everything needed to be a Single-Page Application on the front-end, and needed to handle offline capability – pretty reasonable. However, a good chunk of the data they were processing was extremely sensitive, and by running everything through their SPA framework they were effectively distributing that sensitive content across a variety of browser caches. Whoops. A difficult decision to reverse at a late stage. (Also sketched below.)
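
For readers who haven’t met the pattern, here’s roughly what the command/query split looks like in practice – a minimal TypeScript sketch with hypothetical names, not taken from any particular framework. The thing to notice is that even this trivial example needs two models and two code paths where a CRUD application needs one:

```typescript
// Write side: commands express intent and pass through domain logic.
interface RenameProduct {
  type: "RenameProduct";
  productId: string;
  newName: string;
}

class ProductCommandHandler {
  constructor(private products: Map<string, { name: string }>) {}

  handle(cmd: RenameProduct): void {
    const product = this.products.get(cmd.productId);
    if (!product) throw new Error(`unknown product ${cmd.productId}`);
    product.name = cmd.newName; // validation and domain rules live here
  }
}

// Read side: a separate, denormalised model shaped purely for display,
// kept up to date by some projection mechanism (not shown).
interface ProductListItem {
  id: string;
  name: string;
}

class ProductQueries {
  constructor(private readModel: ProductListItem[]) {}

  list(): ProductListItem[] {
    return this.readModel; // no domain logic on this side, just the view
  }
}
```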
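
Event Sourcing is similarly easy to sketch and easy to underestimate. Something like the following (again with hypothetical names; a real system would use an event store rather than an in-memory array) shows both the appeal and the GDPR trap described above – the customer’s email address is in the log forever:

```typescript
// State is never stored directly: it is a fold over an append-only log.
type CustomerEvent =
  | { type: "CustomerRegistered"; customerId: string; email: string }
  | { type: "EmailChanged"; customerId: string; email: string };

const log: CustomerEvent[] = []; // "immutable": you only ever append

function append(event: CustomerEvent): void {
  log.push(event);
}

// The current view is derived by replaying every event from the start.
function currentEmail(customerId: string): string | undefined {
  let email: string | undefined;
  for (const event of log) {
    if (event.customerId === customerId) email = event.email;
  }
  return email;
}

// Note the PII baked into the log above: a GDPR erasure request now
// means rewriting history, or a scheme such as per-customer encryption
// keys that can be destroyed. The immutability is itself the obstacle.
```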
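
And for the local-storage problem, the gap between the safe and unsafe versions is a one-line choice that is easy to make without thinking. A rough browser-side sketch, with made-up endpoint names:

```typescript
// An in-memory cache disappears when the tab closes; localStorage and
// the HTTP cache persist sensitive responses to disk on every machine
// the user ever logs in from.
const sessionCache = new Map<string, unknown>();

async function loadRecord(id: string): Promise<unknown> {
  if (sessionCache.has(id)) return sessionCache.get(id);

  const response = await fetch(`/api/records/${id}`, {
    cache: "no-store", // ask the browser not to write this to its HTTP cache
  });
  const record = await response.json();

  sessionCache.set(id, record); // NOT localStorage.setItem(...), which persists
  return record;
}
```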

More generally, I see little re-use of existing technology and/or a desire to use a lot of “new shiny” for very specious reasons. The number of developers working at an inappropriately low level is frightening.

What are we doing wrong?

I place the blame on technical leaders like myself. Those of us in tech who are not working at Facebook/Google/Amazon are simply not talking enough about what systems at smaller enterprises look like. We’re not talking about what is successful, what works well, and what patterns others might like to copy.

A lot of technical write-ups focus on scaling, performance, and large-scale systems. It’s definitely interesting to see what problems Netflix have, and how they respond to them. It’s important to understand why Google take decisions in the way they do. However, most of their problems don’t apply to anyone else, and so many of their solutions may not be appropriate elsewhere.

In this view of the world, it’s easy to see why teams focus on the solutions these companies talk about. They want to get ahead of the scaling curve. They want to deliver more robust services. They don’t want legacy.

I recall one specific conversation with an engineer about one of their core services. “It’s totally legacy, and no-one maintains it – it just sits there working, except for the occasions it doesn’t. The problem is that replacing it is so hard: it’s got great performance, and the business doesn’t want to spend time replacing something that works.” This is the problem with being ahead of the curve – the definition of “success” (it works great, it’s reliable, it’s performant, we don’t need to think about it) looks a hell of a lot like the definition of “legacy”.

A well-articulated technical vision should match the business vision directly, especially in a tech company. There are core parts of the technology that deliver most of the value and differentiation, and these are important to get right. There’s then usually a bunch of other software and services that is more like scaffolding: you have it around in order to get stuff done.

What should we do?

I think we’re often getting the build/buy decision wrong. Software development should be the tool of last resort: “we’re building this because it doesn’t exist in the form we need it”. I want to hear from more tech leaders about how they solved a problem without building the software, and tactics for avoiding development.

I think we worry too much about the future. We say “You aren’t gonna need it!” but we don’t live the YAGNI values. We talk too much about technical debt, and in the meantime complain that agile is “failing”. I want to hear more about projects that deferred decisions and put off architecting until much later in the process.

I want to hear more about delivery at real speed. Small pieces of software that are not necessarily interesting but deliver business value are the real heroes in our industry, and the developers who create them the real stars. To paraphrase Dan Ward, we can only deliver at high pace if we get started early – which means re-using existing work. “Mash-up” shouldn’t be a dirty hackfest concept.

I especially want to hear more about developers working with systems that have constraints. Too often I hear “We were using <product X|framework Y> and it had this one specific problem, so we solved it by rolling our own!”. Well, that’s great – and it might be the right choice. But it’s not the only one – it’s the proverbial sledgehammer to crack a nut.

I believe that if you don’t encounter the constraints of the layers underneath you, you’re not working at a high enough level. Developers should occasionally bump into obvious stupid stuff that the layer beneath causes. I want to hear from people pushing standard stuff beyond its limits. I think we grossly underestimate what off-the-shelf systems can do, and grossly overestimate the capabilities of the things we develop ourselves.

It’s time to talk much more about real-world, practical, medium-enterprise software architecture.

Postscript: you may be interested in reading about some of the reaction this post generated.


3 Comments

  1. The problem is not in these particular patterns, and it has very little to do with architecture. Architecture is a combination of higher-level governance, a set of non-functional requirements, and the power of teaching/mentoring/helping people. CQRS and Event Sourcing are not architecture.

    Cargo culting, on the other hand, is a problem, among both developers and architects. The “new tech” trend is endless and CV-driven development thrives. However, as Alberto Brandolini said, if developers do not have business challenges, they will entertain their brains with technological challenges.

    Developers keep working in a solution space, and they are fed “requirements” coming from elsewhere, leaving them almost no room to influence how systems work in business terms. The only choice they are left with is playing with tech. This could be new frameworks, new tools, or new (or newly discovered) patterns.

    The failure is not in architecture per se, but in the way business sees developers and vice versa. I actually want to deliver this message more often at conferences, but, apparently, tech conferences are more interested in selling “hype” tech.

    As for event sourcing and CQRS, I think the “you are not Google” argument is a straw man. You don’t need to be Google to use these patterns. Quite the opposite: when you know little about your domain, using events is a huge benefit. Even looking at your past event schemas can help you see how your domain knowledge has developed. In what you call “CRUD works fine” systems you only have state schemas, which have a tendency to be expanded based on current needs, leading to SRP violations and broken DRY at the database level. From there it is a straight road to hell (sorry, I should be saying “big ball of mud”). I am always surprised when people say “let’s get back to the old ways”, as if we do not see that virtually every system built in past years has become a big ball of mud. It is very naive to think this was because developers were any different then from who they are today. Same people, best intentions, ignorance of the business and the future of the systems they build, that’s all. Why do you think this will work better now?

  2. Niklas Lochschmidt

    Interesting read. I have worked on a CQRS/ES system for a couple of years. I wholeheartedly agree with you that the amount of software needed should be kept as small as possible and especially building custom frameworks should never be an immediate choice.

    However, I wonder which CQRS/ES projects you worked on, because your description “CQRS is essentially an RPC pattern that enables you to deal with highly complex and volatile data stores” does not seem right to me. As Martin Fowler writes[1], “At its heart is the notion that you can use a different model to update information than the model you use to read information”. He also immediately follows up with a warning: “beware that for most systems CQRS adds risky complexity”. RPC may be used together with CQRS, but not necessarily. The system might just as well consume files and answer GraphQL queries on the read side. It’s about building different models for the command and query sides. Agreed, if you don’t need different models, CQRS adds a lot of unnecessary complication, but please don’t add to the confusion around it (other people have proclaimed that CQRS is a framework. It’s not; it’s an architectural pattern).

    We actually had very distinct business reasons for using ES. We needed a complete audit log of the core system, and we wanted to be able to analyze certain metrics from the event stream after the fact. We made use of that quite often, and we also carried out multiple very successful system refactorings that would probably have been much harder in a purely state-driven system. It is really a different thing when you know you have captured the intent of the user instead of only the result of her actions. That said, I would say it only ever makes sense in your core domain, the place where you really want to know what is going on. Dealing with GDPR is an issue if not taken into account from the start; however, so is dealing with requests for erasure when you have database backups and WAL archives containing that information, correct?

    Again, I agree with the notion but please care for the details 😉

    [1]: https://www.martinfowler.com/bliki/CQRS.html

  3. Alex

    That’s a great, considered comment – thank you.

    I was trying to sum up CQRS in a simple way for non-developers: it felt to me that focussing on the command style of the pattern was more important, hence likening it to RPC. It’s clearly very different to REST, where you operate on resources rather than issue commands per se. But I take your point; I have examples of other patterns being problematic as well, and this example was CQRS simply because I’ve seen a lot of people trying CQRS (and failing, sadly) recently.

    I’m going to write up my experiences with both CQRS and ES another time, because I think that’s a separate conversation, and a lot more difficult to get the tone right. To me, it’s a little like the discussion about functional languages versus procedural ones: there are clear technical benefits to a LISP-like language, and people comfortable in that environment are extremely productive. But it’s not for everyone (or maybe even the majority) – and I think it’s interesting to explore why, because it’s not a fault of the tech.

    Totally agree about ES. A great example from my past would be using it for medical records: a radiology report, for example, is a direct example of event sourcing, even though such reports were originally done on paper. For time-oriented processing it’s obviously a reasonable choice. However, I would still argue that you need a good reason to choose a system like that, and I think a lot of developers have trouble reasoning about eventual consistency (in all its forms – from ES to AWS DynamoDB).

    GDPR probably also deserves a post on its own. Backups / WAL archives are an issue, but a different kind of issue to be sure – it’s a very complex subject, unfortunately. The issue I was identifying specifically was retaining PII in a live system, even if “beyond use” from a software perspective.
