Everyone wants to go faster; this has always been true, and it is especially true of LLM deployment. Teams using AI in development want it for more than writing code quickly but, frankly, AI has not yet shown that it is good at many of the other tasks. Teams therefore have to decide where human attention buys the most safety and quality for the least overhead and the least lost speed.

We know that AI does not make all teams faster, especially over the longer term. Better baseline software development practices appear to make for stronger foundations when deploying AI. However, it is probably also not true that improving the software development life cycle (SDLC) as performed by humans automatically improves results with AI.

I am not trying to settle a philosophy debate about AI autonomy. The goal here is narrower: how to place human decision points in an SDLC that now includes AI-generated analysis, code, and tests.

A quick definition baseline

Before talking about process, it helps to define the modes clearly:

  1. Human in the loop: AI proposes, a person approves or edits before merge/deploy.
  2. Human on the loop: AI runs mostly unattended, with humans monitoring signals and intervening on exceptions.
  3. Human out of the loop: AI takes actions end-to-end with no required human checkpoints.

Most teams are trying out 1, but lots of people are talking about 3. There is considerable fear of missing out here - no one wants to be behind on AI deployment. At times, the discourse implies we are no longer developers but operators of software factories, each with dozens of autonomous agents.

It strikes me that the software-factory vision, at least for now, is a pipe dream. Remember The 4-Hour Workweek? For a while back in the 2010s it was all the rage, and anyone who was not outsourcing their day-to-day work to contractors on Fiverr was supposedly missing out on a brave new world.

In this discussion, I intend to focus on the real outcomes instead. Where is performance now, where will it be, and where should teams be investing? How do we work out what experiments are worth running?

Human-in-the-loop vs human-out-of-the-loop in practice

The real fault line today is not whether AI can generate useful output. It is the division of labour between people and systems; this is where the most debate is to be found.

In human-in-the-loop (HITL) setups, developers remain deeply involved. AI assists with drafting code, generating tests, summarising logs, and suggesting refactors, but engineers still decide what ships. This is how most teams currently work with assistants such as Claude Code: delivery is faster, but human review and judgement remain mandatory.

If there is a downside to HITL, it is that we are simply substituting AI for the person on each task. We ask models and agents to perform the same work, in the same way, without really considering whether they are capable of it - and, unfortunately, performance is variable.

Human-out-of-the-loop visions are more radical. In this model, autonomous agents would take a project from objective to delivery with minimal intervention, while people provide high-level goals or final oversight only.

The out-of-the-loop direction is actively researched, and some entrepreneurs present compelling demos. But there is no credible evidence that fully autonomous, end-to-end software delivery is reliable in ordinary production contexts today. Even forward-leaning papers generally keep humans in critical roles.

So I think the initial conclusion here is relatively straightforward: we cannot simply drop AI into existing SDLC processes and expect radical improvement. We have to think about how it works, where it adds value, and redesign our processes accordingly. However, neither can we simply attempt to hop directly to a fully autonomous process: it is a nice vision, but that is not what today’s tools are capable of.

Why this is an SDLC design question, not just a tooling choice

Teams often start by evaluating model quality. That matters, but SDLC outcomes usually fail on control and process design, not raw generation quality. The common failure pattern looks like this:

  1. AI boosts throughput in experiments that feature low-risk work.
  2. The same pipeline is promoted and used for higher-risk changes.
  3. Review surfaces are too shallow to catch subtle regressions.
  4. Errors accumulate at an increasing rate, and the result is effectively a new kind of technical debt.

The issue is not that AI wrote the code. The issue is that teams often optimise for initial output and underinvest in the long-term maintenance burden.

How to decide what is right for your team

A useful heuristic is to map each SDLC stage against four factors:

  1. Blast radius: If this step fails, what breaks and how widely?
  2. Detectability: How quickly will you know something is wrong?
  3. Reversibility: Can you safely roll back in minutes?
  4. Explainability: Can the team explain why the AI made this change?

If blast radius is high, detectability is poor, reversibility is slow, or explainability is weak, keep a human checkpoint. If all four are favourable, you can often move to on-the-loop monitoring. I think this is the key reason why teams that are already high performers find AI to be a strong net accelerator: their existing practices tend to contain blast radius, surface problems quickly, and make rollback cheap.
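The four-factor check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed tool: the names `StageAssessment`, `Oversight`, and `recommend_oversight` are hypothetical, and reducing each factor to a yes/no answer is a simplification that a real team would probably replace with graded scores and its own judgement.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    IN_THE_LOOP = "human approves before merge/deploy"
    ON_THE_LOOP = "human monitors and intervenes on exceptions"


@dataclass(frozen=True)
class StageAssessment:
    """Hypothetical yes/no answers to the four questions for one SDLC stage."""
    name: str
    blast_radius_contained: bool  # a failure here stays local
    quickly_detectable: bool      # you would soon know something is wrong
    quickly_reversible: bool      # a safe rollback takes minutes
    explainable: bool             # the team can explain why the AI made the change


def recommend_oversight(stage: StageAssessment) -> Oversight:
    """Keep a human checkpoint unless all four factors are favourable."""
    all_favourable = (
        stage.blast_radius_contained
        and stage.quickly_detectable
        and stage.quickly_reversible
        and stage.explainable
    )
    return Oversight.ON_THE_LOOP if all_favourable else Oversight.IN_THE_LOOP


if __name__ == "__main__":
    # Example stages are invented for illustration.
    stages = [
        StageAssessment("dependency bump behind a feature flag", True, True, True, True),
        StageAssessment("database schema migration", False, True, False, True),
    ]
    for stage in stages:
        print(f"{stage.name}: {recommend_oversight(stage).name}")
```

The function deliberately has no out-of-the-loop outcome: the heuristic as described only distinguishes between keeping a checkpoint and moving to monitoring.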

One additional test helps in AI-heavy pipelines: evidence quality. Ask whether your confidence comes from repeatable production outcomes or from controlled demos. If it is mostly demo-based, treat autonomy claims as provisional and keep humans closer to the decision boundary. The evidence so far suggests that even when initial performance looks good enough, it does not necessarily remain high, and only longitudinal data can show whether it holds up.

A practical rollout sequence

If you are deciding now, this sequence is usually lower-risk than a big-bang shift:

  1. Start in-the-loop on coding and test generation.
  2. Instrument quality, review time, rollback events, and escaped defects.
  3. Promote specific low-risk workflows to on-the-loop.
  4. Reserve out-of-the-loop for tightly scoped, reversible tasks only.
  5. Reassess quarterly, because both model behaviour and team capability change.

This keeps the organisation learning while avoiding the false choice between full manual control and full autonomy. A critical part of this process is defining how you will identify and measure the failure modes introduced by the new workflow.
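As a rough sketch of steps 2 and 3, the instrumented signals can feed an explicit promotion check. Everything below is an assumption made for illustration: the names `WorkflowMetrics`, `PromotionThresholds`, and `ready_for_on_the_loop` are hypothetical, and the threshold values are placeholders rather than recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowMetrics:
    """Hypothetical counters for one AI-assisted workflow over a review period."""
    changes_shipped: int
    rollbacks: int
    escaped_defects: int          # defects found only after release
    total_review_minutes: float

    @property
    def rollback_rate(self) -> float:
        return self.rollbacks / self.changes_shipped if self.changes_shipped else 0.0

    @property
    def escaped_defect_rate(self) -> float:
        return self.escaped_defects / self.changes_shipped if self.changes_shipped else 0.0

    @property
    def mean_review_minutes(self) -> float:
        return self.total_review_minutes / self.changes_shipped if self.changes_shipped else 0.0


@dataclass(frozen=True)
class PromotionThresholds:
    """Placeholder limits; each team would set its own."""
    min_changes: int = 50                  # enough history to trust the rates
    max_rollback_rate: float = 0.02
    max_escaped_defect_rate: float = 0.01


def ready_for_on_the_loop(
    metrics: WorkflowMetrics,
    thresholds: PromotionThresholds = PromotionThresholds(),
) -> bool:
    """True only when there is enough history and both failure rates are low."""
    return (
        metrics.changes_shipped >= thresholds.min_changes
        and metrics.rollback_rate <= thresholds.max_rollback_rate
        and metrics.escaped_defect_rate <= thresholds.max_escaped_defect_rate
    )


if __name__ == "__main__":
    test_generation = WorkflowMetrics(
        changes_shipped=200, rollbacks=2, escaped_defects=1, total_review_minutes=1800.0
    )
    print(ready_for_on_the_loop(test_generation))
    print(test_generation.mean_review_minutes)
```

The minimum-history requirement is there so that a workflow with a handful of clean changes is not promoted on thin evidence. Review time is recorded but not used as a gate here, since whether shorter reviews are good or bad depends on what they are catching.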

One specific anti-pattern I am already seeing: the call for deeper instrumentation is misread as a tactic to delay AI adoption. It is not, although fear of missing out is considerable in this area and some see no need to move deliberately. From my perspective, it is impossible to judge long-term effectiveness without this data.

Closing rule of thumb

The best role for humans in an AI-enabled SDLC is not to approve everything forever, and not to disappear. It is to own the decision boundaries where the cost of being wrong is highest. The worst-case scenario, which I think leads to significant burnout, is for developers to be nagged every five minutes to respond to the latest agent request.

If you are unsure where to start, keep humans in the loop for architecture-impacting or customer-visible changes, and move repetitive low-blast-radius work to monitored automation. That approach is rarely perfect, but it is usually robust. Separating tasks by AI risk is a good step: adoption is faster and more resilient when delegated tasks are completed reliably.

The medium-term destination for most teams is likely not hands-off development. It is selective autonomy: more AI execution, paired with clearer human ownership of goals, constraints, and final accountability.