Developers Are Getting Slower With AI. Nobody Wants to Admit Why.
A rigorous study found AI makes experienced developers 19% slower. The gap between dark factories and everyone else isn't tools — it's org design.
Ninety per cent of the code at Anthropic is written by Claude. StrongDM ships production software with zero human code review. And a randomised controlled trial just found that experienced developers using AI tools complete tasks 19% slower than those working without them.
Read those three facts together and something doesn't add up. Either the frontier teams are lying, the study is wrong, or the vast majority of the software industry is doing something fundamentally broken with AI. It's the third one.
In 2025, METR published one of the first rigorous randomised controlled trials on AI-assisted software development. Not a vendor survey. Not a self-reported productivity metric from engineers trying to justify their tooling budget. An actual controlled experiment with experienced open-source developers.
The results should have been front-page news in every tech publication. Instead, they were quietly circulated and then mostly ignored, because the findings were deeply inconvenient.
Experienced developers using AI coding tools completed their tasks 19% slower than developers working without them. The researchers controlled for task difficulty, developer experience, and tool familiarity. None of it mattered. AI made them slower across the board.
But here's the detail that should properly alarm you: those same developers believed AI had made them 24% faster. They weren't just wrong about the magnitude — they were wrong about the direction. They thought they were accelerating while they were actually decelerating.
This isn't a fringe finding. It maps directly onto what adoption researchers call the J-curve: when you bolt a new capability onto an existing workflow, productivity dips before it improves. The dip happens because the tool changes the workflow, but the workflow hasn't been redesigned around the tool. You're running a new engine on an old transmission. The gears grind.
Most organisations are sitting at the bottom of that J-curve right now. Many are interpreting the dip as evidence that AI coding tools don't work. They're wrong — but not in the way they think.
Matt Shapiro's framework for AI-assisted development, recently unpacked by Nate B Jones, provides the clearest taxonomy I've seen for where teams actually sit on the automation spectrum. Five levels, from autocomplete to full autonomy.
Level 1 is the fancy tab key. GitHub Copilot in its original form — the human writes the software, the AI reduces keystrokes. It's a typing accelerator, nothing more.
Level 2 is task delegation. You hand the AI a discrete, well-scoped task: write this function, build this component, refactor this module. You review everything that comes back. Shapiro estimates 90% of developers who describe themselves as "AI-native" are operating here. From what I've seen across ecommerce agencies and in-house teams, he's right.
Level 3 is where it gets genuinely interesting. The developer becomes the manager. You're not writing code and having AI help — you're directing the AI and reviewing its output at the feature level, at the PR level. The model submits pull requests for your review. Almost everyone tops out here, because Level 3 is where you hit the psychological wall of letting go of the code.
Level 4 is the developer as product manager. You write a specification, you leave, you come back hours later and check whether the tests pass. You're not reading the code anymore. You're evaluating outcomes. This requires a level of trust in both the system and your own ability to write specifications that almost nobody has developed yet.
Level 5 is the dark factory. No human writes the code. No human reviews the code. Specification goes in, working software comes out, with the lights off. Almost nobody on the planet operates here. Almost.
The gap between where vendors claim their tools operate and where teams actually sit has never been wider. When a vendor says their tool "writes code for you," they usually mean Level 1. When a startup claims "agentic software development," they usually mean Level 2 or 3. When StrongDM says code must not be written or reviewed by humans, they genuinely mean Level 5 — and they actually operate there.
StrongDM's software factory is the most thoroughly documented example of Level 5 in production. Simon Willison — one of the most careful observers in the developer tooling space — calls it "the most ambitious form of AI-assisted software development that I've seen yet."
The team is three people. Three. They've been running the factory since July 2025, built around an open-source coding agent called Tractor. The entire repository is three markdown specification files. That's the agent. The specifications describe what the software should do, and the agent reads them and builds it.
But the architectural insight that makes this work — and that most people's mental model breaks on — is the distinction between tests and scenarios. Traditional tests live inside the codebase. The AI can read them, which means the AI can optimise for passing the tests rather than building correct software. It's the same problem as teaching to the test in education: perfect scores, shallow understanding.
StrongDM's scenarios live outside the codebase. They're behavioural specifications stored separately, functioning as a holdout set — the same concept machine learning uses to prevent overfitting. The agent builds the software. The scenarios evaluate whether it works. The agent never sees the evaluation criteria. It cannot game the system.
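To make the holdout idea concrete, here's a minimal sketch of how a scenario runner might be wired up. Everything in it (the scenarios directory, the service URL, the JSON shape of a scenario) is hypothetical rather than StrongDM's actual implementation; the point is simply that the pass/fail criteria live somewhere the agent never reads.

```python
# Minimal sketch of a holdout-style scenario runner (illustrative, not StrongDM's code).
# Scenarios live OUTSIDE the repository the agent works in, so the agent never sees them.
# Each scenario describes a request to the built service and the behaviour expected back.
import json
import pathlib
import urllib.request

SCENARIO_DIR = pathlib.Path("/opt/holdout-scenarios")   # outside the agent's workspace
SERVICE_URL = "http://localhost:8080"                    # the software the agent built

def run_scenario(path: pathlib.Path) -> bool:
    spec = json.loads(path.read_text())
    req = urllib.request.Request(
        SERVICE_URL + spec["endpoint"],
        data=json.dumps(spec.get("body", {})).encode(),
        headers={"Content-Type": "application/json"},
        method=spec.get("method", "POST"),
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    # The pass/fail criteria stay in this harness, never in the codebase the agent edits.
    return all(result.get(k) == v for k, v in spec["expect"].items())

if __name__ == "__main__":
    results = {p.name: run_scenario(p) for p in sorted(SCENARIO_DIR.glob("*.json"))}
    failed = [name for name, ok in results.items() if not ok]
    print(f"{len(results) - len(failed)}/{len(results)} scenarios passed")
    raise SystemExit(1 if failed else 0)
```

The structural property, not the particular code, is what matters: the agent can iterate on the software all it likes, but it can only discover whether it succeeded the same way a user would, by the software behaving correctly.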
They've also built what they call a "digital twin universe" — behavioural clones of every external service the software interacts with. Simulated Okta, simulated Jira, simulated Slack. The agents develop against these twins, running full integration scenarios without touching production systems. The output is real: 16,000 lines of Rust, 9,500 lines of Go, 700 lines of TypeScript, all shipping in production.
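For flavour, here is a toy version of one such twin: a fake chat API that records whatever the system under test sends it, so a scenario can later assert on that behaviour. It's an illustrative sketch only; StrongDM's actual twins aren't public, and the endpoint and payload shapes here are invented.

```python
# A toy "digital twin" of one external dependency: a Slack-like chat API stub.
# Agents develop and run integration scenarios against this instead of the real service.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SENT_MESSAGES = []  # the scenario harness can read this back to assert on behaviour

class FakeChatAPI(BaseHTTPRequestHandler):
    def _reply(self, payload: dict) -> None:
        data = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_POST(self):
        if self.path == "/api/chat.postMessage":
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            SENT_MESSAGES.append(body)                      # record the side effect
            self._reply({"ok": True, "ts": str(len(SENT_MESSAGES))})
        else:
            self._reply({"ok": False, "error": "unknown_method"})

    def do_GET(self):
        # Scenarios query this to see what the system under test "sent to Slack".
        self._reply({"messages": SENT_MESSAGES})

if __name__ == "__main__":
    HTTPServer(("localhost", 9090), FakeChatAPI).serve_forever()
```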
And then there's the metric that tells you how seriously they take it: if you haven't spent $1,000 per human engineer per day on compute, your software factory has room for improvement. That's not a joke. At that spend, agents run in enough parallel volume to produce meaningful software, and it's still cheaper than the engineers they'd otherwise need.
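The arithmetic behind that claim is worth making explicit. Using assumed figures (250 working days a year, a $300,000 fully loaded cost per engineer; neither number comes from StrongDM), the threshold looks like this:

```python
# Back-of-envelope check on the $1,000/engineer/day compute threshold.
# All inputs below are assumptions for illustration, not figures from StrongDM.
compute_per_engineer_per_day = 1_000      # USD, the "room for improvement" threshold
working_days_per_year = 250
annual_compute = compute_per_engineer_per_day * working_days_per_year   # $250,000

fully_loaded_engineer_cost = 300_000      # USD, assumed salary plus overheads

print(f"Annual compute per engineer: ${annual_compute:,}")
print(f"Cheaper than one extra engineer: {annual_compute < fully_loaded_engineer_cost}")
```

A quarter of a million dollars a year in compute per engineer sounds extravagant right up until you price it against the extra humans the same output would otherwise require.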
No sprints. No standups. No Jira board. They write specs and evaluate outcomes. The entire coordination layer that constitutes the operating system of a modern software organisation — the layer most engineering managers spend 60% of their time maintaining — simply doesn't exist.
GitHub Copilot has 20 million users and 42% market share among AI coding tools. Lab studies show 55% faster code completion on isolated tasks. That statistic probably features prominently in Microsoft's slide decks.
In production, the reality is brutal. Larger pull requests. Higher review costs. More security vulnerabilities introduced by generated code. Developers wrestling with how to make it work rather than having it just work. One senior engineer summarised it with surgical precision: "Copilot makes writing code cheaper, but owning it more expensive."
That sentiment isn't unique to Copilot. It's the universal experience of bolting AI onto an unreformed workflow. The organisations seeing genuine productivity gains — 25%, 30%, or more — are not the ones that installed Copilot, held a one-day seminar, and called it digital transformation. They're the ones that went back to the whiteboard and redesigned their entire development workflow: how they write specs, how they review code, what they expect from different seniority levels, how their CI/CD pipelines catch the new category of errors that AI-generated code introduces.
End-to-end process transformation. Not tool adoption. And end-to-end process transformation is hard, politically contentious, expensive, and slow. Most companies don't have the stomach for it. Which is why most companies are stuck at the bottom of the J-curve, which is why the gap between frontier teams and everyone else isn't just widening — it's accelerating.
In the UK, graduate tech roles fell 46% in 2024, with a further 53% drop projected by 2026. In the US, junior developer job postings have declined by 67%. The junior developer pipeline is collapsing, and the implications go far beyond the people who can't find entry-level work — though that's bad enough.
The career ladder in software engineering has always been an apprenticeship model wearing enterprise clothing. Juniors learn by doing: writing simple features, fixing small bugs, absorbing the codebase through immersion. Seniors review the work, catch mistakes, and mentor. Over five to seven years, the junior becomes a senior through accumulated experience.
AI breaks that model at the foundation. If AI handles simple features and small bug fixes — the work juniors learn on — where do juniors learn? If AI reviews code faster and more thoroughly than a senior doing PR review, where does mentorship happen? The career ladder isn't disappearing at the top. It's being hollowed out from underneath: seniors at the top, AI at the bottom, and a thinning middle where learning used to happen.
The cruel irony is that we need more excellent engineers than ever before. Not fewer engineers — better ones. The bar is rising toward exactly the skills that have always been hardest to develop: systems thinking, customer intuition, the ability to hold an entire product in your head and reason about how the pieces interact, the ability to write a specification clearly enough that an autonomous agent can implement it correctly.
Those skills separated great engineers from adequate ones long before AI. The difference now is that adequate is no longer a viable career position at any seniority level, because adequate is what the models do. The junior of 2026 needs the systems design understanding that was expected of a mid-level engineer in 2020 — not because entry-level work got harder, but because entry-level work got automated and what remains requires deeper judgment.
Most software organisations were designed to facilitate humans building software. Every process, every ceremony, every role exists because humans working in teams need coordination structures. Standups exist because developers on the same codebase need daily synchronisation. Sprint planning exists because humans can only hold a certain number of tasks in working memory. Code review exists because humans make mistakes that other humans catch. QA teams exist because builders can't objectively evaluate their own output.
Every one of these structures is a response to a human limitation. When the human is no longer writing the code, those structures aren't optional overhead — they're active friction.
What does sprint planning look like when implementation happens in hours, not weeks? What does code review look like when no human can meaningfully review the diff an AI produced in twenty minutes, because it's going to produce another one in twenty more? What does QA do when the AI already tested against scenarios it was never shown?
The engineering manager's value is shifting from "coordinate the team building the feature" to "define the specification clearly enough that agents build the feature." The programme manager's value is shifting from "track dependencies between human teams" to "architect the pipeline that routes specifications to the right agents." The skills that matter are moving from coordination to articulation — from making sure people row in the same direction to making sure the direction is described precisely enough that machines can execute it.
If you think this is a trivial shift, you've never tried to write a specification detailed enough for an AI agent to implement correctly. It requires rigorous systems thinking that most organisations have never needed from most of their people, because the humans on the other end of the spec could fill in gaps with judgment, context, and a Slack message saying "did you mean X or Y?" Machines don't have that layer. They build what you described. If what you described was ambiguous, you get software that reflects the ambiguity.
The AI-native startups are showing what a restructured org actually looks like in financial terms — and the numbers are staggering.
Cursor, the AI-native code editor, is past half a billion dollars in annual recurring revenue with a headcount in the low hundreds. That works out to roughly $3.5 million in revenue per employee. The average SaaS company generates $600,000 per employee. Midjourney tells a similar story: half a billion in revenue with around a hundred people. Lovable is well into multi-hundred-million ARR with a team that's scaling but remains a fraction of what those revenues would traditionally require.
The top AI-native startups are averaging north of $3 million in revenue per employee — five to six times the SaaS average. This is happening often enough that it's not an outlier. It's the template.
What does a company generating $100 million a year with 15 people actually look like? It doesn't have a traditional engineering team, a traditional product team, a QA team, or a DevOps team. It looks like a small group of people who are exceptionally good at understanding what users need, translating that into clear specifications, and directing AI systems that handle implementation. The org chart is flat. The layers of coordination that exist to manage hundreds of engineers can be deleted when the engineering is done by agents.
For agencies and ecommerce businesses watching from the outside, this isn't abstract futurism. It's an existential competitive threat. A three-person team that can ship production-grade software at the rate of a fifty-person team doesn't just change staffing models — it changes pricing, delivery timelines, and the fundamental economics of who can compete for which contracts.
Everything above assumes you're building from scratch. Most of the software economy isn't. The vast majority of enterprise software is brownfield: existing systems accumulated over years, running in production, carrying real revenue, with configuration knowledge that exists in the heads of the three people who've been at the company long enough to remember why that one environment variable is set to that particular value.
You cannot dark-factory your way through a legacy system. The specification for it doesn't exist. The tests, if there are any, cover 30% of the codebase. The other 70% runs on institutional knowledge and tribal lore and someone who shows up in an Apollo shirt and knows where all the skeletons are buried.
For most organisations, the path forward starts not with "deploy an agent that writes code" but with "develop a specification for what your existing software actually does." That reverse-engineering work — extracting the implicit knowledge embedded in a running system — is deeply, irreducibly human. It requires the engineer who knows why the billing module has that one edge case for Canadian customers. The architect who remembers which microservice was carved out of the monolith under duress during the 2021 outage. The product person who can explain what the software actually does for users versus what the PRD says.
Domain expertise. Ruthless honesty. Customer understanding. Systems thinking. Exactly the human capabilities that matter more in the dark factory era, not less.
We've never found a ceiling on demand for software. Every time the cost of computing dropped — mainframes to PCs, PCs to cloud, cloud to serverless — the total amount of software the world produced didn't stay flat. It exploded. New categories that were economically impossible at the old cost structure became viable, then ubiquitous, then essential.
We're now dropping the cost of software production by an order of magnitude. The addressable market is expanding, not contracting. A custom inventory system that would have cost half a million and taken eighteen months can now be specified and built in weeks. Patient portal integrations that were out of reach for smaller healthcare providers are becoming feasible. The unmet demand for software across every industry vertical is enormous, and it's suddenly becoming addressable.
But — and this matters — saying "the market is getting bigger" doesn't fix the career disruption for the QA engineer whose manual test passes are being automated, or the engineering manager whose coordination role is evaporating, or the junior developer who can't find an entry-level position because the entry-level work is being handled by models that cost a fraction of a salary.
The honest answer is uncomfortable. The dark factory is real. It's not hype. Three-person teams are producing software that ships to real users, and that software improves with every model generation. The tools are building themselves — Claude Code was 90% built by Claude Code, Codex was instrumental in creating its successor — and the feedback loop is closed. The frontier is accelerating.
Meanwhile, most organisations are stuck at Level 2, getting measurably slower with AI tools they believe are making them faster. The gap between frontier and mainstream isn't a technology gap. It's a people gap. A culture gap. An organisational gap. A willingness-to-change gap that no tool and no vendor can close.
The enterprises and agencies that cross this distance won't be the ones that buy the best coding tool. They'll be the ones that do the very hard, very slow, very unglamorous work of documenting what their systems actually do, rebuilding their org charts around judgment instead of coordination, and investing in people who understand systems and customers deeply enough to direct machines to build something that should be built.
The constraint has moved. It's no longer "can we build it." It's "should we build it." And "should we build it" has always been the harder question. It just used to be hidden behind the difficulty of building it at all.
The dark factory didn't remove the need for great engineers. It stripped away the camouflage that let adequate ones hide. We're all about to find out how good we actually are at building software — and for a lot of organisations, that reckoning is going to sting.