The 11-Year-Old Bug Hidden Inside Every AI Model You Use

Every AI model — ChatGPT, Claude, Gemini — runs on plumbing nobody questioned for over a decade. A Chinese lab just found the flaw.

40 min read

Published 19 March 2026

The Bug Nobody Looked For

Eleven years. That’s how long a fundamental flaw sat inside every major AI model on the planet — ChatGPT, Claude, Gemini, Llama, all of them — and nobody noticed. Not OpenAI. Not Google DeepMind. Not Anthropic. Not Meta. Nobody.

The flaw wasn’t in some obscure edge case or exotic feature. It was in residual connections — the most basic plumbing in the entire transformer architecture. The mechanism that passes information from one layer to the next. The thing that makes deep neural networks work at all. It’s been there since 2015, unchanged, unquestioned, treated like a law of physics.

Then, earlier this month, Moonshot AI’s Kimi team published a paper called “Attention Residuals” that essentially said: this plumbing is broken, here’s why, and here’s how to fix it. The fix gives you 25% more effective compute for free. The inference cost? Under 2%.

I’ve been in ecommerce for 26 years, and this story isn’t really about AI architecture. It’s about something much more dangerous: the assumptions we never think to question.

A Very Quick History of the Plumbing

To understand what Moonshot AI found, you need about ninety seconds of context. I’ll keep it tight.

In 2015, a team at Microsoft Research led by Kaiming He published a paper called “Deep Residual Learning for Image Recognition”. They introduced something called residual connections — a clever trick that lets neural networks grow much deeper without the signal degrading into noise. The idea was simple: instead of each layer transforming the input and passing it forward, you also add a shortcut that passes the original input alongside the transformation. Each layer gets the new stuff and the old stuff. Problem of vanishing gradients solved. They called the architecture ResNet, it won every competition going, and the technique became gospel.
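If it helps to see the trick as code, here is a minimal sketch in PyTorch, with linear layers standing in for ResNet's actual convolutions:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One layer with a residual (skip) connection, in the spirit of He et al. (2015)."""

    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for the layer's transformation; ResNet itself uses convolutions
        self.transform = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # The shortcut: the original input is added alongside the transformation
        return x + self.transform(x)
```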

Two years later, in 2017, Vaswani et al. published “Attention Is All You Need” — the paper that introduced the transformer. This is the architecture behind every large language model you’ve ever used. And what did they use for passing information between layers? Residual connections. Unchanged from 2015. Same mechanism that was designed for image recognition, bolted directly into the architecture that would power the entire AI revolution.

Nobody questioned it. For eleven years.

What’s Actually Wrong

Here’s the problem, and I’m going to explain it without the maths because this matters for anyone who manages technology, not just people who build it.

Think of a residual connection like an editor working on a document. Each layer of the neural network is like a round of edits — adding new insights, refining ideas, correcting mistakes. In a transformer with residual connections, every round of edits gets added to the document. So far, so good.

But here’s the catch: every edit gets added with equal weight. The brilliant insight from layer 12? Same importance as the trivial formatting change from layer 3. The crucial reasoning step from layer 47? Same weight as the noise from layer 8. There’s no filtering. No prioritising. No mechanism for the model to say “this edit matters, that one doesn’t.”

As MarkTechPost reported, the Kimi team identified three specific problems with this approach. First, every layer receives the exact same blended signal — there’s no selective access to earlier representations. Second, once information is merged into the residual stream, later layers can’t recover specific earlier representations. The signal is irreversibly diluted. Third, as the network gets deeper, layers have to produce increasingly large outputs just to make themselves heard above the accumulated noise.

The paper calls this “representation dilution.” By the time you get to layer 50 or 60 in a deep model, the useful signal from earlier layers is buried under the accumulated weight of every other layer’s contribution. The model is drowning in its own history because it can’t tell the important parts from the irrelevant ones.
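To make the dilution concrete, here is the same sketch unrolled across a whole stack. This is a toy version that ignores layer norms and the attention/MLP internals:

```python
import torch.nn as nn

def residual_stream(x, layers: nn.ModuleList):
    """Toy residual stream: every layer's output is summed in with weight 1.0."""
    h = x
    for layer in layers:
        h = h + layer(h)  # no mechanism to down-weight a noisy contribution
    # h is now x + out_1 + out_2 + ... + out_N, irreversibly blended:
    # a later layer cannot recover "just layer 12's output" from h alone.
    return h
```

All three problems are visible in that loop: one blended signal, no way back, and a layer that wants to be heard has to shout over the accumulated sum.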

Think about that. Every time you ask ChatGPT a question, every time Claude helps you write something, every time Gemini analyses your data — the model is fighting against this dilution in real time. It’s working harder than it needs to, producing worse results than it could, because the plumbing underneath it treats all information as equally important.

The Fix: Attention Goes Vertical

Here’s where it gets elegant. The entire revolution of transformers was built on one core idea: attention. Instead of processing words in sequence, let the model attend to all the words at once and figure out which ones matter most for the current task. That’s attention across the sequence — horizontal attention, if you like.
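Stripped of multi-head plumbing and masking, that horizontal attention is only a few lines:

```python
import torch.nn.functional as F

def sequence_attention(q, k, v):
    """Standard scaled dot-product attention from Vaswani et al. (2017)."""
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # token-to-token relevance
    weights = F.softmax(scores, dim=-1)                      # a distribution over tokens
    return weights @ v                                       # weighted mix of what matters
```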

Moonshot AI’s fix is conceptually identical, just applied in a different direction. Instead of attention across words, they apply attention across layers — vertical attention. Each layer gets to look at all the previous layers and choose which ones to focus on.

Layer 50 might decide that layers 12 and 33 contain the most relevant information for the current computation and weight them heavily, while effectively ignoring the noise from layers 4 through 11. The model learns, during training, which layers are most useful to which other layers. Instead of a uniform stream of everything, each layer gets a curated selection of what actually matters.
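Here is a deliberately simplified sketch of the idea. To be clear, this is my illustration of the concept, not the mechanism from the paper, which defines its own learned parameterisation:

```python
import torch
import torch.nn.functional as F

def layer_attention(current, earlier_outputs):
    """Toy 'vertical' attention: score every earlier layer's output against the
    current state and take a weighted mix, instead of the uniform sum that a
    plain residual stream delivers. Illustrative only."""
    stack = torch.stack(earlier_outputs, dim=1)            # [batch, n_layers, dim]
    scores = torch.einsum("bd,bld->bl", current, stack)    # one score per earlier layer
    weights = F.softmax(scores / stack.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("bl,bld->bd", weights, stack)      # a curated residual input
```

Layer 50 preferring layers 12 and 33 is just that softmax putting most of its weight on those two entries.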

The Kimi team’s paper and code describe two versions. Full Attention Residuals applies this mechanism at every single layer — maximum flexibility, but expensive. The practical version, Block Attention Residuals, groups layers into roughly eight blocks and applies attention between blocks. This gives you most of the benefit at a fraction of the cost.
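As a sketch only, reusing layer_attention from the snippet above, the block variant might look like this. The exact grouping and placement belong to the paper; this is just the shape of the idea:

```python
def forward_block_residuals(x, layers, n_blocks: int = 8):
    """Toy Block Attention Residuals: plain residuals inside each block, with
    the vertical-attention step applied only at block boundaries."""
    block_size = max(1, len(layers) // n_blocks)
    h, block_summaries = x, []
    for i, layer in enumerate(layers):
        h = h + layer(h)                              # cheap path inside a block
        if (i + 1) % block_size == 0:                 # block boundary
            if block_summaries:
                h = h + layer_attention(h, block_summaries)
            block_summaries.append(h)
    return h
```

Running attention at eight boundaries instead of at every one of sixty-odd layers is where that “fraction of the cost” comes from.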

The results back this up. QuantoSei reported that Block Attention Residuals matches the performance of a baseline model trained with approximately 1.25 times more compute. In other words, the fix buys you the equivalent of 25% extra compute without spending it. For context, in a world where training runs cost hundreds of millions of dollars and data centres draw electricity at the rate of small cities, a 25% efficiency gain is enormous.

They tested it on their own Kimi Linear model — a 48 billion parameter Mixture-of-Experts architecture with 3 billion activated parameters, pre-trained on 1.4 trillion tokens. The improvements were consistent across the board:

  • MMLU: 73.5 → 74.6

  • GPQA Diamond (a demanding reasoning benchmark): 36.9 → 44.4

  • BBH: 76.3 → 78.0

  • Math: 53.5 → 57.1

  • HumanEval: 59.1 → 62.2

Every benchmark improved. That GPQA Diamond leap of 7.5 points is a massive gain on a benchmark specifically designed to test deep reasoning ability.

Training cost overhead? Under 4%. Inference latency increase? Under 2%. This isn’t a trade-off. It’s a straightforward improvement that’s been sitting there, waiting to be discovered, for over a decade.

Why Nobody Questioned It

This is the part that should bother you. Not the technical details — the human details.

Residual connections worked. They made deep networks trainable. They won ImageNet. They powered the transformer revolution. They sat at the foundation of every breakthrough from GPT-2 to GPT-4, from BERT to Gemini. Why would anyone question something that was so clearly, demonstrably successful?

That’s exactly the trap. “It works” became “it works well” became “it works optimally” — without anyone actually verifying the last step. The mechanism was so fundamental, so foundational, so obviously correct that it became invisible. Nobody questions the foundation of a building while the building is still going up.

Researcher Ziming Liu published an independent analysis of the Attention Residuals paper that adds a nuance worth understanding. He tested the technique on structured versus random data and found that attention residuals work best on structured data — data with clear patterns, rules, and relationships. Standard residual connections can actually perform better on random, chaotic data.

Why does this matter? Because language is highly structured. Grammar follows rules. Logic has patterns. Code has syntax. The data that large language models are trained on is exactly the kind of data where attention residuals provide the biggest advantage. The researchers who built the transformer in 2017 borrowed a mechanism designed for image recognition and never asked whether it was optimal for language. For eleven years, every LLM has been slightly handicapped by a design choice that was never validated for the specific domain it was being used in.

And here’s what keeps me up at night: if the smartest AI researchers on the planet — people at Google, OpenAI, Anthropic, Meta, the best-funded labs in human history with access to the best talent in the world — missed a fundamental optimisation hiding in the most basic piece of their architecture for over a decade, what else have they missed? What other assumptions are sitting there, invisible, quietly making everything slightly worse than it should be?

This Is Your Ecommerce Stack

Now let’s talk about you. Because the same pattern — the exact same pattern — is playing out in every ecommerce business I work with.

Most Shopify stores run on assumptions from 2018-2020. Not technology from that era — that gets updated. Assumptions from that era. The assumptions about how customers browse, how they discover products, what convinces them to buy, and how they want to be communicated with. Those assumptions get baked in during the initial build and then never questioned because the store keeps working.

“It works” isn’t the same as “it works well.” And “it works well” definitely isn’t the same as “it couldn’t work better.” But once something is operational and generating revenue, the incentive to question it drops to zero. You fix things that break. You don’t audit things that seem fine.

The AI researchers had residual connections — plumbing that worked but silently degraded performance at every layer. You have your own version of the same problem. It might be:

  • Your product discovery logic. Still built on the assumption that customers browse category pages the way they did five years ago. They don’t. AI-powered search, visual search, and conversational commerce have changed how people find products. But the three-column product grid and the hierarchical category tree? Still there. Unchanged. Unquestioned.

  • Your SEO strategy. Built for a world before Google AI Overviews. Your team is still optimising for keyword rankings when an increasing percentage of search traffic never reaches your site because Google answers the query directly. The assumptions in your SEO playbook were valid in 2021. They’re actively harmful in 2026.

  • Your email marketing automation. Running 2020-era playbooks. Welcome series, abandoned cart flows, post-purchase sequences — all built on timing and trigger assumptions from a world where customers checked email on desktop at 9am. Customer attention patterns have fundamentally changed, but the flows keep running because they’re “performing okay.” Okay isn’t good. Okay is residual connections — technically functional, quietly suboptimal.

  • Your checkout flow. Designed for a world where the customer navigated to your site, added items to a basket, and stepped through a multi-page checkout. But what about agent-mediated purchases? What about social checkout? What about the growing percentage of buyers who expect to complete a transaction without ever seeing a traditional cart page? Your checkout works. But for whom, exactly?

  • Your tech stack architecture. Fifty Shopify apps, each solving a specific problem, none talking to each other, all adding JavaScript and latency and data fragmentation. It works. It generates revenue. It’s also the equivalent of passing every signal through every layer with equal weight and wondering why the important stuff gets lost.

These aren’t hypotheticals. I see this in every audit I do. Businesses running on accumulated assumptions that nobody has the time, budget, or inclination to question. The store works. Revenue comes in. Why rock the boat?

I worked with a fashion brand last year that had been running the same product recommendation algorithm since their Shopify migration in 2019. It was surfacing “similar products” based on shared tags — a method that made sense when they had 400 SKUs. They now have 3,200. The recommendation engine was technically functional, but it was operating on the same assumption as a residual connection: treat everything with equal weight and hope the right signal emerges. When we rebuilt the logic to account for purchase patterns, seasonal relevance, and margin data, their average order value increased 18%. The old system wasn’t broken. It was just quietly suboptimal — for six years.
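For flavour, here is a hypothetical sketch of the shape of that rebuild. The signal names and weights are invented for illustration, not the client's actual model:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sku: str
    co_purchase_rate: float    # how often it sells alongside the anchor product
    seasonal_relevance: float  # 0..1, from historical sales for the current month
    margin: float              # normalised contribution margin

def recommend(candidates: list[Candidate], top_n: int = 4) -> list[Candidate]:
    # Weighted signals replace "shares a tag", which gave every match equal weight
    def score(c: Candidate) -> float:
        return 0.5 * c.co_purchase_rate + 0.3 * c.seasonal_relevance + 0.2 * c.margin
    return sorted(candidates, key=score, reverse=True)[:top_n]
```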

Another client, a supplements brand, had been running their abandoned cart email sequence unchanged since 2020. Three emails, sent at 1 hour, 24 hours, and 72 hours after abandonment. “Industry best practice” from the Klaviyo blog they’d read when setting it up. When we actually tested alternative timings against their specific customer behaviour data, the optimal sequence was completely different. The first email at 20 minutes performed 340% better than the one at 1 hour. They’d been leaving money on the table every single day for five years because nobody thought to question a timing assumption they’d copy-pasted from a blog post half a decade ago.
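The test itself doesn't need to be sophisticated. Something as simple as this, run against your own event export, settles the question. The file and column names here are illustrative:

```python
import pandas as pd

# Assumed: one row per abandonment, with the delay before email #1 and
# whether the customer came back and completed the order.
events = pd.read_csv("abandonment_events.csv")
events["delay_bucket"] = pd.cut(
    events["email_delay_minutes"],
    bins=[0, 20, 60, 240, 1440],
    labels=["<20m", "20-60m", "1-4h", "4-24h"],
)
recovery = events.groupby("delay_bucket", observed=True)["recovered"].mean()
print(recovery.sort_values(ascending=False))
```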

Why rock the boat? Because the boat has a leak. You just haven’t noticed it yet.

The Structured Data Parallel

Ziming Liu’s finding — that attention residuals work best on structured data — has a direct parallel in commerce that most people will miss.

Ecommerce data is highly structured. Product catalogues follow taxonomies. Pricing follows rules. Order patterns are seasonal and predictable. Customer segments cluster around behaviours. Supply chains operate on logistics that can be modelled. This is precisely the kind of environment where accumulated, unquestioned assumptions do the most damage — because the structure gives you a false sense of confidence that everything is working correctly.

Random data is chaotic by definition. There’s no pattern to optimise, so the plumbing doesn’t matter as much. But structured data — data with inherent patterns and relationships — is exactly where bad assumptions compound. When the data has a pattern and your system isn’t optimised to find it, you lose performance at every layer of the stack. Not dramatically. Not in a way that breaks anything. Just a steady, invisible drag on everything you do.

Product recommendations that are 80% as good as they could be. Pricing that’s 90% optimal. Marketing automation that captures 75% of available revenue. Each individually acceptable. Collectively? 0.8 × 0.9 × 0.75 comes to barely 54% of your potential. You’re leaving a fortune on the table because nobody questioned whether the plumbing was right.

The Compound Cost of “Fine”

Here’s the maths that should terrify every ecommerce operator. Moonshot AI’s Attention Residuals paper showed that fixing one foundational assumption — one — delivered 25% more effective compute across the entire stack. One fix. Twenty-five percent.

Now think about your business. How many foundational assumptions are sitting unchallenged? Five? Ten? Twenty? If each one is costing you even 5% of potential performance, the compound effect is devastating. You’re not losing 5% — you’re losing 5% on top of 5% on top of 5%, multiplied across every customer interaction, every marketing campaign, every product page, every email send.
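A back-of-the-envelope version of that compounding, assuming a flat 5% cost per assumption:

```python
# Each unquestioned assumption retains 95% of potential performance;
# the losses multiply rather than add.
for n in (1, 5, 10, 20):
    retained = 0.95 ** n
    print(f"{n:2d} assumptions -> {retained:.0%} of potential, {1 - retained:.0%} lost")

#  1 assumptions -> 95% of potential, 5% lost
#  5 assumptions -> 77% of potential, 23% lost
# 10 assumptions -> 60% of potential, 40% lost
# 20 assumptions -> 36% of potential, 64% lost
```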

The AI researchers had one bug in one piece of plumbing, and it cost 25% of compute efficiency across every benchmark. You have multiple unquestioned assumptions across multiple systems. How much are they costing you?

I’ll tell you what I see in practice. When we do deep architecture audits — the kind where we actually question the foundational assumptions rather than just optimise within existing constraints — we typically find 30-50% performance improvements. Not from new technology. Not from replatforming. From questioning the assumptions that were baked in at the beginning and never revisited.

How to Run Your Own Audit

Moonshot AI questioned something that literally every other lab in the world took for granted. You need to do the same thing with your business. Here’s how to start.

List your foundational assumptions. Not your technology choices — your assumptions. “Customers discover products through category navigation.” “Email open rates peak at 10am Tuesday.” “Three product images are enough.” “Our checkout conversion rate is normal for our industry.” Write them down. All of them. You’ll be surprised how many there are and how old some of them are.

Date them. When was each assumption last validated? Not when was it last discussed — when was it last tested with actual data? If the answer is “never” or “2020,” that assumption is your residual connection. It’s probably still functional. It’s almost certainly suboptimal.

Prioritise by impact. Which assumptions, if wrong, would have the biggest impact on your business? Your checkout flow assumptions affect every single transaction. Your email timing assumptions affect every single send. Start with the assumptions that touch the most revenue.
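Even the register itself can be a few lines of code. A hypothetical sketch, with a made-up staleness-times-impact score, just to show how little tooling this step needs:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Assumption:
    statement: str
    last_validated: date | None    # None means never tested with data
    annual_revenue_touched: float  # rough estimate, your currency

def priority(a: Assumption, today: date = date(2026, 3, 19)) -> float:
    # Never-validated assumptions are treated as ten years stale
    years_stale = 10.0 if a.last_validated is None else (today - a.last_validated).days / 365
    return years_stale * a.annual_revenue_touched

register = [
    Assumption("Customers discover products via category navigation", None, 2_000_000),
    Assumption("Abandoned-cart email #1 goes out at 1 hour", date(2020, 6, 1), 450_000),
    Assumption("Three product images are enough", date(2024, 2, 1), 900_000),
]
for a in sorted(register, key=priority, reverse=True):
    print(f"{priority(a):>12,.0f}  {a.statement}")
```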

Test ruthlessly. Don’t audit by committee. Don’t form a working group. Pick one assumption, design a test, run it, measure the result. The Kimi team didn’t publish a white paper about how residual connections might be suboptimal. They built the alternative, tested it on a 48 billion parameter model, and published the benchmarks.

Accept that “it works” might be the problem. The hardest part of this entire exercise is accepting that things that are demonstrably functional might be significantly underperforming. Residual connections worked. They powered every AI model on the planet. And they were leaving 25% on the table the entire time. Your equivalent might be the checkout flow that converts at 2.5% when it could convert at 3.5%. The difference doesn’t feel dramatic. Over a year of transactions, it’s enormous.

The 11-Year Lesson

The Moonshot AI paper is a technical achievement, but the real lesson has nothing to do with neural networks. The lesson is about institutional blindness — the way successful systems create their own immunity from scrutiny.

Residual connections were so foundational, so universally adopted, so clearly “right” that questioning them felt absurd. Why would you interrogate something that the entire field has relied on for a decade? Why challenge a mechanism that powered the most successful AI models in history?

Because “the entire field relies on it” is not the same as “it’s optimal.” Because universal adoption is not validation. Because success despite a flaw is not the same as success because of good design.

Your ecommerce stack has its own version of residual connections. Technology and assumptions that were good enough at the time, that have been running long enough to feel permanent, that nobody questions because they don’t break. They just silently make everything a little worse than it could be. Layer by layer. Day by day. Transaction by transaction.

The Kimi team fixed an 11-year-old bug and got 25% more performance for free. The question isn’t whether your business has the same kind of bugs. It’s how many.

The researchers at Moonshot AI didn’t need a bigger model or more data or a breakthrough in machine learning theory. They just needed to look at something everyone else had stopped looking at. The fix was elegant, the cost was minimal, and the performance gains were immediate.

Your business has the same opportunity. Somewhere in your stack, right now, there’s an assumption that was good enough in 2019 and is costing you real money in 2026. Nobody’s going to find it for you. Your platform won’t flag it. Your analytics dashboard won’t highlight it. It looks normal because it’s always been there.

Time to start looking.
