The AI Trap Commerce Teams Can't See — It's Not the Model, It's the Harness

Commerce teams compare AI models obsessively. Meanwhile, the harness wrapping those models is building irreversible lock-in nobody's pricing.

33 min read

Published 6 March 2026

The Wrong Comparison Is Costing You

Every week, another comparison drops. Claude versus GPT. Gemini versus the lot. The benchmarks get sliced into neat tables, someone declares a winner, and by next Tuesday the rankings shuffle again. Commerce teams — agencies, brands, platform developers — consume these comparisons like match-day results. Who's ahead? Who's slipping?

It's the wrong question. Not slightly wrong. Fundamentally, structurally, expensively wrong.

When you pick an AI coding agent, a content generation tool, or an agentic commerce platform, you're not just choosing a model. You're choosing a harness — the entire execution environment that determines where the AI runs, what it can access, how it remembers context between sessions, how it connects to your existing tools, and what happens when something goes wrong. The model is the brain. The harness is everything else: the hands, the memory, the workspace, the safety rails, the integration layer.

And here's the uncomfortable truth that nobody in commerce is talking about: the harness is where the lock-in lives.

At the AI Engineer Summit in January 2026, Anthropic presented results showing that when Claude models are combined with a multi-agent harness, benchmark scores increase dramatically — in some configurations nearly doubling compared to the same model running inside a simpler execution framework. Same brain. Radically different performance. That's not a rounding error. That's a structural multiplier determined entirely by the harness, not the model.

If you're running an ecommerce operation and you think your AI tool choice comes down to which model is cleverest, you're pricing the wrong variable.

Two Architectures, Two Philosophies, One Industry That Hasn't Noticed

The two dominant AI coding platforms — Anthropic's Claude Code and OpenAI's Codex — have made genuinely different architectural bets about how humans and AI should work together. These aren't cosmetic differences. They're philosophical commitments that shape every interaction your team has with the tool, and they're diverging on purpose.

Claude Code runs in your actual terminal. Your shell, your environment variables, your SSH keys, your file system. Its philosophy is essentially "bash is all you need" — rather than building dozens of specialised tools with expensive token-consuming descriptions, the agent chains together composable Unix primitives like grep, git, and npm on the fly. Analysis has shown that GitHub's MCP server exposes 38 tools whose descriptions alone consume roughly 15,000 tokens of context, whilst the GitHub CLI achieves equivalent functionality at a fraction of that cost. Running in the terminal gives the agent access to everything a human engineer would have.

The trade-off? The trust boundary is your entire workstation. You're handing the AI the keys to your machine.

Codex takes the opposite approach. Every task runs in an isolated cloud container. Your code is cloned in, internet access is disabled by default, and the agent works independently. Where Claude Code manages risk through incrementalism and human oversight, Codex manages risk through isolation and mechanical enforcement. It's safer by default, but the agent can't reach your local tools, your existing scripts, your custom workflows.

One is a collaborator sitting at the desk next to you. The other is a contractor in a clean room, sliding finished work under the door.

Now — why should a commerce director, an agency founder, or a Shopify Plus merchant care about the architectural decisions of AI coding platforms?

Because these architectures are leaking. They're leaking into content tools, marketing automation, customer service agents, and every AI-powered product your team will adopt in the next eighteen months. The harness pattern isn't just a developer concern. It's the template for how every AI tool you use will handle your data, your workflows, and your institutional knowledge. The commerce technology stack of 2027 won't be a list of apps — it'll be a constellation of AI agents, each running inside a harness that either compounds your team's knowledge or discards it at the end of every session.

The Memory Problem Commerce Can't Afford to Ignore

Here's where the harness divergence gets genuinely dangerous for commerce teams.

Anthropic's engineering team framed their core problem vividly: imagine a software project staffed by engineers working in shifts, where each new engineer arrives with zero memory of what happened on the last shift. That's what happens when an AI agent works across multiple context windows. The model is intelligent, but it starts each session from a blank page.

Their solution is structural. Claude Code uses a two-part pattern: an initialiser agent that sets up the project with a structured feature list, progress log, and clean commit history, followed by a coding agent that reads those artefacts at the start of every session, makes incremental progress, and leaves structured notes for the next session. The progress file and git history become the agent's institutional memory. Developers who invest in context files — things like CLAUDE.md or AGENTS.md — build a compounding asset. The more context accumulates, the better every subsequent session works.
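The shift-handoff pattern is easy to picture with a hypothetical progress file. The filename, fields, and entries below are invented for illustration — they are not Claude Code's actual schema — but they show the shape of the artefact the next session inherits:

```markdown
<!-- PROGRESS.md — hypothetical session-handoff file -->
## Feature list
- [x] Product feed export (CSV + JSON)
- [ ] Inventory sync webhook
- [ ] Pricing rules engine

## Progress log
### Session 1
- Implemented feed export; committed with tests passing
- Blocked on webhook: staging API credentials missing

## Notes for next session
- Start with the inventory webhook; credentials now live in .env.example
```

Each session opens by reading this file and the git log, and closes by appending to it — so the "new engineer on shift" arrives with the last shift's notes in hand.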

OpenAI solved the same problem differently. In their landmark harness engineering report, they described building a million-line internal product over five months using only Codex agents — zero lines of manually written code, roughly 1,500 pull requests, initially driven by just three engineers. In their harness, anything not in the repository is illegible to the agent and therefore doesn't exist. Architectural decisions, product principles, alignment threads — all of it gets encoded as documentation within the codebase itself. They tried the "one big AGENTS.md file" approach and it failed spectacularly. When everything is marked important, nothing is. Instead, they built a progressive disclosure system of focused, cross-linked documentation that the agent navigates as needed.

One harness makes the agent remember. The other makes the codebase remember.

For commerce teams, this distinction matters enormously. If your agency has spent six months building context files, custom skills, workflow documentation, and MCP connectors around Claude Code, that institutional knowledge doesn't transfer to Codex. It's not just that you'd need to learn new commands. You'd need to rebuild every compounding layer of automation from scratch, in an architecture that may not even support the same abstractions.

Calvin French-Owen, who helped launch the Codex web product and now uses both tools extensively, documented this compounding effect in practice. He started with a simple /commit skill — just telling the model to commit consistently. Then he needed agents working in separate worktrees, so he added /worktree. Then he noticed he always planned first, so he added /implement. Eventually he had six layers of workflow automation, each built on the previous one, each specific to Claude Code's harness architecture — its skill system, its context forking, its sub-agent model.
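French-Owen's first layer is roughly what a custom slash command looks like in Claude Code: a markdown file in the project's `.claude/commands/` directory, whose contents become the prompt when the command is invoked. The specific instructions below are a hypothetical sketch, not his actual skill file:

```markdown
<!-- .claude/commands/commit.md — invoked as /commit -->
Review the staged and unstaged changes. Group related changes into
logical commits and write conventional commit messages (feat:, fix:,
chore:). Never commit .env files or anything under /secrets.
```

Note how harness-specific this is: the file location, the invocation mechanism, and the way it composes with worktrees and sub-agents all belong to Claude Code's architecture, which is exactly why the stack doesn't port.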

Moving to a different harness didn't mean learning new commands. It meant rebuilding the entire compounding train from scratch.

Now multiply that by every engineer on the team. Every project they've touched. Every markdown file they've accumulated. Every MCP connector they've deployed. Every custom workflow they've refined.

That's the lock-in nobody is pricing.

Why This Is the New Cloud Wars — and Commerce Lost the Last One

If this pattern feels familiar, it should. We've been here before.

In 2010, you could have told an enterprise that AWS and Azure were "basically the same" because they both offered virtual machines and object storage. You would have been technically correct and strategically catastrophic. The organisations that understood how AWS Lambda would reshape application design differently from Azure Functions made the right architectural bets early. Everyone else spent the next decade paying migration consultants.

The AI harness divergence is the same pattern, running faster. And commerce has a particularly bad track record with this kind of decision.

Think about it. How many ecommerce businesses are still paying for Shopify apps they adopted in 2019 because their entire product catalogue workflow is built around them? How many agencies are locked into specific project management tools because five years of process documentation lives there? How many brands chose a CDP in 2021 based on a feature comparison and are now paying six figures annually because migrating their customer segments would take months?

Commerce teams are serial lock-in victims because they optimise for features at the point of purchase and ignore switching costs that compound over time. The AI harness decision is the same trap with higher stakes, because AI tools accumulate institutional knowledge faster than any SaaS product before them.

A Shopify app stores your product data. An AI harness stores your team's working patterns.

The InfoQ analysis of OpenAI's harness engineering approach highlighted a particularly telling detail: Codex replicates whatever patterns exist in the repository, including suboptimal ones. This inevitably leads to drift. OpenAI's solution was to encode "golden principles" into the repo and build automated cleanup processes where background Codex tasks scan for deviations and open targeted refactoring PRs. The repository eventually polices itself. That's elegant — and it's also an enormous amount of infrastructure that only works within Codex's harness philosophy. Try to port it to Claude Code's terminal-native, context-file-driven architecture and you're essentially starting over.
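OpenAI's actual cleanup loop is driven by Codex itself, but the underlying idea — mechanically scanning a repository for deviations from an encoded principle, then turning each hit into a targeted task — can be sketched in a few lines. The "golden principle" and file layout here are invented for illustration:

```python
import re
from pathlib import Path

# Hypothetical golden principle: no module may import the legacy
# payment client directly; all access goes through gateway.py.
BANNED = re.compile(r"^\s*(import legacy_payments|from legacy_payments)", re.M)

def find_deviations(repo_root: str) -> list[str]:
    """Return paths of Python files that violate the principle."""
    violations = []
    for path in Path(repo_root).rglob("*.py"):
        if path.name == "gateway.py":  # the one sanctioned call site
            continue
        if BANNED.search(path.read_text(encoding="utf-8")):
            violations.append(str(path))
    return sorted(violations)
```

In OpenAI's setup, each deviation would become a background agent task that opens a focused refactoring PR — the scan is cheap; the self-policing loop around it is the infrastructure that doesn't port between harnesses.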

This compounding hits commerce teams harder than pure software engineering teams, because commerce workflows span more tools, more people, and more operational surface area. An engineering team might have a dozen MCP connectors. A commerce operation running product feeds, ad platforms, email sequences, inventory systems, customer support, and analytics has dozens. Each connector is another tendril binding you to a specific harness architecture.

The Five Fault Lines Commerce Leaders Need to Understand

The architectural gap between AI harnesses isn't one thing. It's at least five things compounding simultaneously in different directions. Commerce leaders don't need to understand the engineering, but they absolutely need to understand the implications.

1. Execution philosophy. Claude Code gives the agent your actual environment — your tools, your scripts, your databases. Codex gives the agent a controlled sandbox with purpose-built tools like Chrome DevTools protocol access and ephemeral observability stacks (VictoriaLogs and VictoriaMetrics spin up per worktree and disappear when done). The practical difference for commerce: if your team has built custom scripts for inventory sync, order management, or pricing rules, Claude Code's harness can use them directly. Codex's harness needs them rebuilt as RPC endpoints or integrated through its app server. Neither approach is wrong, but switching between them means re-engineering your tool layer.

2. State and memory. As discussed: one makes the agent remember, the other makes the codebase remember. For commerce operations that involve long-running context — a seasonal campaign that evolves over weeks, a migration project with dozens of dependencies, a content calendar that builds on previous performance data — the memory architecture determines whether your AI assistant gets smarter over time or starts fresh every Monday morning.

3. Context management. Both companies learned that more context isn't better if it's not curated. Claude Code manages context through compacting (automatically summarising older context) and delegation to sub-agents with their own windows. Codex relies on isolation — each task runs in a clean sandbox, and tasks don't compete for space. For commerce teams running complex, interconnected operations, this determines whether you can ask an agent to "update the pricing strategy based on last month's margin analysis and this week's competitor movements" or whether that request needs to be broken into three separate, context-free tasks.

4. Tool integration. Both platforms support MCP (Model Context Protocol), the emerging open standard for connecting AI agents to external tools. But the implementation philosophies are radically different. Claude Code was built around MCP from day one, with a just-in-time tool retrieval system that keeps the context window lean. Codex uses a bidirectional JSON-RPC harness that exposes tools programmatically. When Composio's testing team tried to get Codex working with Figma and Jira MCPs, they had to build a custom proxy adapter. For commerce teams whose workflows span Shopify, Klaviyo, Google Analytics, Gorgias, and a dozen other platforms, the integration depth beneath the protocol matters as much as the protocol itself.

5. Multi-agent architecture. Claude Code spawns sub-agents that share task lists and can message each other — one builds the API whilst another builds the front end whilst a third writes tests. Codex runs each task in isolated sandboxes, with coordination happening through the codebase itself via git branches. For commerce, this is the difference between an AI that can simultaneously update your product feed, adjust your ad bids based on inventory levels, and modify your email sequences in coordinated fashion, versus an AI that handles each task independently and merges the results. Both work. They work differently. And your team builds processes around whichever one they choose.
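Beneath both MCP implementations in point 4 sits the same wire format: an MCP tool invocation is a JSON-RPC 2.0 message using the protocol's tools/call method. The envelope below follows that shape; the tool name and arguments are invented for a commerce flavour:

```python
import json

# JSON-RPC 2.0 request asking an MCP server to invoke a tool.
# "shopify.get_orders" and its arguments are hypothetical; the
# envelope (jsonrpc / method / params) is the MCP tools/call shape.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "shopify.get_orders",
        "arguments": {"status": "unfulfilled", "limit": 50},
    },
}
wire = json.dumps(request)
```

The envelope is standard; what differs between harnesses is everything around it — how tools are discovered, when their descriptions enter the context window, and where the call actually executes.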

What Commerce Teams Should Actually Do About This

The practical advice here isn't "pick the right harness." Both Claude Code and Codex are excellent tools that will continue improving. Experienced developers are already using both in complementary workflows — Claude Code for planning and orchestration, Codex for implementation with fewer bugs. The practical advice is: understand that you're making an architectural commitment, not a subscription decision, and price it accordingly.

First, audit your current AI tool adoption with harness awareness. What institutional knowledge has your team already accumulated in a specific harness? How many custom workflows, context files, integration connectors, and process adaptations exist? That's your current switching cost. If it's already significant, you need a very compelling reason to change, and "the other model scored 3% higher on a benchmark" isn't one.

Second, separate the model from the harness in every evaluation. When someone on your team says "Claude is better" or "GPT is better," ask: better at what? Better model, or better harness for how we work? The answer matters because model advantages are temporary — they shift with every release cycle. Harness advantages compound — they grow with every week your team uses the tool.

Third, invest in portable institutional knowledge. Regardless of which harness you choose, document your workflows, your decision frameworks, and your operational patterns in formats that aren't harness-specific. A well-written operational playbook in markdown transfers between systems. A stack of Claude-specific skill files does not. The more of your institutional knowledge lives in harness-agnostic formats, the lower your eventual switching cost.

Fourth, watch the convergence — or lack of it. If harnesses start converging (standardising on similar patterns for memory, context management, and tool integration), switching costs will decrease over time and the model comparison becomes more relevant. If harnesses continue diverging — which is what's happening right now, as both Anthropic and OpenAI double down on their respective philosophies — switching costs will increase every quarter, and the harness decision becomes the most consequential technology choice your commerce team makes this year.

Fifth, plan for the harness era in your vendor evaluations. Every new AI tool your commerce team evaluates — whether it's a customer service bot, a content generator, a merchandising assistant, or an analytics agent — is built on a harness. Ask the vendor: where does the model run? What does it have access to? How does it maintain memory between sessions? What happens to our accumulated context if we leave? These questions don't appear on any vendor comparison chart. They should be the first questions you ask.

Finally, stop comparing brains in jars. The model comparison era — which model is smartest, which scores highest, which generates the best copy — is over. We're now in the harness era, where the question isn't how intelligent your AI is but how effectively that intelligence integrates into your actual work. For commerce teams operating across dozens of tools, managing complex workflows, and building institutional knowledge every day, the harness isn't an implementation detail.

It's the decision that determines whether your AI investment compounds or resets to zero every time the market shifts.

And every quarter you wait to understand this, the switching cost goes up.
