AI Agents Need Rails, Not More Chatbots
JPMorgan's Ask DAVID and Vercel's Just Bash point to the same answer: useful agents need governed execution rails, not better chat boxes.

JPMorgan did something unfashionably useful. It showed how an AI agent actually gets work done inside a serious organisation.
Not a mascot. Not a chatbot with a pleasant name. Not a demo where someone asks for a summary and everyone applauds because the paragraph sounds confident. Ask DAVID, the private bank's investment research assistant, is interesting because it exposes the skeleton of the agent era: a supervisor, specialist agents, structured data, unstructured knowledge, analytics, human review, and a reflection check before the answer leaves the building.
That is the bit worth paying attention to. The future of useful AI is not another chat interface. It is work running on rails.
The phrase sounds boring because it is supposed to. Rails are boring in the same way accounting controls are boring, deployment pipelines are boring, and payment authorisation is boring. The boring parts are what let serious work happen without turning every task into a trust fall.
The last 18 months trained companies to ask the wrong question: which model should we use? The better question is: what system surrounds the model so it can touch real work without burning the place down?
The answer is starting to appear from several directions at once. JPMorgan's Ask DAVID shows the financial services version. Anthropic's Model Context Protocol shows the tool-connection version. Vercel Labs' Just Bash shows the execution-scratchpad version. Shopify's internal AI mandate shows the organisational version. Stitch them together and the shape is obvious: agents need governed routes through work. They need rails.
The chatbot made AI approachable. It also trapped AI in the least useful possible metaphor.
Chat is excellent for intent. It is a poor container for operations. Real work is not a conversation. Real work is a chain of decisions, checks, data pulls, transformations, approvals, writes, retries, and audit trails. A human may start it with a sentence, but the value lives in everything after the sentence.
Ask a normal chatbot why revenue dropped yesterday and it will produce a plausible essay. Ask a useful commerce agent the same question and it should do something entirely different. It should check Shopify orders, compare yesterday with the previous Tuesday, pull traffic from GA4, inspect campaign changes, look at inventory availability, check whether a feed disapproval landed, separate observed facts from inference, then tell you what changed and what it recommends doing next.
The first system generates prose. The second system performs work.
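The difference between the two systems can be sketched in a few lines. This is an illustrative Python sketch, not any vendor's implementation: the connector functions are hypothetical stubs, and the point is only the shape, with observed facts kept separate from inference.

```python
# Sketch of the "performs work" pattern: each check returns an observed
# fact; inference and recommendation are labelled separately, never mixed
# in with the facts. All connector functions here are hypothetical stubs.
def fetch_orders_delta():   return {"fact": "orders down 18% vs last Tuesday"}
def fetch_traffic_delta():  return {"fact": "sessions flat vs last Tuesday"}
def fetch_feed_status():    return {"fact": "3 items disapproved this morning"}

def diagnose():
    # Gather facts first, from real systems, before any reasoning happens.
    facts = [check()["fact"] for check in
             (fetch_orders_delta, fetch_traffic_delta, fetch_feed_status)]
    # Inference is clearly marked as inference.
    inference = "traffic held while orders fell: likely a conversion-side cause"
    recommendation = "investigate the feed disapprovals before touching ad spend"
    return {"facts": facts, "inference": inference, "recommendation": recommendation}
```

A chatbot collapses all three of those keys into one fluent paragraph; the operator can never tell which sentence was checked and which was guessed.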
This distinction sounds obvious until you inspect most AI product demos. The interface is polished, the wording is fluent, and the actual operational depth is tissue paper. There is no evidence bundle. No policy gate. No durable memory. No clear boundary between a fact, a guess, and a recommendation. No answer to the simple question every operator eventually asks: where did you get that from?
That is why the JPMorgan example matters. The institution is not asking an assistant to riff on investment products. It is building a domain-specific system for high-stakes research. Billions of dollars sit behind the answers. That forces architectural honesty.
Ask DAVID stands for Data, Analytics, Visualisation, Insights, and Decision-making assistant. The name is charmingly corporate. The architecture is not. It uses a supervisor agent to understand the user's request and route work to specialist agents: structured data, unstructured retrieval, and analytics. It personalises the answer by role. It runs a reflection check. It keeps humans in the loop when accuracy matters.
That is not a chatbot. That is an operating pattern.
The public coverage of Ask DAVID will mostly flatten it into a headline: big bank builds AI research assistant. Fine. But the useful part is underneath.
First, there is a supervisor. This is not a decorative layer. The supervisor is the difference between a tool pile and a system. It decides whether the user is asking a general question, a specific product question, an analytics question, or something that needs human review. Without a supervisor, the agent has no real judgement about how work should be routed.
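The routing step is simple to sketch. The following Python sketch is illustrative only: the agent names, intent tests, and rules are assumptions, not Ask DAVID's internals, and a production supervisor would use a model for intent classification rather than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical specialist agents; names are illustrative.
def structured_data_agent(q): return f"SQL path: {q}"
def unstructured_agent(q):    return f"retrieval path: {q}"
def analytics_agent(q):       return f"analytics path: {q}"

@dataclass
class Route:
    matches: Callable[[str], bool]    # crude intent test (stand-in for a classifier)
    handler: Callable[[str], str]
    needs_human_review: bool = False  # escalation flag the router could honour

ROUTES = [
    Route(lambda q: any(w in q.lower() for w in ("average", "trend", "forecast")),
          analytics_agent),
    Route(lambda q: any(w in q.lower() for w in ("holdings", "price", "position")),
          structured_data_agent),
    # Fallback: retrieval over notes, research documents, presentations.
    Route(lambda q: True, unstructured_agent),
]

def supervise(question: str) -> str:
    for route in ROUTES:
        if route.matches(question):
            return route.handler(question)
    raise RuntimeError("unreachable: fallback route always matches")
```

Even this toy version shows why the layer matters: without it, every question lands in the same generic agent regardless of whether it needs a query, a retrieval, or a calculation.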
Second, the system separates structured and unstructured work. That matters. Asking a database a question is not the same as retrieving meaning from meeting notes, presentations, emails, and research documents. One wants careful querying. The other wants retrieval, citation, freshness checks, and synthesis. Throwing both into one generic agent is how you get plausible nonsense.
Third, analytics is treated as its own capability. This is the part many companies will miss. A serious agent does not just fetch data. It runs analysis. It calls models. It performs calculations. It creates visualisations. It explains assumptions. Sometimes it should generate code, but only in a controlled environment and usually with supervision.
Fourth, the answer is personalised. The same fact should not be presented in the same way to every person. A due diligence specialist may need the full trail. A client advisor may need the short version with the risk clearly explained. A founder wants the commercial implication. A support lead wants the customer-handling implication. Role matters.
Fifth, reflection appears before final output. This is subtle and crucial. A policy gate asks whether the action is allowed. A reflection gate asks whether the answer is any good. Is it grounded? Are the sources strong enough? Is the confidence high enough? Is there missing evidence? Should the system try again, ask a human, or refuse to act?
Most AI products blur these checks. Ask DAVID separates them. That is the serious version.
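The separation between the two gates can be made concrete. A minimal sketch, assuming illustrative thresholds and field names; the allowed-action set and confidence numbers are assumptions, not anyone's published policy.

```python
from dataclasses import dataclass, field

@dataclass
class Draft:
    answer: str
    sources: list = field(default_factory=list)  # evidence backing the answer
    confidence: float = 0.0                      # self-reported confidence
    action: str = "respond"                      # what the agent wants to do

# Policy gate: is the action allowed at all? (permissions, not quality)
ALLOWED_ACTIONS = {"respond", "draft_email"}

def policy_gate(draft: Draft) -> bool:
    return draft.action in ALLOWED_ACTIONS

# Reflection gate: is the answer any good? (grounding and confidence)
def reflection_gate(draft: Draft, min_sources: int = 2,
                    min_confidence: float = 0.8) -> str:
    if len(draft.sources) < min_sources:
        return "retry"       # go gather more evidence
    if draft.confidence < min_confidence:
        return "ask_human"   # escalate instead of guessing
    return "pass"
```

Note that a draft can clear the policy gate and still fail reflection: the action was permitted, but the evidence behind the answer was too thin to ship.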
There is a reason financial services finds this shape before many software companies do. Regulated industries cannot pretend that fluency equals correctness. They have been living with audit, approval, and control requirements for decades. The agent industry is rediscovering old operational truths and calling them new architecture.
Vercel Labs' Just Bash looks like a small developer tool: a virtual bash environment with an in-memory filesystem, written in TypeScript for AI agents. That undersells it.
Just Bash gives agents a controlled scratchpad for real work. It supports common Unix commands, pipes, redirection, variables, loops, JSON processing with jq, YAML processing, SQLite, optional Python, optional JavaScript, and explicit network controls. The default filesystem is in memory. Overlay mode can read from a real project while keeping writes virtual. Network access is off unless configured. Execution limits protect against runaway scripts.
None of that sounds magical. That is precisely why it matters.
Useful agents need somewhere to inspect evidence. They need to compare files, validate imports, transform JSON, aggregate CSVs, run small calculations, test connector outputs, and produce artefacts a human can inspect. A virtual shell is not glamorous. It is operationally useful.
Consider a merchant asking whether 200 product updates are safe to publish. A weak AI system reads the proposed changes and says they look good. A better system puts the current catalogue and proposed catalogue into a sandbox, runs diffs, checks required fields, flags duplicate SKUs, finds missing metafields, searches for banned claims, compares prices against margin rules, and only then recommends whether to proceed.
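The checks in that better system are exactly the kind of mundane, verifiable work a sandbox is for. A sketch of what the agent might run inside it; the field names, banned phrases, and margin floor are all assumptions for illustration.

```python
# Pre-publish checks a sandboxed agent might run over a proposed catalogue
# update. Required fields, banned claims, and the margin floor are assumptions.
REQUIRED_FIELDS = {"sku", "title", "price"}
BANNED_CLAIMS = ("cures", "guaranteed results")
MIN_MARGIN = 0.10  # 10% floor, illustrative

def check_catalogue(proposed: list[dict], costs: dict[str, float]) -> list[str]:
    problems = []
    seen = set()
    for item in proposed:
        sku = item.get("sku", "<missing>")
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append(f"{sku}: missing fields {sorted(missing)}")
            continue
        if sku in seen:
            problems.append(f"{sku}: duplicate SKU")
        seen.add(sku)
        if any(claim in item["title"].lower() for claim in BANNED_CLAIMS):
            problems.append(f"{sku}: banned claim in title")
        cost = costs.get(sku)
        if cost is not None and (item["price"] - cost) / item["price"] < MIN_MARGIN:
            problems.append(f"{sku}: price {item['price']} breaks margin rule")
    return problems  # empty list means safe to recommend publishing
```

The output is an evidence bundle, not an opinion: each problem line names the SKU and the rule it broke, and a human can inspect every finding before anything is published.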
Or consider a support team asking what customers complained about this week. The agent can pull tickets, normalise the JSON, cluster repeated phrases, count product mentions, redact personal data, and produce a grounded summary. The shell commands themselves become part of the evidence trail. The operator can see what happened.
This is not about nostalgia for terminals. It is about giving the agent a safe place to do the kind of mundane work that makes answers trustworthy.
There is also a strategic point here. The shell is a universal work surface. It is not tied to one SaaS vendor's idea of a workflow. Files go in. Commands run. Evidence comes out. For agents, that is a useful primitive.
A rail is not a prompt. It is not a Zapier-style if-this-then-that. It is not a generic workflow diagram.
A rail is a governed path through recurring work. It defines the trigger, the context required, the tools allowed, the decision rules, the evidence expected, the approval threshold, the execution route, and the audit trail.
In commerce, a rail might be: when inventory drops below 50 units and return on ad spend is above 3x, check supplier lead time, inspect margin, propose a reorder, and increase spend only after approval. Another rail might be: every morning, inspect yesterday's trading, identify the three biggest anomalies, explain likely causes, and propose actions. Another might be: when a Google Merchant Centre feed item is disapproved, diagnose the cause, rewrite the product field, and queue the change for review.
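A rail can be written down as data, which is what makes it governable. A minimal sketch of the reorder rail above; every field value is an illustrative assumption, and real execution would sit behind scheduling, connectors, and an approval queue.

```python
from dataclasses import dataclass, field
from typing import Callable

# A rail as data: one governed path through recurring work.
@dataclass
class Rail:
    trigger: str                  # when the rail runs
    context: list[str]            # data it must load first
    tools_allowed: list[str]      # hard boundary on what it may call
    decision_rule: Callable[[dict], bool]
    evidence_required: list[str]  # what the output must cite
    audit_log: list[dict] = field(default_factory=list)

reorder_rail = Rail(
    trigger="inventory < 50 units and ROAS > 3x",
    context=["supplier_lead_time", "margin"],
    tools_allowed=["shopify.read", "ads.read"],  # note: no write access
    decision_rule=lambda facts: facts["inventory"] < 50 and facts["roas"] > 3.0,
    evidence_required=["inventory_snapshot", "roas_report"],
)

def run(rail: Rail, facts: dict) -> str:
    fired = rail.decision_rule(facts)
    rail.audit_log.append({"facts": facts, "fired": fired})  # every run is recorded
    if not fired:
        return "no_action"
    return "propose_reorder_pending_approval"  # spend changes wait for a human
```

Because the trigger, tool boundary, and evidence requirements are explicit fields rather than prompt text, they can be reviewed, versioned, and audited like any other piece of operating infrastructure.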
The rail is the difference between an agent that can do a task once and an agent that can become part of the operating rhythm of a business.
This is where most AI automation products remain too shallow. They sell tasks. Businesses need repeatable operational judgement. The difference is enormous.
A task says: summarise this report.
A rail says: every weekday at 8am, compare trading performance against expected patterns, check whether any variance is explained by stock, traffic, campaigns, pricing, feed status, or support issues, produce an evidence-backed briefing, and escalate only the decisions that need a human.
The second version has memory. It has tolerance thresholds. It knows what normal looks like. It knows which systems matter. It knows when not to interrupt. It compounds.
That compounding is the commercial prize. A merchant with one agent demo has a curiosity. A merchant with 15 daily rails has a new operating layer. The switching cost is not the software subscription. It is the accumulated operational understanding embedded in those rails.
The industry is also quietly solving the connection problem. Anthropic's Model Context Protocol gave developers a standard way to expose tools and context to AI systems. Anthropic later described how code execution with MCP can reduce context overhead dramatically by letting agents call tools through a code layer instead of carrying every tool definition in the prompt.
OpenAI's Agents SDK documentation now discusses MCP too. The point is not vendor politics. The point is convergence. Agents need standardised access to tools, data, and execution environments. Everyone serious is moving there.
But tool access alone is not enough. Giving an agent 200 tools does not make it operationally competent. It often makes it worse. The hard part is knowing which tool to use, when to use it, what evidence to require, and when to stop.
This is why the supervisor pattern matters. Tool catalogues need routing. Routing needs policy. Policy needs audit. Audit needs evidence. Evidence needs reflection. Reflection needs human review when the machine is out of its depth.
That chain is the product. Not the chat window.
Shopify's AI stance points in the same direction from a different angle. Tobi Lütke's leaked and then confirmed memo reportedly made AI usage a baseline expectation and asked teams to prove AI could not do the work before asking for more headcount. CNBC covered the blunt version: prove AI cannot do jobs before requesting more people. TechCrunch framed it as teams needing to consider AI before growing headcount.
Whether one likes the management tone is beside the point. The assumption has changed. AI is no longer treated as a side tool. It is becoming part of the default production function of the company.
Once that happens, the question moves from adoption to control. How do you make AI work repeatably? How do you know what it did? How do you stop it from taking actions it should not take? How do you make it better next week?
Again: rails.
Commerce is one of the best test beds for this architecture because the work is messy but bounded.
A merchant's operating context spans product data, orders, customers, inventory, marketing campaigns, ads, analytics, support tickets, reviews, returns, margin, fulfilment, feeds, discounts, and suppliers. No single system owns the truth. Shopify has some of it. GA4 has some. Klaviyo has some. Gorgias has some. Google Merchant Centre has some. The founder has some in their head. The agency has some in old Slack threads.
The business does not need another dashboard showing more fragments. It needs a synthesis layer that can connect them and act carefully.
Take a simple daily question: what should we do today?
A commerce agent running on rails should know whether revenue is off because traffic fell, conversion fell, average order value fell, ads changed, product availability broke, discounts expired, reviews shifted, or support complaints spiked. It should know which of those are facts and which are hypotheses. It should know which actions are safe to take automatically, which need approval, and which need a human because the evidence is thin.
This is why agentic commerce is not really about shopping bots. Shopping bots are a visible surface. The bigger shift is operational: the merchant gets an always-on execution layer that understands the business well enough to run the repetitive parts and escalate the judgement calls.
That is also why the phrase 'one Wilson per merchant' is more interesting than it first sounds. The value is not a named assistant. The value is a domain-specific operator learning the merchant's business, encoding repeatable work into rails, and connecting safely to the tools that already run the shop.
The end state is not a merchant typing every request into a chat box. The end state is 10, 20, then 50 rails running in the background, with the merchant only pulled in when judgement, taste, risk, or money requires it.
People keep asking whether models are accurate enough for agents. The answer is incomplete because the model is only one part of the system.
A weak system asks a model to be correct from memory. A stronger system constrains the task, retrieves evidence, uses the right tool, checks the result, reflects on the answer, and asks for approval before acting. The model still matters. But it is no longer carrying the entire burden alone.
This is how serious software has always worked. We do not ask one brilliant developer to deploy production code straight from their head. We use tests, linters, code review, staging, observability, rollback, permissions, and incident process. Agents need the same kind of operational wrapper.
The companies that win with AI will not be the ones with the prettiest prompt library. They will be the ones that turn work into governed systems. They will know which parts can be automated, which parts require approval, which parts need human expertise, and which parts should never be touched by a machine.
The Ask DAVID lesson is not that every company needs a banking research bot. The lesson is that useful agents are organised. They have roles. They have memory. They have data boundaries. They have review points. They have evidence. They have a path to action.
The Just Bash lesson is not that every company needs to expose a terminal to its AI. The lesson is that agents need controlled execution environments where they can do small, verifiable pieces of work and leave a trail.
The Shopify lesson is not that every company should copy Shopify's internal memo. The lesson is that AI is becoming part of how companies allocate work. Once that is true, governance stops being compliance theatre and becomes operating infrastructure.
The market likes extremes. Either AI is magic and replaces whole departments overnight, or it is overhyped autocomplete with a cloud bill. Both takes miss the boring middle where the real value is already forming.
The boring middle is where an agent checks 500 rows for malformed metafields. Where it compares three exports and notices a campaign changed the same day conversion dropped. Where it refuses to publish a product update because the claim is unsupported. Where it drafts the refund response but asks for approval because the order value is above the threshold. Where it runs the same morning rail for six months and learns what normal looks like.
That does not make for a cinematic demo. It makes for a business that works better.
The agent companies worth watching will build this middle layer properly. Supervisor agents. Specialist agents. Sandboxed execution. Standard tool connections. Evidence bundles. Reflection gates. Human review. Approval flows. Memory that improves the next run. Rails that turn repeated judgement into durable operations.
Everything else is theatre.
The chat interface will stay. It is a good doorway. But the doorway is not the house.
The house is the operating system underneath: the rails, the tools, the memory, the policies, the evidence, and the controlled execution paths that let AI do work without asking humans to suspend disbelief.
That is where the agent era becomes useful. Not when the chatbot sounds more human. When the work becomes safer, faster, and more repeatable than the human process it replaces.