Everyone paying attention to AI right now is watching the model race. Claude vs. GPT-5.5. Benchmark scores. Context windows. Who hallucinates less.
I want to talk about a different race, the one that will actually determine which companies build durable AI systems and which ones spend the next two years rearchitecting.
It’s the race for the agent runtime layer. And in the last three weeks, it got very real.
The layer nobody had formally named
In February 2026, a paper posted to arXiv introduced a concept called “AI Runtime Infrastructure.” The abstract was precise about what it was and wasn’t:
Existing infrastructure has addressed adjacent concerns: model serving, orchestration frameworks, post-hoc observability. None of these address failures, inefficiencies, and safety risks that emerge during agent execution.
The paper proposed a distinct execution-time layer that sits above the model and below the application, actively observing, reasoning over, and intervening in agent behavior while the agent is running. Not before it starts. Not after it finishes. During.
The argument was that most costly agent failures happen at runtime, after planning has begun, and outside the scope of static orchestration or offline analysis. You can have the smartest model and the cleanest prompt and still watch your agent spiral because nothing is watching the execution itself.
This layer, the paper argued, is the missing piece for production-grade agents. It needed to be formalized as infrastructure, not bolted on as application logic.
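To make "during, not before or after" concrete, here is a minimal sketch of an execution-time supervisor. It is not taken from the paper: `Step`, `RuntimeSupervisor`, and the two thresholds are hypothetical names chosen for illustration. The supervisor sees each step as it is about to run and can halt the agent when a step budget is exhausted or when the same tool call keeps repeating.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(frozen=True)
class Step:
    """One agent action observed at execution time (hypothetical shape)."""
    tool: str
    args: str


@dataclass
class RuntimeSupervisor:
    """Watches steps as they happen and decides whether to intervene."""
    max_steps: int = 50      # assumed budget, not a recommended value
    max_repeats: int = 3     # identical calls tolerated before stopping
    seen: Counter = field(default_factory=Counter)
    count: int = 0

    def observe(self, step: Step) -> Optional[str]:
        """Return an intervention reason, or None to let the step run."""
        self.count += 1
        self.seen[step] += 1
        if self.count > self.max_steps:
            return "step budget exceeded"
        if self.seen[step] > self.max_repeats:
            return f"repeated call to {step.tool}"
        return None


def run(steps: Iterable[Step], supervisor: RuntimeSupervisor) -> tuple[int, Optional[str]]:
    """Execute steps until they run out or the supervisor intervenes."""
    executed = 0
    for step in steps:
        reason = supervisor.observe(step)
        if reason is not None:
            return executed, reason  # intervene before the step runs
        executed += 1  # a real runtime would dispatch the tool call here
    return executed, None
```

An agent that issues the same search five times in a row would be stopped after the third call. A real runtime layer would do more than halt (replan, roll back, escalate), but the check sits in the same place: inside the execution loop.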
Three months later, it’s the main competitive battlefield between the two largest AI companies in the world.
What happened in the last three weeks
April 8: Anthropic launches Claude Managed Agents into public beta. Checkpointing, scoped permissions, credential management, long-running session support. Not a new model. Infrastructure primitives for running agents in production.
April 23: OpenAI ships GPT-5.5. The positioning is unusual. Sam Altman’s team describes it not as a smarter chatbot but as an agent runtime, built to hold long task state, chain tool calls without losing context, and run multi-step workflows autonomously.
Two different companies, two different bets on what the runtime layer should be.
May 6: Anthropic goes further. At the Code with Claude developer event in San Francisco, Chief Product Officer Ami Vora ships three new features for Managed Agents that collectively represent the most significant push toward execution-time infrastructure any major AI company has made to date.
What Anthropic actually shipped on May 6
Dreaming (research preview): A scheduled background process that runs between agent sessions, reviews up to 100 past transcripts, identifies recurring patterns, and writes new memory entries that the next session can use. Anthropic compares it to hippocampal memory consolidation, the way the brain replays the day’s events during sleep and decides what to keep. Memory lets an agent capture what it learns during a run. Dreaming refines that memory between runs, and pulls shared learnings across agents in a multi-agent system.
Outcomes (public beta): You write a rubric describing what success looks like. A separate Claude instance, running in its own context window, evaluates the agent’s output against that rubric independently of the agent’s own reasoning. If the output fails, the grader identifies exactly what needs to change, and the agent takes another pass. The loop continues until the output meets the bar. In Anthropic’s internal benchmarks, Outcomes lifted task success by up to 10 percentage points on harder problems.
Multiagent orchestration (public beta): A lead agent breaks a complex job into chunks and delegates each to a specialist subagent with its own model, prompt, and tools. Up to 20 specialists can run in parallel on a shared filesystem. The whole flow is traceable in the Claude Console.
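The fan-out pattern in the orchestration feature can be sketched with nothing but the standard library. This is an illustration of the shape, not the Managed Agents API: `specialist` and `lead_agent` are hypothetical names, the specialist is a stand-in for a real model call, and the only detail borrowed from the announcement is the cap of 20 parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_PARALLEL = 20  # the parallelism cap described above


def specialist(chunk: str) -> str:
    """Stand-in for a subagent; a real one would call a model with its own prompt and tools."""
    return f"summary({chunk})"


def lead_agent(job: list[str], worker: Callable[[str], str] = specialist) -> list[str]:
    """Delegate each chunk to a worker and collect results in input order."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # map() preserves input order even when workers finish out of order
        return list(pool.map(worker, job))
```

Calling `lead_agent(["log-a", "log-b"])` returns one result per chunk, in the order the chunks were submitted. Everything the sketch leaves out (per-specialist configuration, the shared filesystem, tracing) is the part the managed runtime is offering to own.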
The results from early pilots are not abstract. Harvey, the legal AI company, saw task completion rates climb roughly sixfold after enabling dreaming. Netflix used multiagent orchestration to process build logs from hundreds of pipelines in parallel. Wisedocs cut document review time by 50 percent.
The framing that matters
There is a category error happening in how most people are talking about this.
GPT-5.5 is a smarter engine. Claude Managed Agents is a more reliable set of rails. Both matter. But they are solving different problems, and treating them as equivalent is like comparing a faster CPU to a better operating system.
The model war is quarterly. Every few months, a new benchmark arrives and the leaderboard shuffles. Infrastructure lock-in compounds over years. The teams that build on top of a runtime layer, rather than building around it or ignoring it, will have architectural advantages that don’t reset when the next model ships.
I keep coming back to an analogy from backend systems. Kubernetes didn’t win because containers were novel. Containers existed for years. Kubernetes won because it was the first time infrastructure matched how teams actually operated at scale. The control plane was the missing piece, and once it existed, the whole model of how you deployed software shifted permanently.
AI is at the same inflection point. The central question for production systems in 2026 is not which model is smartest. It is who owns the control plane for agent execution.
What this means if you’re a backend engineer building on agents
The runtime layer is no longer something you have to build yourself from scratch. Managed primitives exist now: persistent memory, self-grading loops, parallel subagent dispatch. Three months ago you were assembling those by hand if you wanted them at all.
But the flip side is that choosing your runtime layer is now an architectural commitment, not a tooling preference.
Choosing Anthropic’s Managed Agents means accepting their memory model, their permission architecture, their orchestration primitives. Choosing OpenAI’s Responses API and GPT-5.5 means betting on a different vision of what the runtime should look like. Rolling your own means taking on the engineering cost and the organizational burden of maintaining it as the space evolves under you.
None of these are obviously wrong. But they are not equivalent, and the engineers who understand the tradeoffs will make better decisions than the ones treating it as a vendor preference.
A few specific things worth thinking through before you commit to a runtime layer:
Memory model: Where does agent state live? Who owns it? What happens when a session fails mid-execution? Dreaming is compelling, but it means Anthropic’s infrastructure has access to your agent’s session history. That is a compliance question as much as an engineering one, especially if you are in a regulated industry.
Permission architecture: The PocketOS incident I wrote about earlier this month is the canonical example of what happens when agent permissions are not designed as first-class infrastructure. Managed Agents ships scoped permissions and credential management as primitives. If you are rolling your own runtime, you are also rolling your own blast-radius controls. That is a significant engineering surface area.
Observability: The multiagent orchestration that Anthropic shipped is traceable in the Claude Console. That is the right instinct. But it also means your agent execution traces live in Anthropic’s observability infrastructure, not yours. Whether that is acceptable depends on your requirements.
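For a sense of what "rolling your own blast-radius controls" involves, here is a minimal sketch of the permission point above: a check that runs before every tool call. Everything in it is hypothetical (`Scope`, `authorize`, the tool names, the `/workspace` root), and it is not how Managed Agents implements scoped permissions. It only shows the smallest useful version of the idea: an allow-list of tools plus a path boundary.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical permission scope: allowed tools and a filesystem root."""
    allowed_tools: frozenset[str]
    root: str


class PermissionDenied(Exception):
    """Raised when a tool call falls outside the agent's scope."""


def authorize(scope: Scope, tool: str, path: str) -> None:
    """Raise before the tool runs if the call falls outside the scope."""
    if tool not in scope.allowed_tools:
        raise PermissionDenied(f"tool not allowed: {tool}")
    root = posixpath.normpath(scope.root)
    target = posixpath.normpath(path)  # collapses ".." but does not resolve symlinks
    try:
        inside = posixpath.commonpath([root, target]) == root
    except ValueError:  # mixing absolute and relative paths
        inside = False
    if not inside:
        raise PermissionDenied(f"path outside scope: {path}")
```

With `Scope(frozenset({"read_file"}), "/workspace")`, reading `/workspace/a.txt` passes, while an unlisted tool or a path like `/workspace/../etc/passwd` is rejected. Even this toy version has edge cases (symlinks, relative paths), which is a fair preview of the surface area involved.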
The grader architecture in Outcomes is genuinely interesting from a distributed systems perspective. Running a separate evaluator in its own context window, independent of the agent’s reasoning trajectory, is a clean way to avoid the self-assessment bias problem. The agent cannot grade on a curve because the grader never sees the agent’s reasoning, only the final output and the rubric. This is the same principle as black-box testing: a separate test suite that exercises behavior without access to the implementation.
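The separation is easiest to see in the function signatures. Below is a minimal sketch of a grade-and-retry loop, assuming two stand-in callables in place of model calls. `Agent`, `Grader`, `Verdict`, and `run_until_passing` are hypothetical names, not part of any Anthropic API, and the `max_attempts` cap is my own addition so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what needs to change, empty when passed


# Stand-ins for model calls. The grader's signature is the point:
# it receives only the output and the rubric, never the agent's reasoning.
Agent = Callable[[str, Optional[str]], str]   # (task, feedback) -> output
Grader = Callable[[str, str], Verdict]        # (output, rubric) -> verdict


def run_until_passing(task: str, rubric: str, agent: Agent, grader: Grader,
                      max_attempts: int = 3) -> tuple[str, bool, int]:
    """Return (last output, whether it passed, attempts used)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = agent(task, feedback)      # first pass has no feedback
        verdict = grader(output, rubric)    # independent evaluation
        if verdict.passed:
            return output, True, attempt
        feedback = verdict.feedback         # fed into the next pass
    return output, False, max_attempts
```

The only channel from grader back to agent is the feedback string, and nothing flows the other way except the output itself. That narrow interface is what keeps the evaluation independent.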
The part nobody is saying clearly enough
The February paper made a point that I think gets lost in the product announcements: the runtime infrastructure layer also motivates an entirely new approach to evaluating agentic systems.
Traditional evals summarize outcomes after execution. Did the agent produce the right output? But if the runtime layer is actively intervening during execution, the interesting evaluation questions are different. Did the intervention happen at the right time? Did the recovery cost less than the failure would have? What does the execution trajectory look like for a task the agent eventually gets right versus one it fails on?
This is an open research problem. The tooling for it barely exists. The companies that solve evaluation at the runtime layer, not just at the output layer, will have a measurement advantage that compounds into a quality advantage over time.
That is the part of this space I am most interested in building toward.
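One of the questions above, whether a recovery cost less than the failure would have, can at least be written down as a metric. This is a minimal sketch under a large assumption: that you can estimate the counterfactual cost of a failure you prevented. The field names and the cost unit (tokens, seconds, dollars) are hypothetical, and nothing here comes from the February paper.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    step: int                 # where in the trajectory the runtime stepped in
    recovery_cost: float      # what the intervention itself cost
    est_failure_cost: float   # estimated cost had the run been left to fail


@dataclass
class Trajectory:
    total_steps: int
    succeeded: bool
    interventions: list[Intervention]


def summarize(t: Trajectory) -> dict[str, float]:
    """Trajectory-level metrics that output-only evals cannot see."""
    n = len(t.interventions)
    worthwhile = sum(1 for i in t.interventions
                     if i.recovery_cost < i.est_failure_cost)
    net = sum(i.est_failure_cost - i.recovery_cost for i in t.interventions)
    return {
        "interventions": float(n),
        "worthwhile_fraction": worthwhile / n if n else 0.0,
        "net_savings": float(net),
    }
```

A run with two interventions, one that cost 100 to avoid an estimated 400 and one that cost 250 to avoid an estimated 200, comes out with half its interventions worthwhile and net savings of 250. The arithmetic is trivial. The hard part, and the open problem, is producing a defensible `est_failure_cost`.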
Where this goes
Model quality is no longer a competitive moat. The race is for the control plane: memory, orchestration, policy enforcement, self-improvement loops.
The February paper called it. The market just showed up.
If you are building agents in production today, you are building on top of someone else’s infrastructure decisions whether you realize it or not. The runtime layer exists. The question is whether you are choosing it deliberately or inheriting it by accident.
Choose deliberately.
This is the third post in a series on AI agents in production. The first covered the PocketOS blast-radius incident. The second told the origin story of LogIQ, an AI investigation tool I built after a production debugging failure that took hours longer than it should have.