As AI agents scale from developer tools to mass-market assistants, the underlying compute infrastructure faces a fundamental mismatch: the cloud was built for one-to-many applications (one server, many users), but agents are one-to-one (one instance per user per task). This shift demands new infrastructure primitives — lighter-weight execution environments, new identity models, and new economic frameworks — to make agents viable at scale.
The scaling challenge
Cloudflare's analysis of agent scaling math illustrates the problem: if 100 million US knowledge workers each used an agentic assistant at 15% concurrency, it estimates that would require capacity for ~24 million simultaneous sessions. (Taken literally, 15% of 100 million is 15 million; 24 million at that rate corresponds to a base of roughly 160 million users.) At 25–50 users per CPU, 24 million sessions means roughly 500K–1M server CPUs, and that is just for the US with one agent per person. Multiple agents per person and global scale widen the gap between demand and available compute to orders of magnitude.
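The arithmetic can be checked in a few lines. This is a minimal sketch that takes the ~24 million session figure and the 25–50 sessions-per-CPU range as given; the function names are illustrative and not taken from Cloudflare's analysis.

```python
def concurrent_sessions(users: int, concurrency_pct: int) -> int:
    """Peak simultaneous sessions for a user base at a whole-percent concurrency."""
    return users * concurrency_pct // 100


def cpus_needed(sessions: int, sessions_per_cpu: int) -> int:
    """CPUs required to host `sessions`, rounded up to whole CPUs."""
    return -(-sessions // sessions_per_cpu)  # ceiling division on integers


# 15% of 100M users is 15M sessions; ~24M sessions corresponds to ~160M users.
sessions = 24_000_000
low = cpus_needed(sessions, 50)   # denser packing, fewer CPUs
high = cpus_needed(sessions, 25)  # lighter packing, more CPUs
print(f"{low:,} to {high:,} CPUs")  # 480,000 to 960,000 CPUs
```

The 480K–960K result matches the rounded 500K–1M range quoted above.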
Containers vs. isolates
Traditional container-based approaches (the current default for coding agents) give each agent a full execution environment with filesystem, git, bash, and arbitrary binary execution. This works but is expensive and slow to provision.
V8 isolates (as used by Cloudflare Workers) offer a lighter alternative:
- ~100x faster startup than containers (milliseconds vs. seconds)
- ~100x more memory-efficient — megabytes instead of hundreds of megabytes per instance
- Ephemeral by design — spin up per-request, tear down immediately
- Make per-unit economics viable for non-developer agent use cases
The tradeoff: isolates don't support arbitrary binaries or filesystem access, so coding agents still need containers. The future is likely a hybrid: containers for developer agents, isolates for mass-market agents.
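The hybrid split could be expressed as a simple placement rule. The sketch below is hypothetical: `TaskNeeds`, `Runtime`, and `choose_runtime` are invented for illustration and do not correspond to any platform API. It only encodes the tradeoff stated above, assuming a task can declare its requirements up front.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    ISOLATE = "isolate"      # fast startup, small memory footprint
    CONTAINER = "container"  # full environment: filesystem, git, bash, binaries


@dataclass(frozen=True)
class TaskNeeds:
    """Hypothetical declaration of what an agent task requires."""
    needs_filesystem: bool = False
    needs_arbitrary_binaries: bool = False


def choose_runtime(needs: TaskNeeds) -> Runtime:
    """Fall back to a container only when the isolate model cannot serve the task."""
    if needs.needs_filesystem or needs.needs_arbitrary_binaries:
        return Runtime.CONTAINER
    return Runtime.ISOLATE


# A coding agent that shells out to git needs a container;
# an assistant that only makes a few API calls does not.
print(choose_runtime(TaskNeeds(needs_filesystem=True, needs_arbitrary_binaries=True)).value)
print(choose_runtime(TaskNeeds()).value)
```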
The "horseless carriage" phase
Current agent infrastructure shows classic early-adoption patterns:
- Agents use headless browsers to navigate human-designed websites instead of structured protocols like MCP
- Many MCP servers are thin REST API wrappers rather than rethinking the interaction model
- CAPTCHAs verify "are you human?" when the real question is "which agent are you, who authorized you, and what are you allowed to do?"
- Full containers are spun up for agents that only need a few API calls
New infrastructure needs
- Agent identity and authorization — moving beyond human-centric auth models
- Agent economics — new payment rails (e.g., HTTP 402 / x402 Foundation) since agents don't see ads or click paywalls
- Publisher governance — tools for content owners to set policies for agent interactions
- Security built into execution — not bolted on afterward; prompt injection, data exfiltration, and unauthorized access need to be handled at the platform level
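To make the identity and payment items concrete, here is a minimal sketch of a platform-level check that asks "which agent, authorized by whom, allowed to do what" and answers with standard HTTP status codes, including 402 Payment Required. Everything here is an assumption for illustration: `AgentCredential`, `check_request`, the scope strings, and the pricing parameters are hypothetical, and this is not the x402 protocol or any vendor's API.

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AgentCredential:
    """Hypothetical credential: who the agent is, who delegated to it, and for what."""
    agent_id: str
    authorized_by: str
    scopes: FrozenSet[str]


def check_request(
    cred: Optional[AgentCredential],
    action: str,
    *,
    price_cents: int = 0,
    paid: bool = False,
) -> HTTPStatus:
    """Decide how to answer an agent's request before any work is done."""
    if cred is None:
        return HTTPStatus.UNAUTHORIZED      # 401: unknown agent
    if action not in cred.scopes:
        return HTTPStatus.FORBIDDEN         # 403: known agent, not allowed to do this
    if price_cents > 0 and not paid:
        return HTTPStatus.PAYMENT_REQUIRED  # 402: allowed, but the resource is priced
    return HTTPStatus.OK


cred = AgentCredential("agent-123", "alice@example.com", frozenset({"read:articles"}))
print(int(check_request(cred, "read:articles", price_cents=5)))  # 402
```

Keeping the check in front of execution, rather than inside each tool, is one way to read "handled at the platform level".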
Model routing and embeddings
- Claude Desktop now appears to support third-party inference for Cowork and Code, which makes local models and OpenRouter-style backends first-class options for some desktop workflows
- Gemini Embedding 2 is now generally available in the Gemini API and Vertex AI, which matters because production retrieval systems increasingly need native multimodal embeddings rather than text-only vectors
- The infrastructure question shifts from “which model?” to “which routing, embedding, and serving layer can absorb model churn without rewriting the app?”
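One way to picture a routing layer that absorbs model churn is a thin indirection between the app and its backends. The sketch below is an assumption-laden illustration: `complete_with_fallback` and both stub backends are hypothetical and stand in for real local or hosted inference clients, which would need their own error handling and timeouts.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is anything that maps a prompt to a completion.
Backend = Callable[[str], str]


def complete_with_fallback(
    prompt: str, backends: Sequence[Tuple[str, Backend]]
) -> Tuple[str, str]:
    """Try each named backend in order; return (backend_name, completion)."""
    errors: List[str] = []
    for name, backend in backends:
        try:
            return name, backend(prompt)
        except Exception as exc:  # a real router would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def flaky_local_model(prompt: str) -> str:
    """Stub standing in for a local model that is currently unreachable."""
    raise ConnectionError("local model offline")


def hosted_model(prompt: str) -> str:
    """Stub standing in for a hosted inference API."""
    return f"echo: {prompt}"


print(complete_with_fallback("hi", [("local", flaky_local_model), ("hosted", hosted_model)]))
# ('hosted', 'echo: hi')
```

The app depends only on the `Backend` shape, so swapping or reordering models is a configuration change rather than a rewrite.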
Observability for AI-built systems
As AI agents generate and maintain more software, standardized observability becomes critical: humans increasingly debug and operate systems they did not hand-write, and the instrumentation layer becomes the only reliable way to understand runtime behavior. OpenTelemetry (OTel), an open-source CNCF framework born from the merger of OpenTracing and OpenCensus, is the de facto standard. It provides vendor-neutral instrumentation (instrument once, export anywhere), unified signals (traces, metrics, and logs correlated via shared context), the OTLP wire protocol, auto-instrumentation for popular frameworks, a collector pipeline with 200+ components, and native SDKs for 12+ languages. Its tracing and metrics APIs are production-stable.
The implication for agent infrastructure: any platform serving AI-built or AI-operated systems at scale needs OTel-compatible instrumentation built in, not bolted on — otherwise debuggability collapses as human authorship thins out.