AI Agent Infrastructure — Petter's wiki

As AI agents scale from developer tools to mass-market assistants, the underlying compute infrastructure faces a fundamental mismatch: the cloud was built for one-to-many applications (one server, many users), but agents are one-to-one (one instance per user per task). This shift demands new infrastructure primitives — lighter-weight execution environments, new identity models, and new economic frameworks — to make agents viable at scale.

The scaling challenge

Cloudflare's analysis of agent scaling math illustrates the problem: if the more than 100 million US knowledge workers each used an agentic assistant at 15% concurrency, that would require capacity for ~24 million simultaneous sessions. At 25–50 users per CPU, that is 500K–1M server CPUs, just for the US and with only one agent per person. Multiple agents per person and global scale leave available compute short by orders of magnitude.
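The sizing arithmetic can be reproduced as a back-of-envelope sketch. The function names are illustrative. The 24 million session figure is taken as given from Cloudflare's estimate: at 15% concurrency it corresponds to roughly 160 million workers, while exactly 100 million workers would give 15 million sessions. The `agents_per_person` multiplier is an assumption added to show how the total scales.

```python
def concurrent_sessions(workers: int, concurrency_pct: int, agents_per_person: int = 1) -> int:
    """Peak simultaneous sessions if concurrency_pct% of all agents are active at once."""
    return workers * agents_per_person * concurrency_pct // 100


def cpus_needed(sessions: int, users_per_cpu: int) -> int:
    """Server CPUs required, rounded up so every session has capacity."""
    return -(-sessions // users_per_cpu)  # integer ceiling division


sessions = 24_000_000  # Cloudflare's US estimate, one agent per person
for density in (50, 25):  # users per CPU: best and worst case from the text
    print(f"{density} users/CPU -> {cpus_needed(sessions, density):,} CPUs")

# Hypothetical: three agents per person triples the session count.
print(f"{concurrent_sessions(160_000_000, 15, agents_per_person=3):,} sessions")
```

The two densities give 480,000 and 960,000 CPUs, which the source rounds to 500K–1M.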

Containers vs. isolates

Traditional container-based approaches (the current default for coding agents) give each agent a full execution environment with filesystem, git, bash, and arbitrary binary execution. This works but is expensive and slow to provision.

V8 isolates (as used by Cloudflare Workers) offer a lighter alternative:

~100x faster startup than containers (milliseconds vs. seconds)
~100x more memory-efficient — megabytes instead of hundreds of megabytes per instance
Ephemeral by design — spin up per-request, tear down immediately
Cheap enough per unit to make non-developer agent use cases economically viable
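To see why memory footprint dominates per-unit cost, the sketch below counts how many instances fit on one host. The host size, reserved overhead, and per-instance figures are hypothetical placeholders chosen to match the "megabytes instead of hundreds of megabytes" ratio above. They are not measured numbers for Workers or any container runtime.

```python
def instances_per_host(host_ram_mb: int, per_instance_mb: int, reserved_mb: int = 0) -> int:
    """How many instances fit in RAM after reserving memory for the host itself."""
    usable = max(host_ram_mb - reserved_mb, 0)
    return usable // per_instance_mb


HOST_RAM_MB = 64 * 1024  # hypothetical 64 GB host
RESERVED_MB = 4 * 1024   # hypothetical OS/runtime overhead

container = instances_per_host(HOST_RAM_MB, 300, RESERVED_MB)  # hypothetical ~300 MB container
isolate = instances_per_host(HOST_RAM_MB, 3, RESERVED_MB)      # hypothetical ~3 MB isolate
print(container, isolate, isolate // container)  # 204 20480 100
```

Memory is only one constraint. CPU time and startup latency also bound density, so real ratios will differ.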

The tradeoff: isolates don't support arbitrary binaries or filesystem access, so coding agents still need containers. The future is likely a hybrid — containers for developer agents, isolates for the mass-market.
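A hybrid deployment implies a dispatch decision per task. This is a minimal sketch, assuming a hypothetical `TaskProfile` that records what a task needs. The rule mirrors the tradeoff above: a filesystem or arbitrary binaries force a container, and everything else runs in an isolate. A real platform would likely weigh more factors, such as memory limits and execution time.

```python
from dataclasses import dataclass
from enum import Enum


class Runtime(Enum):
    ISOLATE = "isolate"
    CONTAINER = "container"


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of what an agent task needs from its sandbox."""
    needs_filesystem: bool = False
    needs_arbitrary_binaries: bool = False  # e.g. git, bash, compilers


def choose_runtime(task: TaskProfile) -> Runtime:
    # Isolates cannot provide a filesystem or run arbitrary binaries,
    # so only tasks that need neither can use the cheaper runtime.
    if task.needs_filesystem or task.needs_arbitrary_binaries:
        return Runtime.CONTAINER
    return Runtime.ISOLATE


coding_agent = TaskProfile(needs_filesystem=True, needs_arbitrary_binaries=True)
email_triage = TaskProfile()  # hypothetical task that only makes a few API calls
print(choose_runtime(coding_agent).value, choose_runtime(email_triage).value)
```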

The "horseless carriage" phase

Current agent infrastructure shows classic early-adoption patterns:

Agents use headless browsers to navigate human-designed websites instead of structured protocols like MCP
Many MCP servers are thin REST API wrappers rather than rethinking the interaction model
CAPTCHAs verify "are you human?" when the real question is "which agent are you, who authorized you, and what are you allowed to do?"
Full containers are spun up for agents that only need a few API calls
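The question that should replace the CAPTCHA ("which agent are you, who authorized you, and what are you allowed to do?") maps naturally onto a small set of claims. This is a minimal sketch, assuming a hypothetical credential that names the agent, the principal who delegated to it, the scopes it was granted, and an expiry. It does not follow any particular standard, and a real system would also verify a signature on the credential, which is omitted here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AgentCredential:
    """Hypothetical claims: which agent, who authorized it, what it may do, until when."""
    agent_id: str
    authorized_by: str
    scopes: FrozenSet[str]
    expires_at: datetime


def is_allowed(cred: AgentCredential, required_scope: str, now: Optional[datetime] = None) -> bool:
    # Signature verification is omitted: this only checks the claims themselves.
    now = now or datetime.now(timezone.utc)
    if now >= cred.expires_at:
        return False
    return required_scope in cred.scopes


cred = AgentCredential(
    agent_id="travel-assistant-7",          # hypothetical identifiers
    authorized_by="user:alice@example.com",
    scopes=frozenset({"calendar:read", "flights:search"}),
    expires_at=datetime.now(timezone.utc) + timedelta(hours=1),
)
print(is_allowed(cred, "flights:search"), is_allowed(cred, "flights:book"))  # True False
```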

New infrastructure needs

Agent identity and authorization — moving beyond human-centric auth models
Agent economics — new payment rails (e.g., HTTP 402 / x402 Foundation) since agents don't see ads or click paywalls
Publisher governance — tools for content owners to set policies for agent interactions
Security built into execution — not bolted on afterward; prompt injection, data exfiltration, and unauthorized access need to be handled at the platform level
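The HTTP 402 (Payment Required) pattern can be sketched as a two-step exchange. The first request is answered with 402 and a price quote, and the client retries with proof of payment. This is a hypothetical illustration only: the `X-Payment-Proof` header, the JSON fields, and the `verify_payment` stub are invented here and are not the x402 wire format. Plain functions stand in for an HTTP server and client.

```python
import json
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)

PRICE_USD = "0.01"  # hypothetical per-request price


def verify_payment(proof: str) -> bool:
    """Stub: a real system would check the proof with a payment network."""
    return proof == "paid-demo-token"


def handle_request(path: str, headers: Dict[str, str]) -> Response:
    proof = headers.get("X-Payment-Proof")  # hypothetical header name
    if proof is None or not verify_payment(proof):
        quote = json.dumps({"price_usd": PRICE_USD, "resource": path})
        return 402, {"Content-Type": "application/json"}, quote
    return 200, {"Content-Type": "text/plain"}, f"content of {path}"


def agent_fetch(path: str, pay: Callable[[dict], str]) -> Response:
    """Client side: on 402, pay the quoted price and retry once."""
    status, headers, body = handle_request(path, {})
    if status == 402:
        token = pay(json.loads(body))
        status, headers, body = handle_request(path, {"X-Payment-Proof": token})
    return status, headers, body


status, _, body = agent_fetch("/articles/42", pay=lambda quote: "paid-demo-token")
print(status, body)  # 200 content of /articles/42
```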

Model routing and embeddings

Claude Desktop now appears to support third-party inference for Cowork and Code, which makes local models and OpenRouter-style backends first-class options for some desktop workflows
Gemini Embedding 2 is now generally available in Gemini API and Vertex AI, which matters because production retrieval systems increasingly need native multimodal embeddings rather than text-only vectors
The infrastructure question shifts from “which model?” to “which routing, embedding, and serving layer can absorb model churn without rewriting the app?”
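One way to absorb model churn is to have application code request a logical capability rather than a specific model, with a routing table that can change without touching call sites. This is a minimal sketch: the route names, model IDs, and backend callables are hypothetical stand-ins for real provider clients, and falling back to the next backend on error is one possible policy among several.

```python
from typing import Callable, Dict, List

Backend = Callable[[str], str]


class ModelRouter:
    """Maps logical tasks to an ordered list of backends; tries each until one succeeds."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[str]] = {}
        self._backends: Dict[str, Backend] = {}

    def register_backend(self, model_id: str, fn: Backend) -> None:
        self._backends[model_id] = fn

    def set_route(self, task: str, model_ids: List[str]) -> None:
        # Swapping models is a routing change here, not an application change.
        self._routes[task] = list(model_ids)

    def complete(self, task: str, prompt: str) -> str:
        errors = []
        for model_id in self._routes[task]:
            try:
                return self._backends[model_id](prompt)
            except Exception as exc:  # record the failure, then try the next backend
                errors.append(f"{model_id}: {exc}")
        raise RuntimeError(f"all backends failed for {task!r}: {errors}")


def flaky_remote(prompt: str) -> str:
    raise TimeoutError("upstream timeout")


router = ModelRouter()
router.register_backend("remote-large", flaky_remote)
router.register_backend("local-small", lambda p: f"[local] {p}")
router.set_route("summarize", ["remote-large", "local-small"])
print(router.complete("summarize", "Quarterly report"))  # [local] Quarterly report
```

The same indirection helps with embeddings, with one caveat: vectors from different embedding models are not comparable, so switching embedding models generally means re-embedding the stored corpus rather than only changing a route.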

Portable, tool-agnostic agent memory

A second-order infrastructure primitive emerging from the agent ecosystem is shared long-term memory that is not locked to any single tool. brain (codejunkie99/brain) is an open-source (Apache-2.0, Rust) example: it gives Claude Code, Cursor, Codex, OpenClaw, Hermes, and any MCP-capable client one shared local memory, stored as git commits in ~/.brain, indexed for search, and exposed through a CLI, TUI, and MCP server.

Its design choices are notable as a pattern: memory is a plain git event log (the source of truth) with a rebuildable SQLite search index on top; sync is explicit (brain remote add / push / pull) rather than cloud-default; and onboarding writes managed prompt blocks into each agent's config files between BRAIN:START/BRAIN:END markers so re-runs do not duplicate content.

The contrast with frontier-lab managed memory (e.g. Anthropic's session-consolidating "Dreaming") is the locus of control: git-backed local memory keeps provenance auditable and portable across tools, where managed memory trades that for automatic cross-session learning.
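Two of brain's design choices can be sketched in a few lines: an append-only event log as the source of truth with a disposable SQLite index rebuilt from it, and an idempotent upsert of a marker-delimited block into a config file. This is not brain's code or storage format. A JSON-lines list stands in for the git history, the table layout is hypothetical, and the markers are simplified to the bare `BRAIN:START`/`BRAIN:END` strings.

```python
import json
import sqlite3
from typing import List

START, END = "BRAIN:START", "BRAIN:END"


def rebuild_index(event_log: List[str]) -> sqlite3.Connection:
    """Derive a fresh search index from the log; the index can always be thrown away."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE memory (seq INTEGER PRIMARY KEY, source TEXT, text TEXT)")
    for seq, line in enumerate(event_log):
        event = json.loads(line)
        db.execute("INSERT INTO memory VALUES (?, ?, ?)", (seq, event["source"], event["text"]))
    return db


def search(db: sqlite3.Connection, term: str) -> List[str]:
    rows = db.execute("SELECT text FROM memory WHERE text LIKE ? ORDER BY seq", (f"%{term}%",))
    return [text for (text,) in rows]


def upsert_managed_block(config: str, body: str) -> str:
    """Replace the text between the markers, or append a new block if none exists."""
    block = f"{START}\n{body}\n{END}"
    start = config.find(START)
    end = config.find(END, start + len(START)) if start != -1 else -1
    if start != -1 and end != -1:
        return config[:start] + block + config[end + len(END):]
    sep = "" if (not config or config.endswith("\n")) else "\n"
    return config + sep + block + "\n"


log = [
    json.dumps({"source": "claude-code", "text": "Project uses pnpm, not npm"}),
    json.dumps({"source": "cursor", "text": "Run tests with pnpm test"}),
]
db = rebuild_index(log)
print(search(db, "pnpm"))

once = upsert_managed_block("# my rules\n", "Consult shared memory first.")
twice = upsert_managed_block(once, "Consult shared memory first.")
print(once == twice)  # re-running onboarding adds nothing
```

Because the index is derived, a stale or corrupted index is repaired by rebuilding it from the log, and the log's history is what keeps provenance auditable.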

Observability for AI-built systems

As AI agents generate and maintain more software, standardized observability becomes critical — humans increasingly debug and operate systems they did not hand-write, and the instrumentation layer becomes the only reliable way to understand runtime behavior. OpenTelemetry, an open-source CNCF framework born from the merger of OpenTracing and OpenCensus, is the de facto standard: vendor-neutral instrumentation (instrument once, export anywhere), unified signals (traces, metrics, logs correlated via shared context), the OTLP wire protocol, auto-instrumentation for popular frameworks, a collector pipeline with 200+ components, and native SDKs for 12+ languages. Tracing and metrics APIs are production-stable.
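The shared context that correlates signals travels between services in a request header. OpenTelemetry SDKs propagate it by default with the W3C Trace Context `traceparent` header. The stdlib-only sketch below generates and validates that header to show what is actually propagated. It handles only the version-00 layout, the helper names are illustrative, and in practice the SDK's propagator does this work.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str  # 16 bytes as 32 lowercase hex chars
    span_id: str   # 8 bytes as 16 lowercase hex chars
    sampled: bool


def _nonzero_hex(nbytes: int) -> str:
    while True:
        value = secrets.token_hex(nbytes)
        if int(value, 16) != 0:  # all-zero IDs are invalid
            return value


def new_root_context(sampled: bool = True) -> SpanContext:
    return SpanContext(_nonzero_hex(16), _nonzero_hex(8), sampled)


def child_context(parent: SpanContext) -> SpanContext:
    # A downstream span keeps the trace ID and gets a fresh span ID.
    return SpanContext(parent.trace_id, _nonzero_hex(8), parent.sampled)


def format_traceparent(ctx: SpanContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parent context, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
child = child_context(parent)
print(format_traceparent(child))  # same trace ID, new span ID
```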

The implication for agent infrastructure: any platform serving AI-built or AI-operated systems at scale needs OTel-compatible instrumentation built in, not bolted on — otherwise debuggability collapses as human authorship thins out.