MCP is everywhere now — and so is its oldest constraint. How a transparent caching proxy gets any MCP server past the 25,000-token response limit.
- #mcp
- #ai-infrastructure
- #caching
- #open-source
writing
Notes from production — AI systems, performance, and the infrastructure underneath.
MCP is everywhere now — and so is its oldest constraint. How a transparent caching proxy gets any MCP server past the 25,000-token response limit.
How a 24/7 AI agent fleet stays affordable on one subscription: deterministic code handles every tick, and the model only runs on real signals.
My fleet dashboard quietly degraded to 4.18s. The cause: one COUNT(*) full-scanning 258k rows on every load. One index later: ~18ms, flat forever.
A FastAPI service on a fixed 1 vCPU went 1.68 to 69.6 RPS by adding async — before any hardware, workers, or DB tuning. A staged k6 study of throughput.
Choosing an LLM by feel ships regressions you can't see. Picking models with an eval framework instead — latency, cost, accuracy, fit — from production.
A fleet pattern for 24/7 AI agents: one agent per machine as gatekeeper, a star topology, a chat room as the bus, and a subscription instead of metered keys.
A Parquet viewer worked in curl and on github.io, then served zero rows on the custom domain. The culprit: the CDN gzipped the file, breaking Range requests.