With Claude Opus 4.7 joining Gemini at the 1M-token tier, long-context is no longer a differentiator. It’s a checkbox on the comparison table. The interesting question shifts from “can the model handle long inputs” (yes, obviously, congratulations) to “when is long-context the right architecture, and when is retrieval just cheaper.”
The naive answer is that retrieval is always cheaper because you only pay for the tokens you actually fetch. The real answer is, as usual, more interesting and contains more footnotes.
When long-context wins
Three workload shapes consistently favor long-context over retrieval, and they tend to be the ones generating the most agent-demo enthusiasm:
Repository-scale code understanding. When the agent needs to reason across the whole codebase to answer “where should this new function live, and what existing utilities should it use,” retrieval underperforms in the same way every time. The relevant context is structural, not lexical. Top-k similarity search misses the function you needed because it has zero token overlap with the query. Long-context fixes this by brute force: just stuff the codebase in there.
Multi-document synthesis at depth. Legal discovery, M&A diligence, regulatory comparison. The shape of the question requires the model to hold many full documents in working memory simultaneously and find cross-document patterns that no single chunk encodes. RAG can serve the answer. It cannot do the analysis.
Agentic workflows with deep tool-call history. Long-horizon agents accumulate context faster than anyone planning the architecture expected. Truncating or summarizing intermediate state loses signal that turns out to matter three steps later, when the agent has forgotten why it opened the file in the first place. Long-context lets the model keep the whole trace and reason over it.
When retrieval still wins
Almost everything else. Q&A over a knowledge base, customer-support assistants, content recommendation, structured-data lookup. Anything where the relevant context is a small fraction of total available documents.
Retrieval’s economic advantage is real and the long-context discourse is pretending it isn’t. A 1M-token call at flagship pricing runs around $3 in input tokens. Five 50K-token calls with retrieval, 250K tokens in total, run about $0.75 at the same rate. If the workload doesn’t actually need the full 1M, retrieval wins by 4x on pure inference cost, before you even start thinking about latency, which scales near-linearly with context length, in case anyone forgot.
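For anyone who wants to check that 4x, here is the arithmetic as a quick sketch. The flat $3-per-million input rate is simply what "1M tokens for around $3" implies, not a quoted vendor price, and the function name is made up for illustration. Output tokens are ignored.

```python
# Assumed flat input price implied by "1M tokens for around $3" above.
# This is an illustrative figure, not any vendor's published rate.
PRICE_PER_MILLION_INPUT_USD = 3.00


def input_cost(tokens_per_call: int, calls: int = 1) -> float:
    """Input-token cost in USD for `calls` requests of `tokens_per_call` tokens each."""
    return tokens_per_call * calls * PRICE_PER_MILLION_INPUT_USD / 1_000_000


long_context = input_cost(1_000_000)      # one call that fills the whole window
retrieval = input_cost(50_000, calls=5)   # five retrieval-sized calls, 250K tokens total

print(f"long-context: ${long_context:.2f}")              # long-context: $3.00
print(f"retrieval:    ${retrieval:.2f}")                 # retrieval:    $0.75
print(f"ratio:        {long_context / retrieval:.1f}x")  # ratio:        4.0x
```

If a vendor charges a higher per-token rate for very long prompts, the gap widens. That case is not modeled here.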
The hybrid that’s actually winning
The shops doing this well are running both, because the people in the room who pay the AWS bill demanded it. Default to retrieval for narrow queries. Escalate to long-context when the retrieval scores suggest the relevant context is dispersed and the answer requires synthesis. The router lives at the orchestration layer, not inside the model.
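A minimal sketch of what such a router might look like, assuming the retriever returns a similarity score in [0, 1] per hit. The `Hit` type, the thresholds, and the "many documents, no dominant match" test for dispersion are all hypothetical choices for illustration. Real routers would tune these against their own traffic.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # assumed similarity score in [0, 1], higher is better


# Hypothetical thresholds, chosen only to make the example concrete.
RELEVANT_SCORE = 0.50    # hits at or above this count as relevant
STRONG_SCORE = 0.75      # a hit at or above this counts as a dominant match
MAX_FOCUSED_DOCS = 3     # more distinct relevant documents than this is "dispersed"


def route(hits: list[Hit]) -> str:
    """Return "retrieval" or "long_context" for one query's retrieval hits."""
    relevant = [h for h in hits if h.score >= RELEVANT_SCORE]
    distinct_docs = {h.doc_id for h in relevant}

    # Default path: the relevant context sits in a handful of documents (or none).
    if len(distinct_docs) <= MAX_FOCUSED_DOCS:
        return "retrieval"

    # Many relevant documents, but one clearly dominant match: still a narrow query.
    if max(h.score for h in relevant) >= STRONG_SCORE:
        return "retrieval"

    # Evidence is spread thinly across many documents, so escalate.
    return "long_context"


narrow = [Hit("faq", 0.90), Hit("faq", 0.80), Hit("blog", 0.30)]
dispersed = [Hit(doc, 0.60) for doc in ("a", "b", "c", "d", "e")]

print(route(narrow))     # retrieval
print(route(dispersed))  # long_context
```

The point of keeping this outside the model is that the decision is cheap, inspectable, and tunable without touching prompts.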
This is the architecture pattern worth watching. The orchestration layer (LangChain’s successor generation, the agent frameworks coming out of the major labs, toolkits like the Vercel AI SDK) is where the next round of value capture happens, because it’s where the routing decision lives, and the routing decision is the part with leverage.
What to track
- Per-effective-token economics. Vendors publish context-length pricing curves. Almost none publish per-effective-token comparisons that account for what fraction of the context actually contributed to the answer. That’s the number to ask for.
- Latency-aware routing. The long-context tax on latency can reach 5-10x at the upper end of the context window, compared with a retrieval-sized prompt. Routers that account for latency, not just dollar cost, will hold user trust longer than the ones that don’t.
- Cache-friendly architectures. Prompt caching changes the math significantly when the long context is stable across calls (a fixed codebase, a fixed corpus). Whether cache hit rates hold in real production workflows is the next data point worth watching. Anecdotally, they hold less often than the vendors imply.
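Two of those numbers are easy to sketch, under loud assumptions. `used_fraction` (the share of the context that actually contributed to the answer) is something you would have to estimate yourself, for example from citations or attribution, and no vendor reports it. The $3-per-million price and the 0.1 cached-read multiplier are hypothetical placeholders, and cache-write surcharges are ignored.

```python
def cost_per_effective_token(total_cost_usd: float, context_tokens: int,
                             used_fraction: float) -> float:
    """USD per context token that actually contributed to the answer."""
    if not 0 < used_fraction <= 1:
        raise ValueError("used_fraction must be in (0, 1]")
    return total_cost_usd / (context_tokens * used_fraction)


def blended_input_cost(context_tokens: int, hit_rate: float,
                       price_per_million: float = 3.00,       # hypothetical price
                       cached_read_multiplier: float = 0.1    # hypothetical discount
                       ) -> float:
    """Expected input cost per call when `hit_rate` of calls read the context from cache."""
    full = context_tokens * price_per_million / 1_000_000
    return hit_rate * full * cached_read_multiplier + (1 - hit_rate) * full


# Same dollar figures as the $3 vs. $0.75 comparison, with made-up usage fractions.
long_ctx = cost_per_effective_token(3.00, 1_000_000, used_fraction=0.05)
retrieved = cost_per_effective_token(0.75, 250_000, used_fraction=0.50)
print(f"long-context: ${long_ctx * 1e6:.0f} per million effective tokens")
print(f"retrieval:    ${retrieved * 1e6:.0f} per million effective tokens")

# How much the cache helps depends almost entirely on the hit rate.
for hit_rate in (0.9, 0.3):
    cost = blended_input_cost(1_000_000, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${cost:.2f} per 1M-token call")
```

With these placeholder numbers, a 90% hit rate brings the 1M-token call close to retrieval pricing and a 30% hit rate barely moves it, which is why the production hit rate is the figure to watch.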
The capability war converged. The architecture war is just getting interesting.