The 1M context window on Claude Opus 4.7 has been live for a little over a month. The retrieval-augmented generation pipelines that shops spent two years building have not been deleted. Most of them just got reassigned to the cases where the math still works.
The shape of that reassignment is becoming legible. Single document, single conversation, fewer than roughly 800 pages of dense text? The pipeline got ripped out last week. The team rewrote four functions and shipped a session that drops the whole thing in, asks the question, and returns a cited answer. The token bill went up. The latency went down. The accuracy went up enough that legal stopped flagging the agent during compliance review.
The other side is where RAG kept its job. Corpus retrieval across a million internal documents is still RAG. So is anything that needs to stay snappy enough for an interactive product, where a 200K-token input would blow the latency budget. So is anything where the customer expects per-document citations with stable IDs, because long-context models still cite passages the way a tired graduate student cites them: they gesture toward the source, paraphrase the wrong sentence, and make you check.
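The split described so far can be written down as a small routing rule. This is only a sketch of the heuristics in the text, not anyone's production logic: the `Workload` fields, the `choose_strategy` name, and the exact cutoffs are hypothetical, with the 800-page and 200K-token figures taken from the paragraphs above as rough thresholds.

```python
from dataclasses import dataclass

# Rough thresholds lifted from the prose; treat them as tunable assumptions.
MAX_LONG_CONTEXT_PAGES = 800          # "fewer than roughly 800 pages"
INTERACTIVE_TOKEN_CEILING = 200_000   # input size said to blow an interactive latency budget


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one retrieval workload."""
    pages: int
    single_document: bool
    interactive: bool
    input_tokens: int
    needs_stable_citations: bool


def choose_strategy(w: Workload) -> str:
    """Return "rag" or "long_context" using the article's rules of thumb."""
    if not w.single_document:
        return "rag"  # corpus-scale retrieval stays with RAG
    if w.needs_stable_citations:
        return "rag"  # per-document citations with stable IDs
    if w.interactive and w.input_tokens >= INTERACTIVE_TOKEN_CEILING:
        return "rag"  # too slow for an interactive product
    if w.pages < MAX_LONG_CONTEXT_PAGES:
        return "long_context"
    return "rag"


contract_review = Workload(pages=400, single_document=True, interactive=False,
                           input_tokens=300_000, needs_stable_citations=False)
support_search = Workload(pages=50_000, single_document=False, interactive=True,
                          input_tokens=4_000, needs_stable_citations=True)
print(choose_strategy(contract_review), choose_strategy(support_search))
```

The order of the checks is a design choice: the disqualifying conditions run first, so a workload only lands on long context when nothing in the "RAG kept its job" list applies.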
The pricing decision sits where you would expect. The math is whether the embed-and-retrieve cost amortized across queries beats the per-call long-context cost on the actual access pattern. Customer support over a 50,000-page knowledge base where the bulk of queries hit the same 200 pages? RAG still wins, by a lot. Single-document workflows where the same 400 pages get loaded twice and then dropped? RAG never had a shot, and the 1M context window is a quiet rebuke of the architecture choice every shop made in 2024.
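That comparison reduces to a break-even count: how many queries it takes for RAG's upfront embed-and-index cost to be paid back by its cheaper per-query cost. The sketch below assumes a simple linear cost model. The function names and every dollar figure are hypothetical placeholders, not real prices.

```python
import math
from typing import Optional


def break_even_queries(fixed_rag_cost: float,
                       rag_cost_per_query: float,
                       long_context_cost_per_query: float) -> Optional[int]:
    """Number of queries at which RAG's total cost catches up with long context.

    Returns None when RAG is not cheaper per query, so it never catches up.
    """
    saving_per_query = long_context_cost_per_query - rag_cost_per_query
    if saving_per_query <= 0:
        return None
    return math.ceil(fixed_rag_cost / saving_per_query)


def total_costs(queries: int,
                fixed_rag_cost: float,
                rag_cost_per_query: float,
                long_context_cost_per_query: float) -> tuple:
    """Return (rag_total, long_context_total) for a given query count."""
    rag_total = fixed_rag_cost + queries * rag_cost_per_query
    long_context_total = queries * long_context_cost_per_query
    return rag_total, long_context_total


# Placeholder numbers: 600 to build the index, 0.25 per RAG query,
# 1.00 per long-context query.
threshold = break_even_queries(600.0, 0.25, 1.0)
rag_two, lc_two = total_costs(2, 600.0, 0.25, 1.0)
rag_many, lc_many = total_costs(100_000, 600.0, 0.25, 1.0)
```

Under these made-up numbers, a document that is loaded twice and dropped never gets near the threshold, while a support workload with a long tail of queries clears it quickly. The real inputs depend on the actual access pattern, which is the point of the paragraph above.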
The structural shift is in what gets built next. The loud use case was “load the entire repo into context.” The quiet one is that long context made prompt-cache hit rates the more interesting cost lever. Cache utilization jumped after the upgrade because shops started designing prompts to be larger and more stable, instead of smaller and dynamically composed. The win is the same one cache-hit pricing has always had: pay full price once to warm it, then pay one tenth of that to re-read it. A 1M context window made that math reachable for workloads that used to be too brittle to cache.
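The cache arithmetic is easy to check. The sketch below takes the one-tenth read ratio from the paragraph above and assumes the rest: that every read after the first hits the cache, that nothing expires in between, and that warming costs the normal input price unless a `write_multiplier` says otherwise. The function names and the price unit are hypothetical.

```python
CACHE_READ_FRACTION = 0.1  # "pay one tenth to re-read it"


def cost_without_cache(prefix_tokens: int, reads: int, price_per_token: float) -> float:
    """Every read pays full input price for the whole prefix."""
    return prefix_tokens * price_per_token * reads


def cost_with_cache(prefix_tokens: int, reads: int, price_per_token: float,
                    write_multiplier: float = 1.0) -> float:
    """First read warms the cache; later reads pay the discounted rate.

    Assumes a 100% hit rate after the first read, which real workloads
    will not reach if the prefix changes or the cache entry expires.
    """
    if reads <= 0:
        return 0.0
    warm = prefix_tokens * price_per_token * write_multiplier
    rereads = prefix_tokens * price_per_token * CACHE_READ_FRACTION * (reads - 1)
    return warm + rereads


# A stable 1,000-token prefix read 10 times at a placeholder price of 1.0 per token.
uncached = cost_without_cache(1_000, 10, 1.0)
cached = cost_with_cache(1_000, 10, 1.0)
savings_ratio = 1 - cached / uncached
```

The savings grow with the number of re-reads and vanish for a prefix that is read once, which is why larger, more stable prompts are the ones that benefit.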
RAG is not dead. It just stopped being the default answer to “how do I get this model to read a long document,” which was the wrong question and a very expensive way to find that out.