The inference cost curve is the most important chart in AI

In early 2023, GPT-4 class inference cost roughly $60 per million tokens. By mid-2024, that had dropped to $10. By early 2026, the frontier-tier cost was below $3 at many providers, with strong open-source alternatives running at fractions of that on commodity GPU hardware. That is a 20-30x reduction in about 36 months, and the trajectory is not flattening.
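To make the headline arithmetic concrete, here is a minimal sketch that turns the endpoints quoted above into a fold reduction and an implied yearly decline. It assumes a smooth exponential decline and uses $3 as the early-2026 price, which the text gives only as an upper bound. The function names are illustrative.

```python
def fold_reduction(start_price: float, end_price: float) -> float:
    """How many times cheaper the end price is than the start price."""
    return start_price / end_price


def annual_decline(start_price: float, end_price: float, months: float) -> float:
    """Constant yearly price decline (as a fraction) that links the two prices."""
    years = months / 12
    return 1 - (end_price / start_price) ** (1 / years)


# Endpoints from the text: about $60 -> about $3 per million tokens over 36 months.
fold = fold_reduction(60.0, 3.0)
rate = annual_decline(60.0, 3.0, 36)
print(f"{fold:.0f}x cheaper, about {rate:.0%} cheaper each year")
```

Under those assumptions the price falls by roughly 63% per year. A lower end price pushes the fold reduction toward the top of the 20-30x range.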

This is the most consequential number in AI for anyone building applications. Not because inference is the only cost, but because inference cost is the constraint that shapes everything downstream: what applications become economically viable, what prompting strategies are worth trying, what user experience patterns are achievable at scale.

Where the cost came from

The inference cost drop has four sources, roughly in order of contribution.

Model efficiency. The same capability level now requires less compute per token than it did 18 months ago. Techniques like mixture-of-experts routing, speculative decoding, and better quantization reduce the computational cost of a given output quality level. GPT-3.5 class performance today runs in models that are a fraction of GPT-3.5’s original size.

Hardware economics. GPU supply has normalized relative to the 2022-2023 shortage period. Specialized inference chips (Google TPUs, AWS Trainium/Inferentia, Microsoft’s Maia) are operating at scale and at better cost-per-FLOP than the hardware that served inference workloads in 2022. Data center utilization rates at the major cloud providers have improved.

Competition on margin. The inference market has four to six credible providers competing on price. The price compression in GPT-3.5 class and GPT-4 class models tracks with Anthropic and Google entering the same capability tier, not with raw cost changes alone. Competition is doing work here.

Caching and batching. Operational improvements — prompt caching, better request batching, speculative decoding at the serving layer — reduce the effective cost for typical production workloads even when the listed per-token price stays flat.
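As a rough illustration of how caching lowers the effective price while the list price stays flat, the sketch below blends a full-price rate with a discounted rate for cache hits. The 60% hit share and the 10% cached-token price are assumptions chosen for the example, not any provider's published terms.

```python
def effective_price(list_price: float, cached_fraction: float,
                    cached_price_ratio: float) -> float:
    """Blended price per million input tokens when some tokens hit the prompt cache.

    cached_fraction: share of input tokens served from cache (0 to 1).
    cached_price_ratio: cached-token price as a fraction of list price (assumed).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached = (1 - cached_fraction) * list_price
    cached = cached_fraction * list_price * cached_price_ratio
    return uncached + cached


# Hypothetical workload: $3 list price, 60% of input tokens are cache hits,
# and cache hits are billed at 10% of list price.
blended = effective_price(list_price=3.0, cached_fraction=0.6, cached_price_ratio=0.1)
print(f"${blended:.2f} per million tokens")
```

With these made-up inputs the blended rate comes to $1.38 against a $3 list price, which is why workloads with long shared prefixes see larger savings than the price sheet suggests.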

What the trajectory implies

If the pattern holds, GPT-4 class inference will cost under $1 per million tokens by 2027. Open-source models in the same capability range will run significantly cheaper on self-hosted or spot GPU infrastructure.
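One way to sanity-check that projection is to extrapolate the historical decline forward. This sketch assumes the $60 to $3 drop over 36 months continues at a constant exponential rate, which is the "if the pattern holds" condition and not a forecast. `months_until` is a hypothetical helper.

```python
import math


def months_until(target_price: float, current_price: float,
                 annual_decline: float) -> float:
    """Months until the price reaches target under a constant yearly decline."""
    if not 0 < annual_decline < 1:
        raise ValueError("annual_decline must be between 0 and 1")
    if target_price >= current_price:
        return 0.0
    yearly_ratio = 1 - annual_decline
    years = math.log(target_price / current_price) / math.log(yearly_ratio)
    return years * 12


# Implied yearly decline from $60 to $3 over three years (assumed to persist).
annual = 1 - (3.0 / 60.0) ** (1 / 3)
months_to_one_dollar = months_until(1.0, 3.0, annual)
print(f"about {months_to_one_dollar:.0f} months from a $3 starting point")
```

Starting from $3 in early 2026, the crossover lands a little over a year later, which is consistent with the under-$1 estimate for 2027.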

The implications cascade. Application designs that are currently constrained by per-query inference cost become unconstrained. Long-context applications that were cost-prohibitive at $7/million tokens become routine at $0.50. Agentic applications that make dozens of model calls per user action — currently carefully engineered to minimize calls — can be redesigned around richer reasoning loops.
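The per-action cost of an agentic workflow is simple multiplication, sketched below at the two price points mentioned above. The call count and tokens per call are illustrative assumptions, not measurements from any real application.

```python
def cost_per_action(calls: int, tokens_per_call: int,
                    price_per_million: float) -> float:
    """Dollar cost of one user action that fans out into several model calls."""
    total_tokens = calls * tokens_per_call
    return total_tokens * price_per_million / 1_000_000


# Hypothetical agent: 30 model calls per user action, 4,000 tokens per call.
at_old_price = cost_per_action(30, 4_000, 7.00)
at_new_price = cost_per_action(30, 4_000, 0.50)
print(f"${at_old_price:.2f} per action at $7/M, ${at_new_price:.2f} at $0.50/M")
```

At $0.84 per action the call budget is worth engineering effort. At $0.06 it mostly is not, which is the shift the paragraph describes.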

The critical caveat: not all inference gets cheaper at the same rate. Frontier-tier capabilities (the best reasoning, multimodal, and code-generation models) have shown less dramatic cost compression than the middle tier. The models that dropped 20x in cost are the models where competition reached a capability plateau and providers competed on margin. At the frontier, providers are still differentiated and pricing reflects that.

Where the floor is

The floor on inference cost is set by hardware costs, not software economics. A well-engineered data center running modern inference hardware has physical constraints: power, cooling, and silicon cost amortized over time. At current hardware economics, the floor for cloud inference on a 70B parameter model is likely somewhere in the range of $0.10-0.30 per million tokens, achievable within the next 24-36 months for providers operating efficiently at scale.
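A hardware-set floor can be sketched as amortized hourly cost divided by tokens actually served. Every input below (node cost, throughput, utilization) is a hypothetical placeholder chosen to show the shape of the calculation, not a measured figure for any real deployment.

```python
def floor_price_per_million(hourly_cost: float, tokens_per_second: float,
                            utilization: float) -> float:
    """Break-even price per million tokens for one serving node.

    hourly_cost: all-in cost per hour (power, cooling, amortized silicon).
    tokens_per_second: aggregate throughput at full load.
    utilization: fraction of the hour spent serving paid traffic (0 to 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical node: $20/hour all-in, 30,000 tokens/s aggregate, 70% utilized.
floor = floor_price_per_million(hourly_cost=20.0, tokens_per_second=30_000,
                                utilization=0.7)
print(f"${floor:.2f} per million tokens")
```

The useful part is the sensitivity: halving utilization doubles the floor, which is why the estimate applies to providers operating efficiently at scale.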

For smaller models (7B-13B parameter range), the floor is already nearly here. Models in this class running on spot GPU instances cost fractions of a cent per million tokens today.

What to do with this

For application architects, the inference cost curve means that applications designed around minimizing model calls are increasingly being optimized for the wrong constraint. The architectural judgment that was correct in 2023, “minimize the number of LLM calls,” becomes less relevant as the cost per call drops. The constraints that replace it are latency (model calls still take time) and quality (more calls can mean more opportunities for error in a reasoning chain).
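Both replacement constraints can be estimated in a few lines. This sketch assumes calls run sequentially and that each step fails independently with the same probability, a simplification that real reasoning chains will not match exactly. The numbers are illustrative.

```python
def chain_latency_seconds(calls: int, seconds_per_call: float) -> float:
    """Total wall-clock time when model calls run one after another."""
    return calls * seconds_per_call


def chain_success_probability(calls: int, per_call_success: float) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_call_success ** calls


# Hypothetical chain: 20 sequential calls, 1.5 s each, 98% per-step reliability.
latency = chain_latency_seconds(20, 1.5)
success = chain_success_probability(20, 0.98)
print(f"{latency:.0f} s end to end, {success:.0%} chance of a clean run")
```

Even with cheap calls, a 20-step chain in this example takes 30 seconds and completes cleanly only about two times in three, so call count still matters for reasons other than cost.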

For AI product strategists, the cost curve is the reason that pure inference resale (“we call OpenAI and resell the output”) is a structurally weak business position. Margins on resold inference compress faster than most SaaS economics assume. The durable value in AI products is at the application layer: the data, workflows, integrations, and evaluation infrastructure that turn cheap inference into reliably useful outputs.

The inference cost curve is not a given. It depends on continued hardware investment, competitive dynamics among providers, and the current trajectory of model efficiency improvements. But absent a disruption in any of those three factors, the curve continues down, and the applications that were economically impossible two years ago will be table stakes two years from now.
