Cloves Almeida’s Notes
Production ML Architecture

Engineers Cannot Rely on Falling Token Prices

Frontier models are much more capable than they were a year ago, but the old assumption that token prices will keep falling is breaking. At scale, teams can no longer count on price cuts to do the inference-optimization work for them.

  • #inference
  • #memory
  • #production-ml
  • #optimization

Compared to just a year ago, frontier models feel like magic.

Many of the problems people are trying to solve with them are worth the price. Expert time is often far more expensive than an API call, and some workloads really do need the smartest model available. The problem starts when a useful demo becomes a high-volume system, because the cost side used to improve in the background through falling token prices. That is no longer a safe assumption.

Recent launches make the shift hard to ignore. Google launched Gemini 3.5 Flash at three times the price of Gemini 3 Flash Preview. OpenAI launched GPT-5.5 at twice the price of GPT-5.4. Anthropic kept Opus 4.7 nominally flat, but its migration guide says the same text can use up to 35% more tokens. Same price per token is not the same price per task.

| Lab | Latest model | Input ($/Mtok) | Output ($/Mtok) |
| --- | --- | --- | --- |
| Anthropic | Claude Opus 4.7 | $5.00 | $25.00 |
| Google | Gemini 3.5 Flash | $1.50 | $9.00 |
| OpenAI | GPT-5.5 | $5.00 | $30.00 |
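
To make "price per token is not price per task" concrete, here is a minimal sketch of the arithmetic using the Opus 4.7 list prices from the table. The request size (2,000 input tokens, 500 output tokens) is a made-up example. Applying the full 35% inflation to both input and output is an assumption that takes the migration guide's upper bound at face value, not a measured figure.

```python
def cost_per_task(input_tokens, output_tokens, input_price, output_price,
                  token_inflation=1.0):
    """Dollar cost of one request. Prices are in $ per million tokens.

    token_inflation scales both token counts, modelling a tokenizer that
    needs more tokens to encode the same text (assumption: it applies
    equally to input and output).
    """
    in_cost = input_tokens * token_inflation * input_price / 1_000_000
    out_cost = output_tokens * token_inflation * output_price / 1_000_000
    return in_cost + out_cost


# Hypothetical workload: 2,000 input tokens and 500 output tokens per task.
before = cost_per_task(2_000, 500, input_price=5.00, output_price=25.00)
after = cost_per_task(2_000, 500, input_price=5.00, output_price=25.00,
                      token_inflation=1.35)

print(f"before: ${before:.6f}  after: ${after:.6f}  "
      f"change: {after / before - 1:.0%}")
```

Because the per-token price is flat, the per-task cost scales directly with the token count. Multiply by your own daily request volume to see what that means for a monthly bill.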

The memory story is pointing in the same direction. On SK hynix’s Q1 2026 earnings call, Park Jun-deok of DRAM Marketing called the price environment “not a temporary supply-demand imbalance but a structural change in the market” (English coverage, AI-translated from Korean, but the substance lines up across reports). Microsoft confirmed it from the buyer side a week later. On the Q3 FY2026 call, Amy Hood said the company expects to “remain constrained at least through 2026,” and that $25 billion of Microsoft’s 2026 capex is attributable to “higher component pricing” alone. That does not prove memory costs caused API prices to move. The public evidence supports a weaker but still useful claim: memory suppliers and a major buyer are describing sustained cost pressure at the same time model pricing has stopped falling. That combination makes the old assumption of automatic cost declines much harder to defend.

The simplest first move is routing: classify each call by what it actually needs, then send the easy ones to a smaller model. Mini, Haiku, and Flash-class models exist for a reason. This assumes you have an eval set. You do have an eval set, right? Without one, routing is guesswork, not optimization. If your eval set can prove that a request does not need a flagship model, route it.
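
As a rough illustration of eval-gated routing, the sketch below builds a routing table from eval results and falls back to the flagship model whenever the evidence is missing or thin. Everything in it is hypothetical: the model names, the category labels, the 95% pass-rate threshold, and the minimum of 20 eval cases are placeholders, and it assumes requests have already been classified into categories upstream.

```python
from dataclasses import dataclass

# Hypothetical model tiers; substitute whatever models you actually deploy.
SMALL_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"


@dataclass(frozen=True)
class EvalResult:
    category: str       # request class, e.g. "extraction"
    small_passed: bool  # did the small model pass this eval case?


def build_routing_table(results, min_pass_rate=0.95, min_cases=20):
    """Send a category to the small model only if the eval set supports it."""
    totals, passes = {}, {}
    for r in results:
        totals[r.category] = totals.get(r.category, 0) + 1
        passes[r.category] = passes.get(r.category, 0) + int(r.small_passed)

    table = {}
    for category, n in totals.items():
        supported = n >= min_cases and passes[category] / n >= min_pass_rate
        table[category] = SMALL_MODEL if supported else FLAGSHIP_MODEL
    return table


def route(category, table):
    """Categories without eval evidence default to the flagship model."""
    return table.get(category, FLAGSHIP_MODEL)


# Toy eval results: the small model passes 49/50 extraction cases
# but only 30/50 legal-analysis cases.
results = (
    [EvalResult("extraction", True)] * 49
    + [EvalResult("extraction", False)]
    + [EvalResult("legal-analysis", True)] * 30
    + [EvalResult("legal-analysis", False)] * 20
)
table = build_routing_table(results)

for category in ("extraction", "legal-analysis", "never-evaluated"):
    print(category, "->", route(category, table))
```

The default-to-flagship fallback is the design choice that matters here: a request is only downgraded when the eval set has enough cases to justify it, so missing evidence costs money rather than quality.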

The harder optimizations still matter, but they are in a different category: distilling task-specific small models from frontier outputs, quantizing serving deployments, fine-tuning a model on your task-specific data, or porting inference to non-HBM specialized hardware like Groq or Qualcomm. These projects need engineering time, production evals, and maintenance. A year ago, waiting for the next price cut was often the rational move. Now more of the savings have to come from architecture.
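
One way to decide whether a heavier project is worth starting is a simple payback estimate. The sketch below is only back-of-the-envelope arithmetic, and every number in it is invented for illustration: plug in your own engineering cost, expected inference savings, and ongoing maintenance cost.

```python
def breakeven_months(upfront_cost, monthly_savings, monthly_maintenance):
    """Months until net savings repay the upfront engineering cost.

    Returns None when maintenance eats all of the savings, meaning the
    project never pays for itself.
    """
    net_monthly = monthly_savings - monthly_maintenance
    if net_monthly <= 0:
        return None
    return upfront_cost / net_monthly


# Hypothetical distillation project: $120k of engineering time up front,
# $25k/month saved on inference, $5k/month for evals and retraining.
months = breakeven_months(
    upfront_cost=120_000,
    monthly_savings=25_000,
    monthly_maintenance=5_000,
)
print(f"break-even after {months:.1f} months")
```

The estimate ignores quality risk and the chance that list prices move again, so treat it as a first filter rather than a business case.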

The question is not whether frontier models are worth using. Many are worth every token. The question is where they are worth using, where a cheaper path is already good enough, and whether your inference plan still assumes a price curve that is no longer showing up.