tokenprice.co
Blog

How to cut your AI API costs by 80%

Published March 4, 20254 min read

Most teams overpay for LLM APIs by a factor of 3–10x. The reasons are predictable: defaulting to the most expensive model, sending too much context, never measuring per-feature cost, and ignoring caching.

Here are eight techniques that have repeatedly cut API bills by 50–90% in real production systems, ranked by effort-to-payoff.

1. Tier down your default model

The single biggest lever, by a wide margin. Most features built in 2023–2024 default to GPT-4 / GPT-4o / Claude Sonnet because that's what the team prototyped on. In 2025, GPT-4o mini, Claude 3 Haiku, Gemini 1.5 Flash, and DeepSeek V3 will handle 70%+ of those calls at 10–20x lower cost.

The work: 30 minutes to swap models, an afternoon to run a small eval. Typical savings: 70–90%.

2. Route by complexity

Send easy requests to the cheap tier and hard requests to the flagship tier. A simple router based on input length, intent classification, or a cheap "is this hard?" prompt is enough.

The work: a day. Typical savings: 50–80% on top of (1).

3. Use prompt caching

OpenAI and Anthropic both offer prompt caching now. If you have a long system prompt or a stable RAG context, you pay once and get up to 90% off the cached portion on subsequent calls.

The work: a few hours, mostly restructuring prompts so the stable part comes first. Typical savings: 30–80% on input cost for cached calls.

4. Shrink your prompts

Most production prompts contain 30–50% pure noise: stale instructions, redundant examples, verbose role-play, polite filler. Read every prompt out loud. Delete anything that doesn't change behaviour on your eval set.

The work: half a day. Typical savings: 20–40% input cost.

5. Cap output length

Output tokens cost 3–5x more than input. If you don't need a 2000-token response, don't ask for one. Set max_tokens aggressively. Add "respond in one paragraph" or "respond in JSON only" to the system prompt.

The work: an hour. Typical savings: 30–60% output cost.

6. Batch when latency doesn't matter

OpenAI and Anthropic both offer batch APIs at 50% off the regular price. If a workload doesn't need real-time responses (overnight processing, evals, content generation pipelines, classification of historical data), batch it.

The work: a day to plumb the batch API. Typical savings: 50% on the affected workload.

7. Cache outputs

If users ask the same questions repeatedly, cache the answers. A small Redis or Postgres-backed cache keyed on a normalised version of the input is usually enough. Even a 10% hit rate is a 10% saving.

The work: a day. Typical savings: 5–30% depending on workload repetition.

8. Measure per-feature cost

You cannot optimise what you do not measure. Tag every API call with the feature that triggered it, and aggregate cost per feature per day. The expensive features will surprise you. So will the cheap ones.

The work: half a day. Typical savings: it's the meta-lever — it makes (1) through (7) actually happen.

A concrete example

A team I worked with was spending ~$18k/month on OpenAI. The workload: a customer-support copilot with three features (suggested replies, ticket classification, summarisation).

After applying (1), (4), (5), (8) over a single week:

  • Classification moved from GPT-4o to GPT-4o mini → cost cut by 95%
  • Summarisation moved from GPT-4o to Haiku → cost cut by 90%
  • Suggested replies stayed on GPT-4o, but prompts shrunk by 40% and max_tokens capped at 200

New monthly bill: ~$2.4k. That's 87% off, with no measurable quality regression. Total engineering time: ~5 days.

What not to do

  • Don't switch to a self-hosted model "to save money" unless you are doing > 100M tokens/day. The total cost of ownership (GPUs, ops, evals, latency, on-call) is almost always higher than just paying the cheap tier of an API for at least the first year.
  • Don't fine-tune to save money. Fine-tuning saves money only if the resulting model lets you use a much cheaper base — and most teams discover the eval gap eats the savings.
  • Don't optimise prematurely. If you are spending less than $1k/month, your time is better spent on product. Come back when the bill matters.

Track your savings

Watch your provider's pricing on the tokenprice.co homepage. Prices drop again every few months — the cheap tier of 2025 will look expensive by 2026.

Related models