Optimization

How to Cut AI API Costs 90% in Production (Without Losing Accuracy)

April 2026 · 7 min read

The pattern is consistent across teams I work with: they prototype with GPT-4, the prototype works well, they push to production, and then the first billing cycle arrives. The costs are often 10–30× higher than expected because the prototype call patterns — synchronous, unbatched, uncached, using the largest available model for every task — get replicated unchanged into production.

Four techniques address most of this. None requires rebuilding the system. They are sequenced here by implementation effort rather than by impact; conveniently, the lowest-effort changes also tend to deliver the largest savings.

1. Model Right-Sizing

The first mistake is using the same model for everything. A flagship model is 15–60× more expensive per token than a small model, and for most structured tasks — classification, extraction, summarisation of short text — the small model is accurate enough.

| Task type | Use | Avoid |
|---|---|---|
| Classification (fixed categories) | gpt-4o-mini, claude-haiku | gpt-4o, claude-opus |
| Entity extraction from short text | gpt-4o-mini, claude-haiku | gpt-4o |
| Summarising documents (<2K tokens) | gpt-4o-mini | gpt-4o |
| Complex reasoning, multi-step tasks | gpt-4o, claude-sonnet | haiku alone |
| Long-form generation, creative output | gpt-4o, claude-sonnet | haiku |

Run an accuracy benchmark on your own data before switching. For classification with a well-defined vocabulary and a concise system prompt, small models typically achieve 93–97% accuracy — comparable to larger models, at a fraction of the cost.
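A minimal harness for that benchmark; `classify_fn` is whatever model wrapper you're evaluating (the names here are illustrative, adapt them to your stack):

```python
def benchmark(classify_fn, labeled_examples):
    """Accuracy of classify_fn over (text, expected_label) pairs."""
    correct = sum(
        1 for text, expected in labeled_examples
        if classify_fn(text) == expected
    )
    return correct / len(labeled_examples)
```

Run it once with the flagship model and once with the small model on the same held-out set; switch only if the gap is within your tolerance.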

2. Prompt Compression

System prompts in production are often 3–5× longer than they need to be, because they were written iteratively during prototyping — each fix got appended rather than integrated. Every token in the system prompt is billed on every call.

The compression process:

  1. Print the system prompt. Read it literally, not with the memory of why each line was added.
  2. Remove anything that restates the task description rather than constraining the output.
  3. Collapse examples: keep one representative example per output format, not five.
  4. Run the compressed prompt against your test set. If accuracy holds, ship it.

A system prompt that starts at 800 tokens almost always compresses to under 200 without accuracy loss. At 10,000 calls per day, that's 6 million fewer tokens billed daily.

# Before (820 tokens):
"""You are a helpful AI assistant specialized in customer support
ticket classification. Your role is to analyze incoming support
tickets and categorize them into the appropriate department...
[500 more tokens of context and examples]"""

# After (180 tokens):
"""Classify support tickets. Reply with exactly one of:
BILLING, TECHNICAL, ACCOUNT, GENERAL

Examples:
"Can't log in" → TECHNICAL
"Invoice question" → BILLING"""

The "before" version tells the model what it is. The "after" version tells the model what to do. For constrained tasks, the model doesn't need the backstory.
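The savings arithmetic from the 800-to-200-token example above, with assumed per-token prices (check your provider's current rates):

```python
calls_per_day = 10_000
before_tokens, after_tokens = 800, 200

# 600 tokens saved per call x 10,000 calls = 6,000,000 tokens/day
saved_per_day = (before_tokens - after_tokens) * calls_per_day

# Dollar impact depends on the model; two assumed input rates (USD per 1M tokens)
for model, rate in [("small model (assumed $0.15/1M)", 0.15),
                    ("flagship (assumed $2.50/1M)", 2.50)]:
    print(model, round(saved_per_day / 1_000_000 * rate, 2))
```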

3. Semantic Caching

Exact-match caching (storing API responses keyed by the literal prompt string) has near-zero hit rate in production because prompts vary slightly on every call. Semantic caching is different: it stores responses keyed by the meaning of the input, not the exact string.

The implementation uses embeddings:

import numpy as np
from openai import OpenAI

client = OpenAI()
cache = {}  # In production: Redis with TTL

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cached_classify(text: str, threshold: float = 0.95) -> str:
    # Get embedding for this input
    embedding = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"  # ~$0.00002 per call
    ).data[0].embedding

    # Check cache for semantically similar previous input
    for cached_text, (cached_embedding, cached_result) in cache.items():
        similarity = cosine_similarity(embedding, cached_embedding)
        if similarity >= threshold:
            return cached_result  # Cache hit — no LLM call

    # Cache miss: call the LLM (classify() wraps your small-model call)
    result = classify(text)
    cache[text] = (embedding, result)
    return result

The embedding call costs roughly 0.1% of a classification call. For workloads with recurring patterns — support tickets often cover the same 20–30 scenarios repeatedly — cache hit rates of 30–60% are common. At 60% hit rate, you've cut your LLM call volume nearly in half.
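Plugging in the figures above (60% hit rate, embedding at roughly 0.1% of a classification call):

```python
hit_rate = 0.60
embed_ratio = 0.001  # embedding call ~0.1% the cost of an LLM call

# Every request pays for an embedding; only misses also pay for the LLM call
relative_cost = embed_ratio + (1 - hit_rate)
# ~0.401: roughly 60% of LLM spend eliminated, embeddings nearly free
```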

The threshold matters: 0.95 is conservative (only near-identical inputs hit). 0.90 caches more aggressively but risks returning a cached answer for an input that merely looks similar. Calibrate based on your tolerance for the occasional misclassification.
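The in-memory dict above also grows forever and never invalidates. A minimal expiry wrapper, in pure Python as a stand-in for the Redis-with-TTL setup the comment mentions:

```python
import time

class TTLCache:
    """Dict-like cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry time)

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._store[key]  # lazy eviction on read
            return None
        return value
```

In production, Redis gives you the same behaviour natively (`SET key value EX seconds`), plus sharing across processes.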

4. Async Batching

Most API calls in support automation pipelines don't need to be synchronous. A ticket submitted at 14:03 doesn't need to be classified by 14:03:00.5 — it needs to be classified before anyone routes it, typically within a few seconds at most.

Collect calls in a small buffer (200–500ms) and submit as a batch:

import asyncio
from collections import deque

class BatchClassifier:
    def __init__(self, batch_size=20, max_wait_ms=300):
        self.queue = deque()
        self.batch_size = batch_size
        self.max_wait_ms = max_wait_ms
        self._timer = None  # pending timed flush, if any

    async def classify(self, text: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self.queue.append((text, future))

        if len(self.queue) >= self.batch_size:
            await self._flush()
        elif self._timer is None:
            # Schedule a timed flush so a partial batch never waits forever
            self._timer = asyncio.create_task(self._flush_after_wait())

        return await future

    async def _flush_after_wait(self):
        await asyncio.sleep(self.max_wait_ms / 1000)
        await self._flush()

    async def _flush(self):
        self._timer = None
        batch = []
        while self.queue and len(batch) < self.batch_size:
            batch.append(self.queue.popleft())
        if not batch:
            return  # a timer fired after a size-triggered flush; nothing to do

        # Single API call for the whole batch (classify_batch wraps your provider call)
        results = await classify_batch([text for text, _ in batch])

        for (_, future), result in zip(batch, results):
            future.set_result(result)

Batch processing with OpenAI's batch API also unlocks a 50% cost discount on eligible models. For workloads that can tolerate up to 24-hour turnaround (nightly reporting, bulk document processing), this is the single highest-leverage cost lever available.
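For the Batch API route, the request file is JSONL: one self-addressed chat-completions request per line. A sketch of building it (the custom_id scheme and system prompt here are illustrative):

```python
import json

def build_batch_lines(texts, model="gpt-4o-mini"):
    """JSONL lines for upload to OpenAI's Batch API."""
    system = ("Classify support tickets. Reply with exactly one of: "
              "BILLING, TECHNICAL, ACCOUNT, GENERAL")
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"ticket-{i}",           # your key for matching results
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": text},
                ],
            },
        }))
    return "\n".join(lines)
```

Upload the file with `purpose="batch"`, create the batch with `completion_window="24h"`, and poll until it completes; results come back as a JSONL file keyed by `custom_id`.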

Combining the Techniques

Applied together on a typical classification pipeline handling several hundred calls per day:

  - ~15× cost reduction from model right-sizing alone
  - ~4× additional reduction from prompt compression
  - >90% total cost reduction, all techniques combined

The ordering matters: do model right-sizing first. It's the highest-impact, lowest-effort change and doesn't require touching your caching or batching infrastructure. Prompt compression second — it reduces the cost of every remaining call. Semantic caching and batching third, once the first two have stabilised.

What This Doesn't Cover

These techniques apply to classification and structured extraction tasks. For open-ended generation (drafting emails, writing summaries of long documents, creative tasks), accuracy degradation on small models is more significant and the optimisation calculus changes. Measure first; don't assume the small model is good enough for every task in your pipeline.


Evgeny Goncharov

Founder, TechConcepts

I build automation tools and custom software for businesses. Previously at Yandex (Search) and EY (Advisory). Darden MBA. Based in Madrid.
