Cost Optimization

RAG cost comes from two main places: embedding your content (once, at ingest) and LLM calls (every chat turn). Reranking can add a smaller per-query API cost. Here’s where to tune each.

LLM cost

LLM calls are the biggest lever. Prices are in USD per 1M tokens:

Model	Input	Output	Use for
`gpt-4o-mini`	$0.15	$0.60	Cheap generation, rewriting, summarization
`gemini-2.5-flash-lite`	$0.10	$0.40	Lowest-cost option
`gemini-2.5-flash`	$0.30	$2.50	Fast, cost-efficient
`gpt-4o`	$2.50	$10.00	Complex reasoning, tool use
`claude-haiku-4-5`	$1.00	$5.00	Lightweight tasks
`claude-sonnet-4-6`	$3.00	$15.00	Balanced

Default to `gpt-4o-mini`. Reserve `gpt-4o` / `claude-sonnet-4-6` for turns that genuinely need stronger reasoning.
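To make the price gap concrete, here is a minimal sketch that turns the table above into a per-call cost. The prices are copied from the table; the `turn_cost` helper and the 4,000-in / 500-out token counts are illustrative assumptions, not part of the library.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "gemini-2.5-flash": (0.30, 2.50),
    "gpt-4o": (2.50, 10.00),
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}


def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single LLM call (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed turn: 4,000 input tokens (prompt + retrieved chunks), 500 output tokens.
for model in ("gpt-4o-mini", "gpt-4o"):
    print(f"{model}: ${turn_cost(model, 4_000, 500):.6f}")
# gpt-4o-mini: $0.000900
# gpt-4o: $0.015000
```

At these list prices `gpt-4o` costs roughly 16.7x as much as `gpt-4o-mini` for the same turn, whatever the input/output mix, because both its input and output prices are higher by the same factor.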

History compaction - summarizes old turns so you don’t resend the whole transcript every message. The single biggest saver on long chats. See chat tuning.
Intent routing - skips retrieval (and its context tokens) for small talk. On by default.
Cheap rewrite model - query rewriting is short and mechanical; point rewrite_llm at gpt-4o-mini even if your main model is larger. See query rewriting.
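max_context_chunks - caps how many retrieved chunks get sent to the LLM (ChatRetrievalQualityConfig). Fewer chunks mean fewer input tokens per turn.
max_tokens - caps output length on rag.llm(...). Worth setting: output tokens cost at least 4x as much as input tokens for every model in the table above.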
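To see why history compaction saves so much on long chats, the sketch below compares total input tokens when every call resends the full transcript against a compacted history (a fixed-size summary plus the last few turns). This is back-of-the-envelope arithmetic, not the library's compaction logic; the function name, the 200 tokens per turn, the 6 kept turns, and the 300-token summary are all assumptions, and the extra calls needed to produce summaries are ignored.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int,
    tokens_per_turn: int,
    keep_recent: Optional[int] = None,
    summary_tokens: int = 0,
) -> int:
    """Total input tokens sent over a chat of `turns` messages.

    keep_recent=None resends the whole transcript on every call.
    Otherwise each call sends at most `keep_recent` earlier turns verbatim,
    plus a fixed-size summary once older turns exist.
    """
    total = 0
    for n in range(1, turns + 1):
        history = n - 1  # turns that precede the current message
        if keep_recent is None or history <= keep_recent:
            sent = history * tokens_per_turn
        else:
            sent = keep_recent * tokens_per_turn + summary_tokens
        total += sent + tokens_per_turn  # history plus the current message
    return total


full = cumulative_input_tokens(50, 200)
compacted = cumulative_input_tokens(50, 200, keep_recent=6, summary_tokens=300)
print(full, compacted)  # 255000 78700
```

Without compaction, input tokens grow quadratically with chat length; with it, each call is bounded, so the total grows linearly. Under these assumptions a 50-turn chat sends about 3.2x fewer input tokens.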

Embedding cost

Model choice - voyage-3-lite is cheaper than voyage-3 / voyage-3-large. Use voyage-3-lite for high-volume or less nuanced content.
Chunk size - bigger chunks mean fewer chunks to embed. Don’t go so big that retrieval gets noisy (see chunking).
batch_size - larger batches mean fewer API round-trips.
Embed once - store vectors; never re-embed unchanged content. Use edit() to update only what changed.
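The "embed once" and batching ideas above can be sketched as a small cache keyed by a hash of each chunk's text. This is a generic illustration, not the library's storage layer or its edit() method: `EmbeddingCache`, `fake_embed_batch`, and the in-memory dict are hypothetical stand-ins for your embedding provider and a persistent vector store.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Caches vectors by content hash so unchanged chunks are never re-embedded."""

    def __init__(self, embed_batch: Callable[[List[str]], List[Vector]]):
        self._embed_batch = embed_batch  # assumed: your provider's batch embedding call
        self._store: Dict[str, Vector] = {}  # assumed: replace with a persistent store
        self.api_calls = 0

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, chunks: List[str], batch_size: int = 64) -> List[Vector]:
        # Collect only chunks whose content hash is not stored yet (deduplicated).
        missing: Dict[str, str] = {}
        for chunk in chunks:
            key = self._key(chunk)
            if key not in self._store:
                missing[key] = chunk
        keys = list(missing)
        # Send the misses in batches: fewer, larger requests.
        for start in range(0, len(keys), batch_size):
            batch_keys = keys[start:start + batch_size]
            vectors = self._embed_batch([missing[k] for k in batch_keys])
            self.api_calls += 1
            for key, vector in zip(batch_keys, vectors):
                self._store[key] = vector
        return [self._store[self._key(chunk)] for chunk in chunks]


def fake_embed_batch(texts: List[str]) -> List[Vector]:
    # Stand-in for a real embedding API: one dummy 1-d vector per text.
    return [[float(len(text))] for text in texts]


cache = EmbeddingCache(fake_embed_batch)
cache.embed(["alpha", "beta", "gamma"])           # 1 request, 3 chunks embedded
cache.embed(["alpha", "beta (edited)", "gamma"])  # 1 request, only the edited chunk
print(cache.api_calls)  # 2
```

Hashing the chunk text means an edit to one chunk invalidates only that chunk's vector; everything else is served from the store.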

Reranking cost

Use BM25 (provider="bm25") for a free, local reranker instead of a paid API call - good when exact-term matching is enough. See reranking.
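To show why BM25 costs nothing to run, here is a minimal, self-contained Okapi BM25 scorer in pure Python. It is a sketch of the general technique, not the library's `provider="bm25"` implementation; the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are assumptions.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "reset your password from the account settings page",
    "billing invoices are emailed monthly",
    "password rules require twelve characters",
]
scores = bm25_scores("reset password", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)  # [0, 2, 1]
```

The scorer only rewards literal term overlap, which is why it suits exact-term queries and why a document with no shared terms scores zero regardless of meaning.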