RAG is supposed to reduce hallucinations, but many implementations replace one problem with another: expensive retrieval pipelines that stuff the model with low-signal chunks. Costs go up, answer quality barely improves, and teams blame the model when the real issue is retrieval design.
What to remember
- RAG waste usually starts before the model call, inside chunking and retrieval policy.
- More retrieved chunks do not automatically create better answers.
- Bad chunk size and poor ranking amplify both spend and noise.
- Grounded AI gets cheaper when the evidence set gets sharper.
Where RAG waste starts
The most expensive RAG systems often retrieve too much because the team does not trust ranking quality. Instead of improving retrieval, they compensate by sending more chunks downstream.
That creates two costs at once: extra retrieval work and a larger prompt for the model. It also harms answer quality because the model has to sort through more irrelevant material.
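To make the double cost concrete, here is a back-of-envelope sketch. The chunk size, question size, and per-token price below are illustrative assumptions, not real figures from any provider.

```python
# Illustrative cost sketch. All constants here are hypothetical
# assumptions chosen to show the shape of the problem, not real prices.
TOKENS_PER_CHUNK = 400              # assumed average chunk size
PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed input price in USD
QUESTION_TOKENS = 60                # assumed question length

def prompt_cost(num_chunks: int) -> float:
    """Input-token cost of one model call carrying num_chunks of context."""
    tokens = QUESTION_TOKENS + num_chunks * TOKENS_PER_CHUNK
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

# Compensating for weak ranking by sending the top 10 instead of the
# top 3 roughly triples the context spend on every single question.
print(f"3 chunks:  ${prompt_cost(3):.5f} per call")
print(f"10 chunks: ${prompt_cost(10):.5f} per call")
```

The retrieval-side cost (more vector lookups, more reranking work) comes on top of this, which is why the waste compounds.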
Chunking and ranking matter more than people expect
Oversized chunks create bulky prompts. Tiny chunks create noisy search results and more overhead. The right chunking strategy depends on document structure, but the general goal is stable semantic units the model can reason over without filler.
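As a sketch of what "stable semantic units" can mean in practice, the chunker below keeps paragraphs whole and greedily merges small ones up to a token budget, instead of cutting at fixed character offsets. The budget and the whitespace-based token estimate are assumptions to tune per corpus.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 300) -> list[str]:
    """Greedy paragraph chunker: never splits a paragraph, merges small ones.

    Token counts are approximated by whitespace splitting; a real
    pipeline would use the embedding model's own tokenizer.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the running chunk when adding this paragraph would
        # exceed the budget, so each chunk stays semantically whole.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The point of the sketch is the policy, not the splitter: the unit of retrieval follows document structure, so a chunk reads as a coherent passage rather than a window of characters.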
Ranking quality is just as important. If your top results are mediocre, the natural temptation is to send the top ten instead of the top three.
- Tune chunk size for semantic coherence, not just index convenience
- Remove boilerplate that repeats across documents
- Track answer quality versus number of chunks retrieved
- Use staged retrieval when uncertainty is high
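The boilerplate item above can be approximated with document frequency: drop any chunk whose normalized text recurs across many documents, since repeated headers, legal footers, and navigation add tokens without adding evidence. The frequency threshold here is an assumption to tune.

```python
from collections import Counter

def drop_boilerplate(doc_chunks: dict[str, list[str]],
                     max_doc_frac: float = 0.5) -> dict[str, list[str]]:
    """Remove chunks whose normalized text recurs across many documents.

    doc_chunks maps document id -> list of chunk texts. A chunk that
    appears in more than max_doc_frac of all documents (an assumed
    threshold) is treated as boilerplate and dropped everywhere.
    """
    def norm(chunk: str) -> str:
        return " ".join(chunk.lower().split())

    # Count, per distinct chunk text, how many documents contain it.
    doc_freq: Counter[str] = Counter()
    for chunks in doc_chunks.values():
        for key in {norm(c) for c in chunks}:
            doc_freq[key] += 1

    limit = max_doc_frac * len(doc_chunks)
    return {
        doc_id: [c for c in chunks if doc_freq[norm(c)] <= limit]
        for doc_id, chunks in doc_chunks.items()
    }
```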
The better default for real teams
Default to sharper retrieval, smaller evidence sets, and clear escalation rules. If the first pass is uncertain, widen the search or send more context. Do not assume every question deserves the maximum prompt footprint from the start.
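A minimal sketch of that escalation rule, assuming a retriever that returns (chunk, score) pairs with higher scores meaning better matches. The threshold and the narrow/wide depths are hypothetical knobs to calibrate against your own ranking metrics.

```python
from typing import Callable

# Assumed interface: retrieve(query, k) -> top-k (chunk, score) pairs.
Retriever = Callable[[str, int], list[tuple[str, float]]]

def staged_retrieve(query: str, retrieve: Retriever,
                    min_score: float = 0.75,
                    narrow_k: int = 3, wide_k: int = 10) -> list[str]:
    """Start with a small evidence set; widen only when confidence is low."""
    hits = retrieve(query, narrow_k)
    # Escalate to a wider search only if the best match looks weak.
    if not hits or hits[0][1] < min_score:
        hits = retrieve(query, wide_k)
    return [chunk for chunk, _ in hits]
```

With this default, confident questions pay for three chunks and only uncertain ones pay for ten, which is the cost shape the section argues for.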
That one shift often improves both cost and answer quality because the model sees less noise and more intent.
Frequently asked questions
What makes RAG expensive most often?
Oversized prompts created by poor chunking, noisy retrieval, and sending too many chunks to the model.
Does retrieving more documents improve quality?
Not always. Beyond a point, it can hurt quality by adding irrelevant material and driving up cost.
What should teams optimize first in RAG?
Chunking and ranking. Better retrieval usually pays off faster than swapping models alone.
RAG should improve answers, not just inflate prompts
Spendwall helps teams see where AI and cloud spend grows so retrieval-heavy systems can be tuned with budget discipline as well as product discipline.
