What is RAG? A practical guide for AI builders
Retrieval-augmented generation, explained without the marketing fog. What RAG is, how it works, when it wins, and where it fails. From the team building AI products on it daily.
- RAG (Retrieval-Augmented Generation) is the pattern where AI looks up your data before generating an answer.
- Foundation models do not "know" your data by default. RAG fixes that without retraining.
- RAG wins for: knowledge-heavy answers, frequently-changing data, citation-required outputs.
- RAG fails for: tasks that need aggregation or reasoning across many documents, knowledge that was never written down, or cases where retrieval quality is fundamentally hard.
Every conversation about "let's add AI to our docs" eventually lands on the same architecture. The customer thinks they want a chatbot. They actually want RAG. This post explains what RAG is, how the pieces fit, when it earns its place, and where it quietly fails.
Why models do not know your data
Foundation models like Claude, GPT, Gemini, and Llama are trained on enormous public datasets. They learn patterns of language, reasoning, and general knowledge. They do not learn your contract templates, your support history, or the internal spec of your product. A long context window helps with a small document set, but a large knowledge base will not fit into a single conversation, and pasting everything into every prompt is slow and expensive.
Three ways to bridge that gap exist: prompt engineering (paste the relevant data into each prompt), fine-tuning (continue training the model on your data), and retrieval-augmented generation (look up relevant data at query time and pass it to the model). RAG is usually the right starting point. For the deeper trade-off, read our RAG vs fine-tuning vs prompting post.
The RAG architecture
Below is the canonical pipeline. Every production RAG system is a variation on this shape.
- Step 1: Ingest + chunk. Documents (PDFs, web pages, support transcripts, code, whatever) are broken into smaller chunks of 200-1,000 tokens. Smaller chunks retrieve more precisely; larger chunks preserve more context. The right size depends on the task.
- Step 2: Embed. Each chunk is converted into a vector, a numerical representation of its meaning, using an embedding model. Vectors of similar meaning end up close in space.
- Step 3: Store. Vectors are stored in a vector database (Pinecone, Weaviate, pgvector, Chroma, Milvus, or a hosted alternative). Each vector keeps a reference to its source chunk.
- Step 4: Retrieve. When a question comes in, it is embedded with the same model. The vector database returns the top K closest chunks by similarity. K is usually 3-10.
- Step 5: Augment. The retrieved chunks are pasted into the prompt as context, alongside the user question. The model now has the relevant pieces of your data in its working memory.
- Step 6: Generate. The foundation model produces an answer grounded in the retrieved context. It cites which chunks it used, when prompted to. The answer is shown to the user.
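The six steps above can be sketched in a few dozen lines of Python. Treat this as a toy under loud assumptions, not a production design: `embed` is a word-count stand-in for a real embedding model, `InMemoryStore` is a stand-in for a vector database, chunk size is counted in words rather than tokens, and the file names and sample text are made up for illustration. Step 6 is left as a comment because it depends on whichever model API you use.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Step 1: split text into chunks of at most max_words words.
    Words stand in for tokens here (assumption)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Step 2: toy embedding (word counts). A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Step 3: hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        # Each row keeps the vector, the chunk text, and a reference to its source.
        self.rows: list[tuple[Counter, str, str]] = []

    def add(self, text: str, source: str) -> None:
        for piece in chunk(text):
            self.rows.append((embed(piece), piece, source))

    def top_k(self, question: str, k: int = 3) -> list[tuple[str, str]]:
        """Step 4: embed the question the same way and return the k closest chunks."""
        query = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [(text, source) for _, text, source in ranked[:k]]


def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    """Step 5: paste the retrieved chunks into the prompt, numbered for citation."""
    context = "\n".join(f"[{i}] ({src}) {text}" for i, (text, src) in enumerate(hits, start=1))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


store = InMemoryStore()
store.add("Refunds are issued within 14 days of purchase.", "refund-policy.md")
store.add("Our office is open Monday to Friday, 9am to 5pm.", "contact.md")

question = "When are refunds issued?"
hits = store.top_k(question, k=1)
prompt = build_prompt(question, hits)
# Step 6 would send `prompt` to a foundation model and show the answer to the user.
```

Word counts only match exact words, so a question phrased differently from the source text can pull the wrong chunk. That weakness is the reason real systems use learned embeddings.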
Variations: hybrid retrieval (vector search + keyword search), reranking (a second pass to refine the top K), query rewriting (rewriting the user question for better retrieval), and multi-hop (retrieving in stages). Most production systems use at least two of these.
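Hybrid retrieval needs a way to merge two ranked lists whose scores are not comparable. One common option is reciprocal rank fusion, which uses only each chunk's position in each list. The sketch below assumes that method; nothing above says which fusion method any particular system uses. The chunk IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs. Each list contributes 1 / (k + rank) per chunk."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["c3", "c1", "c7"]   # hypothetical output of vector search
keyword_hits = ["c1", "c9", "c3"]  # hypothetical output of keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists rise to the top, which is the behaviour you want when vector search and keyword search agree.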
When RAG wins
RAG earns its place when:
- Your data changes frequently. Fine-tuning requires re-training every time the data drifts. RAG just updates the vector store.
- You need citations. Regulated industries — legal, healthcare, finance — need the model to point at the source. RAG returns sources naturally.
- The knowledge is large. A 10,000-page knowledge base does not fit into a single context window. RAG retrieves only what is relevant per query.
- Cost matters. Embedding and retrieval are usually cheaper than fine-tuning a large model.
At Xwits, the support agents inside XWorks Suite products use RAG to answer customer questions from each partner's own documentation. The salon's chatbot answers questions about the salon's service menu — not the menu of every salon on the platform.
When RAG fails (or struggles)
RAG is not magic. It fails when:
- Retrieval quality is poor. If your data is poorly chunked, badly indexed, or has too much noise, RAG returns the wrong chunks. The model then generates a plausible-sounding wrong answer.
- The task requires reasoning across many chunks. "What is our total spend on cloud across 2025?" needs the model to aggregate hundreds of invoices. Top-K retrieval returns only a handful of chunks, so the model never sees most of them. You need a structured query, not a language model.
- The knowledge is tacit. If the answer lives in the heads of experts and was never written down, no retrieval system will find it.
- Latency is critical. RAG adds a retrieval round-trip to every query. Sub-100ms response targets become hard.
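To make the aggregation case concrete, here is a minimal sketch of the structured-query route using Python's built-in `sqlite3`. The `invoices` table, its columns, and the sample rows are all hypothetical. The point is that a SQL `SUM` reads every matching row, while top-K retrieval would hand the model only a few of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "vendor TEXT, category TEXT, invoice_date TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("CloudCo", "cloud", "2025-01-15", 120000),
        ("CloudCo", "cloud", "2025-06-15", 135000),
        ("OfficeMart", "supplies", "2025-03-02", 8000),
        ("CloudCo", "cloud", "2024-12-15", 110000),  # previous year, must be excluded
    ],
)

# Sum every cloud invoice dated in 2025. ISO dates compare correctly as text.
(total_cents,) = conn.execute(
    "SELECT COALESCE(SUM(amount_cents), 0) FROM invoices "
    "WHERE category = ? AND invoice_date >= ? AND invoice_date < ?",
    ("cloud", "2025-01-01", "2026-01-01"),
).fetchone()

print(f"Cloud spend in 2025: {total_cents / 100:.2f}")  # Cloud spend in 2025: 2550.00
```

A language model can still help here by turning the question into the query or by phrasing the result, but the arithmetic belongs in the database.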
Common pitfalls
The "we just dumped everything into a vector DB" mistake
Most failed RAG systems share this pattern. Documents are crawled, embedded, stored — and retrieval quality is terrible because the chunks are full of boilerplate, navigation, or out-of-date content. RAG quality compounds from data quality. Clean your data first.
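One cheap cleaning pass is to drop lines that repeat across most pages, since navigation bars and footers tend to be identical everywhere. The sketch below is one heuristic among many, not a complete cleaning pipeline. The function name, the `max_share` threshold, and the sample pages are assumptions for illustration.

```python
from collections import Counter


def strip_repeated_lines(docs: dict[str, str], max_share: float = 0.5) -> dict[str, str]:
    """Drop any line that appears in more than max_share of the documents."""
    doc_counts: Counter = Counter()
    for text in docs.values():
        # Count each distinct line once per document.
        doc_counts.update({line.strip() for line in text.splitlines() if line.strip()})

    limit = max_share * len(docs)
    cleaned = {}
    for name, text in docs.items():
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and doc_counts[line.strip()] <= limit
        ]
        cleaned[name] = "\n".join(kept)
    return cleaned


pages = {
    "pricing": "Home | Docs | Login\nThe Pro plan costs 49 USD per month.\n(c) Example Inc",
    "refunds": "Home | Docs | Login\nRefunds are issued within 14 days.\n(c) Example Inc",
    "hours": "Home | Docs | Login\nSupport is open on weekdays.\n(c) Example Inc",
}
cleaned = strip_repeated_lines(pages)
print(cleaned["pricing"])  # The Pro plan costs 49 USD per month.
```

Run a pass like this before chunking and embedding, and spot-check what it removes. A line that legitimately repeats across pages, such as a shared legal notice, would be dropped too.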
Forgetting the human evaluator
How do you know your RAG is working? "It seems fine when I test it" is not a strategy. Build a small set of 50-100 golden questions with verified-correct answers. Test every RAG change against the set. Track retrieval precision and recall over time, alongside whether the final answers are correct.
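A golden-set check for the retrieval half can be very small. The sketch below assumes each golden question is labelled with the IDs of the chunks that should be retrieved, which is an extra labelling step beyond the verified answers. `retrieve` is a hypothetical stand-in for your real retrieval call, faked here with a lookup table so the example runs on its own.

```python
from typing import Callable


def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved chunk IDs against the labelled set."""
    top = retrieved[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(golden: list[dict], retrieve: Callable[[str], list[str]], k: int = 3) -> tuple[float, float]:
    """Mean precision@k and recall@k over the whole golden set."""
    scores = [precision_recall_at_k(retrieve(g["question"]), g["relevant"], k) for g in golden]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall


# Hypothetical golden set: question plus the chunk IDs a correct retrieval must include.
golden = [
    {"question": "How long do refunds take?", "relevant": {"refund-1"}},
    {"question": "When is support open?", "relevant": {"hours-1", "hours-2"}},
]

# Fake retriever for the demo. Replace with a call into your real pipeline.
fake_results = {
    "How long do refunds take?": ["refund-1", "pricing-2", "hours-1"],
    "When is support open?": ["pricing-1", "hours-1", "contact-4"],
}

mean_precision, mean_recall = evaluate(golden, lambda q: fake_results[q], k=3)
print(f"precision@3={mean_precision:.2f} recall@3={mean_recall:.2f}")
# precision@3=0.33 recall@3=0.75
```

Store the two numbers with each change to chunking, embedding model, or K. A drop in recall tells you the right chunk never reached the model, which no amount of prompt tuning will fix.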
Mixing tenants
In multi-tenant SaaS, every tenant's data must be retrievable only by that tenant. Cross-tenant leakage destroys trust. The vector store needs tenant-aware filtering at query time. We do this in XWorks Core by default.
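The safest shape for tenant-aware filtering is to make the tenant a required argument of the search call and to apply the filter before ranking, so that no caller can forget it. The sketch below illustrates that idea with an in-memory list. It is not a description of how XWorks Core implements it, and the `Chunk` type, the `search` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    vector: tuple[float, ...]


def dot(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(index: list[Chunk], tenant_id: str, query_vector: tuple[float, ...], k: int = 3) -> list[Chunk]:
    """Return the top-k chunks for one tenant. The filter runs before ranking."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")
    scoped = [c for c in index if c.tenant_id == tenant_id]
    return sorted(scoped, key=lambda c: dot(c.vector, query_vector), reverse=True)[:k]


index = [
    Chunk("salon-a", "Haircut: 40 USD", (1.0, 0.0)),
    Chunk("salon-b", "Haircut: 25 USD", (1.0, 0.0)),  # equally similar, different tenant
    Chunk("salon-a", "Open Tuesday to Saturday", (0.0, 1.0)),
]

results = search(index, "salon-a", (1.0, 0.0), k=2)
print([c.text for c in results])  # ['Haircut: 40 USD', 'Open Tuesday to Saturday']
```

The other salon's price is just as similar to the query and still never appears. In a real vector database the same filter is usually expressed as a metadata filter or a `WHERE` clause on the query. Whichever form it takes, it should have its own test in the golden set.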
How we use RAG at Xwits
Every XWorks Suite product ships with RAG built into the support agent and the smart search. Each partner gets a per-tenant retrieval scope. Documents (your help centre, your product specs, your past communications) are ingested, chunked, and embedded automatically. The agent answers customer questions grounded in that data. Citations are clickable.
For custom builds, RAG is usually the first piece of architecture we design. The cost is low; the upside is large; the failure modes are well understood.
What this means for you
- If your AI use case starts with "answer questions from our data," default to RAG. Try fine-tuning only after RAG has hit a wall.
- Spend more time on data quality than on the model choice. A great model on bad data still gives bad answers.
- Build a golden-question set early. Measure retrieval quality, not just generation quality.
- Read our deep RAG vs fine-tuning vs prompting post for the comparative trade-offs.
Want to talk through a specific RAG architecture for your business? Book a 30-minute call. We will sketch the pipeline with you on the call.



