GenAI · May 20, 2026 · 12 min read

RAG, fine-tuning, or custom model — a decision framework

When does retrieval beat fine-tuning? When does fine-tuning beat both? A practical decision framework with the criteria that actually matter — cost, latency, freshness, and skill required.

By PCCVDI Engineering

Every engagement that begins with “we want to build an LLM for our company” runs into the same fork within two weeks. RAG, fine-tuning, or a custom model? The wrong answer wastes six months of engineering capacity. The right answer is almost never “the latest thing.”

Below is the decision framework we walk clients through, distilled from twenty-plus engagements where the choice was non-obvious. It is opinionated. Treat the criteria as inputs, not gates — most real systems end up combining two of the three.

The three options, plainly

RAG (Retrieval-Augmented Generation). Keep an off-the-shelf foundation model. Build a retrieval pipeline over your data. At query time, fetch the relevant chunks and inject them into the prompt. The model synthesises an answer from the retrieved context.
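The query-time flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production pipeline: it ranks chunks with bag-of-words cosine similarity instead of an embedding model, and the names `retrieve`, `build_prompt`, and the `CHUNKS` sample data are all hypothetical. The resulting prompt would then be sent to whichever foundation model you use.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (stand-in for a vector search)."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}"
    )


# Hypothetical sample data standing in for an indexed document store.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Our offices are closed on public holidays.",
    "Premium support is available on the enterprise plan.",
]

question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(question, CHUNKS, k=1))
print(prompt)
```

Swapping the similarity function for real embeddings changes the quality of retrieval, not the shape of the flow: retrieve, inject, generate.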

Fine-tuning. Take an off-the-shelf foundation model and update some or all of its weights using your data. The result is a new model checkpoint that has absorbed the patterns, terminology, or style of your data.

Custom model. Train a model from scratch (or from a smaller base) on your own data. Almost no clients need this in 2026. We include it for completeness; the rest of the article focuses on the first two.

The five dimensions that actually decide

1. How fresh does the knowledge need to be?

If your model needs to know what happened this morning, fine-tuning loses immediately. Re-training takes hours to days; RAG updates the moment a document hits the index.

Almost all enterprise knowledge use cases — policy lookup, customer support, document Q&A, internal search — fall on the RAG side of this dimension. Anything reference-style, evolving, or audit-traced wants RAG.

2. What is the unit of competence — knowledge or behaviour?

This is the under-explained dimension that most teams miss. RAG injects knowledge: facts the model can quote and synthesise. Fine-tuning teaches behaviour: how the model speaks, structures, classifies, or reasons about a domain.

If the gap between your prompt and a good answer is “the model doesn’t know our products,” you have a knowledge problem — RAG. If the gap is “the model keeps responding in the wrong format, or in a style our brand voice forbids,” you have a behaviour problem — fine-tune. If both, you do both.

3. How much labelled data do you have?

Fine-tuning a modern foundation model for behaviour adaptation, whether by supervised fine-tuning (SFT) or direct preference optimisation (DPO), takes anywhere from a few hundred to a few thousand high-quality examples. Below 200 examples, you are not fine-tuning so much as overfitting; few-shot prompting will usually outperform the tuned model.

RAG needs no labelled data to start. It needs source data that is clean enough to retrieve. The labelled-data cost shows up later, in the evaluation harness — you still need a 200–500 query/answer golden set to prove the system works.
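As a rough illustration of what a golden set buys you, the sketch below scores a retriever by hit rate at k: the fraction of golden queries whose expected source document shows up in the top k results. It assumes each golden entry pairs a query with the id of the document that should answer it, which is one simplification of the query/answer set described above. The `keyword_retriever`, `INDEX`, and `GOLDEN` names and data are hypothetical stand-ins.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]


def hit_rate_at_k(golden: list[tuple[str, str]], retriever: Retriever, k: int = 3) -> float:
    """Fraction of golden queries whose expected document id is in the top-k results."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for query, expected_id in golden if expected_id in retriever(query, k))
    return hits / len(golden)


# Hypothetical keyword index standing in for a real retrieval pipeline.
INDEX = {
    "refund": ["policy-refunds", "faq-billing"],
    "holiday": ["hr-calendar"],
}


def keyword_retriever(query: str, k: int) -> list[str]:
    """Return document ids whose keyword appears in the query, truncated to k."""
    results: list[str] = []
    for keyword, doc_ids in INDEX.items():
        if keyword in query.lower():
            results.extend(doc_ids)
    return results[:k]


# Hypothetical golden set: (query, id of the document that should be retrieved).
GOLDEN = [
    ("What is the refund window?", "policy-refunds"),
    ("When are holiday closures?", "hr-calendar"),
    ("Who approves travel expenses?", "policy-travel"),
]

score = hit_rate_at_k(GOLDEN, keyword_retriever, k=3)
print(f"hit rate@3: {score:.2f}")
```

A retrieval metric like this only covers half the system; answer quality against the golden answers still needs its own check.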

4. What is the latency and cost profile?

RAG at scale has a real cost: an embedding model, a vector store, a reranker, and a base LLM hit per query. Mid-range numbers in 2026: 80–200 ms latency, $0.005–$0.02 per query, depending on retrieval volume and model tier.

A fine-tuned model — especially a smaller one (7B–13B) — can be cheaper and faster per query because it does not need retrieval. But it also has a fixed floor cost in GPU spend if you self-host, and it requires re-tuning whenever the source material drifts.

Rule of thumb: at high QPS with stable knowledge, a fine-tuned smaller model can beat RAG on cost-per-query by 5–10×. At low QPS with churning knowledge, RAG wins.
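The arithmetic behind this rule of thumb can be sketched as a break-even calculation. The figures are assumptions for illustration only: a hypothetical $2.00 per hour for one always-on GPU, and the low end of the RAG range quoted above ($0.005 per query). The sketch also assumes a single GPU can absorb the load at every QPS shown and ignores re-tuning and engineering costs.

```python
def self_hosted_cost_per_query(gpu_cost_per_hour: float, qps: float) -> float:
    """Amortised GPU cost per query for an always-on self-hosted model."""
    if qps <= 0:
        raise ValueError("qps must be positive")
    return gpu_cost_per_hour / (qps * 3600)


def break_even_qps(gpu_cost_per_hour: float, rag_cost_per_query: float) -> float:
    """Sustained QPS above which the self-hosted model is cheaper per query than RAG."""
    if rag_cost_per_query <= 0:
        raise ValueError("rag_cost_per_query must be positive")
    return gpu_cost_per_hour / (rag_cost_per_query * 3600)


GPU_COST_PER_HOUR = 2.00     # assumption: hypothetical hourly price of one GPU
RAG_COST_PER_QUERY = 0.005   # low end of the per-query range quoted in the article

for qps in (0.01, 0.1, 1.0, 10.0):
    ft_cost = self_hosted_cost_per_query(GPU_COST_PER_HOUR, qps)
    cheaper = "fine-tuned" if ft_cost < RAG_COST_PER_QUERY else "RAG"
    print(f"{qps:>5} QPS: fine-tuned ${ft_cost:.5f}/query vs RAG ${RAG_COST_PER_QUERY:.3f}/query -> {cheaper}")

print(f"break-even: {break_even_qps(GPU_COST_PER_HOUR, RAG_COST_PER_QUERY):.3f} QPS")
```

With these assumed numbers the crossover sits at roughly 0.11 sustained QPS, and at 1 QPS the self-hosted model comes out about 9× cheaper per query. Your own GPU price and traffic shape will move both figures.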

5. What does your audit and governance posture require?

Regulated industries care intensely about citation. Every claim the model makes should be traceable to a source the regulator can inspect. RAG bakes this in: every answer is grounded in a retrieved chunk you can point at. A fine-tuned model knows things, but it does not remember where from.

If your auditor will ever ask “where did the model get that?” — pick RAG. If you can defensibly point to the training corpus and that satisfies the audit — fine-tuning is open to you.
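One way to make "where did the model get that?" answerable is to carry chunk ids through to the answer and refuse anything that cites a chunk the retriever never returned. The sketch below is a minimal illustration of that idea; the `Chunk` and `Answer` types and the `audit_trail` helper are hypothetical, and it assumes the model has been prompted to return the ids of the chunks it relied on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source: str  # document title or path the chunk came from
    text: str


@dataclass(frozen=True)
class Answer:
    text: str
    cited_chunk_ids: tuple[str, ...]


def audit_trail(answer: Answer, retrieved: list[Chunk]) -> list[Chunk]:
    """Resolve an answer's citations to retrieved chunks; reject ungrounded answers."""
    if not answer.cited_chunk_ids:
        raise ValueError("answer cites no sources")
    by_id = {chunk.chunk_id: chunk for chunk in retrieved}
    missing = [cid for cid in answer.cited_chunk_ids if cid not in by_id]
    if missing:
        raise ValueError(f"answer cites chunks that were not retrieved: {missing}")
    return [by_id[cid] for cid in answer.cited_chunk_ids]


# Hypothetical example data.
retrieved_chunks = [
    Chunk("c1", "refund-policy.pdf", "Refunds are issued within 14 days of a return request."),
    Chunk("c2", "support-tiers.pdf", "Premium support is available on the enterprise plan."),
]
grounded = Answer("Refunds are issued within 14 days.", ("c1",))

for chunk in audit_trail(grounded, retrieved_chunks):
    print(f"{chunk.chunk_id} <- {chunk.source}")
```

This check confirms that each citation points at a real retrieved chunk. It does not confirm that the chunk actually supports the claim, which needs a separate evaluation step.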

The combinations that actually ship

After 30 production AI projects, here are the four patterns we see most often:

  1. Pure RAG on a general-purpose model. Most knowledge assistants, support copilots, document Q&A. Off-the-shelf model, hybrid retrieval, reranker, citation surface. Ship in 6–10 weeks. Cheapest and fastest. Right answer for ~60% of cases.
  2. Fine-tuned smaller model + RAG. When the use case demands consistent format, brand voice, or industry vocabulary, but the knowledge churns. Fine-tune a 7B–13B model for behaviour, RAG for knowledge. Ship in 10–16 weeks. ~25% of cases.
  3. Fine-tuned model only. Classification, structured extraction, summarisation, or any task where the “knowledge” is the dataset itself and re-training is acceptable. Ship in 8–12 weeks. ~10% of cases.
  4. Multi-model orchestration. A router decides which of several specialised models handles each query — some RAG, some fine-tuned, some general. Ship in 14–24 weeks. The right answer for <5% of cases, and most teams underestimate the operational complexity.
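To make the fourth pattern concrete, here is a deliberately small sketch of a router. It assumes keyword rules stand in for whatever classifier or model would make the routing decision in practice, and the three handlers are hypothetical placeholders for a RAG pipeline, a fine-tuned model, and a general-purpose model.

```python
from typing import Callable

Handler = Callable[[str], str]


def rag_handler(query: str) -> str:
    """Placeholder for a retrieval-backed pipeline."""
    return f"[rag] {query}"


def fine_tuned_handler(query: str) -> str:
    """Placeholder for a fine-tuned task model."""
    return f"[fine_tuned] {query}"


def general_handler(query: str) -> str:
    """Placeholder for an off-the-shelf general model."""
    return f"[general] {query}"


HANDLERS: dict[str, Handler] = {
    "rag": rag_handler,
    "fine_tuned": fine_tuned_handler,
    "general": general_handler,
}


def route(query: str) -> str:
    """Pick a handler name using keyword rules (a stand-in for a learned router)."""
    q = query.lower()
    if any(word in q for word in ("policy", "contract", "manual")):
        return "rag"
    if any(word in q for word in ("classify", "extract")):
        return "fine_tuned"
    return "general"


def answer(query: str) -> str:
    """Dispatch the query to the handler chosen by the router."""
    return HANDLERS[route(query)](query)


print(answer("What does the travel policy say about upgrades?"))
print(answer("Classify this ticket: my invoice is wrong"))
```

The dispatch itself is the easy part. The operational complexity mentioned above comes from everything around it: evaluating each route separately, monitoring misroutes, and keeping several models deployed and versioned.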

The trap: defaulting to fine-tuning because it sounds harder

We see this regularly. A team buys fine-tuning because it “sounds like proper engineering” — versus RAG, which “is just chunking and prompting.” Three months later they have a model that knows last quarter’s catalogue and cannot answer questions about anything that has changed since.

Fine-tuning has its place. It is rarely the first move. Start with RAG, prove the value, instrument the system, then identify the specific deficiencies that fine-tuning would fix. Most teams discover that the behaviour they thought they needed to fine-tune can be solved with a better prompt and a stricter eval suite.

Decision scorecard

For your next use case, pick the rows that apply and add up their scores, keeping a separate total for each column:

Dimension | Score for RAG | Score for fine-tuning
Knowledge changes often | +2 | −2
Need citations / source links | +2 | −1
Behaviour or style adaptation | −1 | +2
Have ≥1000 high-quality labelled examples | 0 | +1
Have <200 labelled examples | +1 | −2
Need sub-100ms latency at high QPS | −1 | +1
Sensitive to per-query cost at scale | −1 | +1
Source data sits in many documents | +2 | −1
Closed-format output (e.g. JSON schema) | 0 | +1
Regulator might inspect any decision | +2 | −1

Read the two totals together. RAG total above zero and fine-tuning total at or below zero: start with RAG. Fine-tuning total above zero and RAG total at or below zero: start with fine-tuning. Both above zero: combine. Neither above zero: revisit whether AI is the right approach for this use case at all.
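The scorecard is simple enough to run as code. The sketch below copies the weights from the table and applies one reading of the decision rule: both totals positive means combine, exactly one positive means start there, and otherwise revisit. The row keys, the `tally` and `recommend` helpers, and the example use case are hypothetical names chosen for illustration.

```python
# (score for RAG, score for fine-tuning) per scorecard row, copied from the table.
SCORECARD: dict[str, tuple[int, int]] = {
    "knowledge_changes_often": (2, -2),
    "needs_citations": (2, -1),
    "behaviour_or_style_adaptation": (-1, 2),
    "has_1000_plus_labelled_examples": (0, 1),
    "has_under_200_labelled_examples": (1, -2),
    "needs_sub_100ms_at_high_qps": (-1, 1),
    "sensitive_to_per_query_cost": (-1, 1),
    "source_data_in_many_documents": (2, -1),
    "closed_format_output": (0, 1),
    "regulator_may_inspect": (2, -1),
}


def tally(applicable: list[str]) -> tuple[int, int]:
    """Sum the RAG and fine-tuning scores over the rows that apply to a use case."""
    unknown = [row for row in applicable if row not in SCORECARD]
    if unknown:
        raise KeyError(f"unknown scorecard rows: {unknown}")
    rag_total = sum(SCORECARD[row][0] for row in applicable)
    ft_total = sum(SCORECARD[row][1] for row in applicable)
    return rag_total, ft_total


def recommend(rag_total: int, ft_total: int) -> str:
    """Map the two totals to a starting point (one reading of the rule above)."""
    if rag_total > 0 and ft_total > 0:
        return "combine RAG and fine-tuning"
    if rag_total > 0:
        return "start with RAG"
    if ft_total > 0:
        return "start with fine-tuning"
    return "revisit whether AI is the right approach"


# Hypothetical use case: a support copilot over churning, citable documentation.
support_copilot = [
    "knowledge_changes_often",
    "needs_citations",
    "source_data_in_many_documents",
    "has_under_200_labelled_examples",
]

rag_total, ft_total = tally(support_copilot)
print(rag_total, ft_total, recommend(rag_total, ft_total))
```

For the example use case the totals come out at +7 for RAG and −6 for fine-tuning, so the recommendation is to start with RAG.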


The right architecture for any AI feature is whatever lets you ship, measure, and iterate fastest with acceptable cost. RAG wins this trade more often than any other pattern in 2026 — not because it is the best architecture in isolation, but because it is the one most likely to give you a working system in 8 weeks and a clear path to better systems after that.
