Augmenting vs Training Large Language Models

One of the most consequential decisions in any AI project is whether to augment an existing Large Language Model or train (or retrain) one. The wrong choice can cost months of engineering effort, hundreds of thousands of dollars in compute, or — worse — deliver a system that doesn’t actually solve the problem.

This post provides a thorough comparison of every major approach on the augmentation-to-training spectrum, with honest pros and cons, real-world cases, and clear recommendations for when to reach for each one.


The Spectrum: Augmentation ↔ Training

It helps to think of the available strategies not as a binary choice but as a spectrum — from lightweight, inference-time techniques that leave model weights untouched, all the way to training a model from scratch.

| Strategy | Model Weights Change? | Data Required | Compute Cost |
| --- | --- | --- | --- |
| Prompt Engineering | No | None | Near zero |
| Retrieval-Augmented Generation (RAG) | No | Document corpus | Low–Medium |
| Tool Use / Function Calling | No | Tool definitions | Low |
| Knowledge Distillation | Yes (student model) | Teacher outputs | Medium–High |
| Parameter-Efficient Fine-Tuning (LoRA, QLoRA) | Partially (adapters) | 1k–10k examples | Moderate |
| Full Fine-Tuning | Yes (all weights) | 10k–100k+ examples | High |
| Continued Pre-Training (Domain-Adaptive) | Yes (all weights) | Billions of tokens | Very High |
| Pre-Training from Scratch | Yes (all weights) | Trillions of tokens | Extreme |

The further right you go, the deeper the model’s knowledge changes — but also the higher the cost, complexity, and risk. The central thesis of this guide is: start from the left, move right only when you have evidence that lighter approaches aren’t sufficient.


1. Prompt Engineering (Augmentation)

The simplest strategy: craft the input to steer the model’s behaviour without changing its weights. Techniques include zero-shot, few-shot, chain-of-thought (CoT), self-consistency, tree-of-thought, and system/role prompts.

How It Works

You supply instructions, examples, or reasoning scaffolds directly in the prompt. The model generates responses conditioned on that input — no weight updates, no pipelines, no infrastructure.

System: You are a financial analyst. When the user provides a company's
quarterly earnings, produce a structured analysis with: Revenue Trend,
Margin Analysis, Risk Factors, and Outlook. Use bullet points.

User: Here are ACME Corp's Q3 2025 earnings: revenue $2.1B (+8% YoY),
gross margin 42% (down from 45%), operating expenses up 12%...
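In application code this amounts to assembling messages. A minimal sketch, assuming an OpenAI-style chat message schema (the API call itself is omitted, so no provider is assumed):

```python
# Build a system + few-shot + user prompt as a list of chat messages.
def build_messages(system, examples, query):
    """system: role instructions; examples: (input, output) pairs; query: the live question."""
    messages = [{"role": "system", "content": system}]
    for user_msg, assistant_msg in examples:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": query})
    return messages

msgs = build_messages(
    "You are a financial analyst. Respond with bullet points.",
    [("Revenue $1B, +5% YoY", "• Revenue Trend: moderate growth")],
    "ACME Corp Q3 2025: revenue $2.1B (+8% YoY), gross margin 42%",
)
```

The few-shot pairs double as an output-format specification: the model imitates the structure of the assistant turns it sees.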

Pros

  • Zero infrastructure — Works with any model API out of the box.
  • Fastest iteration cycle — Change the prompt, test immediately.
  • No data requirements — No labelled datasets needed.
  • Composable — Stack techniques: few-shot + CoT + role prompts in a single query.
  • No risk of model degradation — The base model’s capabilities remain intact.

Cons

  • Context window ceiling — Prompts compete with the actual query for token budget; long instructions leave less room for the answer.
  • Fragile — Output quality can shift drastically with minor wording changes or across model versions.
  • No new knowledge — The model is limited to what’s in its training data plus the prompt.
  • Diminishing returns — Beyond a certain complexity threshold, no amount of prompt engineering can compensate for missing domain knowledge.
  • Hard to maintain — A production system with dozens of carefully tuned prompts becomes a maintenance burden.

Real-World Use Cases

  • Customer support triage — Classify tickets by urgency and department with few-shot prompts (Intercom, Zendesk).
  • Code review — System prompts instruct the LLM to review for security, performance, and style (GitHub Copilot).
  • Structured extraction — CoT prompts extract entities from legal documents at Klarity.

When to Use

Always start here. Prompt engineering is the baseline for every LLM project. Move to other approaches only after measuring that prompt engineering alone isn’t meeting your accuracy, consistency, or freshness requirements.


2. Retrieval-Augmented Generation — RAG (Augmentation)

RAG grounds the model in external knowledge by retrieving relevant documents at inference time and injecting them into the context. The model’s weights stay frozen.

How It Works

  1. Index — Chunk your document corpus and generate vector embeddings (OpenAI text-embedding-3-large, Cohere Embed, open-source BGE/E5).
  2. Retrieve — At query time, embed the question and fetch the top-k most similar chunks from a vector database (Pinecone, Weaviate, pgvector, Qdrant).
  3. Augment — Inject the retrieved chunks into the prompt.
  4. Generate — The LLM answers using the retrieved context.

System: Answer the user's question using ONLY the provided context.
Cite the source document for each claim. If the context doesn't
contain the answer, say "I don't have enough information."

Context:
[Doc 1 — Internal Policy v4.2]: "Employees are entitled to 25 days
of annual leave, increasing to 28 days after 5 years of service..."
[Doc 2 — HR FAQ]: "Carry-over of unused leave is capped at 5 days..."

User: How many vacation days do I get after 6 years at the company?
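Steps 2 and 3 reduce to nearest-neighbour search over embeddings plus string assembly. A toy sketch with hand-made 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, index, k=2):
    """index: list of (chunk_text, embedding) pairs; returns the k most similar chunks."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

index = [
    ("Employees get 25 days of annual leave, 28 after 5 years.", [0.9, 0.1, 0.0]),
    ("Carry-over of unused leave is capped at 5 days.",          [0.7, 0.3, 0.0]),
    ("The cafeteria opens at 8am.",                              [0.0, 0.1, 0.9]),
]
chunks = top_k([1.0, 0.2, 0.0], index)   # the two leave-policy chunks rank first
prompt_context = "\n".join(chunks)        # injected into the prompt (step 3)
```

In production the vectors come from an embedding model and the search runs inside a vector database, but the ranking logic is the same.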

Pros

  • Always current — Update the index and the model instantly “knows” new information without retraining.
  • Works with private data — Internal wikis, proprietary databases, customer records — none of it needs to be in the training set.
  • Auditable and citable — You can show users the exact source documents that informed the answer.
  • Reduces hallucinations — Grounding in retrieved facts significantly lowers fabrication rates (Shuster et al., 2021).
  • Model-agnostic — Works with any LLM; you can swap the underlying model without rebuilding the retrieval layer.

Cons

  • Retrieval quality is the bottleneck — Poor retrieval (wrong chunks, missed relevant docs) directly degrades answer quality.
  • Chunking is an art — Splitting documents at wrong boundaries (mid-paragraph, mid-table) destroys context.
  • Latency overhead — The embed → search → rerank → generate pipeline typically adds 200–500 ms per query.
  • Infrastructure cost — Requires a vector database, embedding pipeline, and often a reranking model.
  • Context window pressure — Injecting many chunks leaves less room for the conversation.
  • Doesn’t change model reasoning — If the model struggles with a task even given perfect context, RAG won’t help.

Real-World Use Cases

  • Enterprise search — Glean and Notion AI answer questions over company knowledge bases.
  • Legal research — Harvey AI retrieves relevant case law to assist lawyers drafting briefs.
  • Healthcare — Hippocratic AI retrieves clinical guidelines for evidence-based responses.
  • Customer support — Klarna’s AI assistant answers billing questions grounded in account data.

When to Use

Choose RAG when the model needs access to knowledge not in its training data — proprietary documents, frequently updated content, or data too large to fit in a prompt. RAG is the most common augmentation strategy in production today.


3. Tool Use / Function Calling (Augmentation)

Give the LLM the ability to invoke external tools — APIs, databases, calculators, code interpreters — to retrieve live data or perform actions it cannot do on its own.

How It Works

The model receives a list of tool definitions. When it decides a tool is needed, it emits a structured function call. Your orchestration layer executes the call and returns the result to the model.

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "query_inventory",
        "description": "Check real-time product inventory by SKU",
        "parameters": {
          "type": "object",
          "properties": {
            "sku": { "type": "string" },
            "warehouse": { "type": "string", "enum": ["US-East", "US-West", "EU"] }
          },
          "required": ["sku"]
        }
      }
    }
  ]
}
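On the orchestration side, executing the model's emitted call is a lookup-and-dispatch step. A minimal sketch, where `query_inventory` and its inventory data are hypothetical stubs:

```python
import json

# Stub implementation standing in for a real inventory backend.
def query_inventory(sku, warehouse="US-East"):
    stock = {("ABC-123", "US-East"): 42}   # hypothetical data
    return {"sku": sku, "warehouse": warehouse, "count": stock.get((sku, warehouse), 0)}

# Registry mapping tool names (as declared to the model) to implementations.
TOOLS = {"query_inventory": query_inventory}

def dispatch(tool_call_json):
    """tool_call_json: the structured call emitted by the model."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]                 # look up the declared tool
    result = fn(**call["arguments"])         # execute with the model's parameters
    return json.dumps(result)                # fed back to the model as the tool result

print(dispatch('{"name": "query_inventory", "arguments": {"sku": "ABC-123"}}'))
```

A production dispatcher adds validation of the arguments against the declared schema before executing, which is also the natural place to defend against prompt-injected parameters.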

Pros

  • Live data — Access real-time information (stock prices, weather, inventory) instead of stale training data.
  • Real-world actions — Send emails, create Jira tickets, execute trades, update databases.
  • Accurate computation — Offload math and data transforms to deterministic tools.
  • Composable — Combine tools into multi-step workflows.

Cons

  • Security risk — Prompt injection can trick the model into calling tools with malicious parameters.
  • Latency — Each tool call is a round-trip; multi-step chains compound delay.
  • Error handling — The model must cope with API failures, rate limits, and unexpected responses.
  • Model capability dependent — Smaller models struggle with reliable tool selection and parameter extraction.

Real-World Use Cases

  • ChatGPT Plugins / GPT Actions — OpenAI’s system lets ChatGPT call Expedia, Wolfram Alpha, Zapier, and hundreds of APIs.
  • Coding assistants — GitHub Copilot uses tool calls to read files, run commands, and search code.
  • Data analysis — Code Interpreter executes Python in a sandbox to analyze CSVs and generate charts.

When to Use

Choose tool use when the LLM needs to interact with the external world — query live data, perform calculations, or take actions. It’s essential for any assistant that goes beyond static text generation.


4. Knowledge Distillation (Hybrid)

Knowledge distillation transfers knowledge from a large, capable teacher model to a smaller, cheaper student model. The student’s weights change, but you don’t need human-labelled data — the teacher’s outputs serve as labels.

How It Works

  1. Run the teacher model (e.g., GPT-4, Claude 3.5) on a large set of prompts to generate high-quality outputs.
  2. Fine-tune a smaller student model (e.g., Llama-3-8B, Mistral-7B, Phi-3) on these (prompt, teacher-output) pairs.
  3. The student learns to mimic the teacher’s behaviour at a fraction of the inference cost.

# Simplified distillation pipeline (call_gpt4 is a placeholder for your teacher-model API)
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# 1. Generate teacher labels
teacher_outputs = [call_gpt4(prompt) for prompt in prompts]
dataset = Dataset.from_dict({"prompt": prompts, "completion": teacher_outputs})
# (tokenize the prompt/completion pairs before passing them to the Trainer)

# 2. Fine-tune student on teacher outputs
student = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
args = TrainingArguments(output_dir="./distilled-student", num_train_epochs=3, bf16=True)
trainer = Trainer(model=student, args=args, train_dataset=dataset)
trainer.train()
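The pipeline above distils through sampled text. When the teacher's logits are available, the classic formulation (Hinton et al., 2015) instead matches temperature-softened output distributions; a numpy sketch with made-up logits:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = np.exp((logits - logits.max()) / T)
    return z / z.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions, with the T^2 scaling from the paper."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

teacher = np.array([4.0, 1.0, 0.5])   # made-up per-token logits
student = np.array([3.0, 2.0, 0.1])
loss = distill_loss(teacher, student)  # positive; zero only when distributions match
```

The softened targets carry "dark knowledge": relative probabilities among wrong answers that hard labels discard.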

Pros

  • Cost reduction at inference — A distilled 7B model can serve requests at 10–50× lower cost than a 400B+ teacher.
  • Latency reduction — Smaller models respond faster, critical for real-time applications.
  • No human labelling required — The teacher model generates training data automatically.
  • Deployment flexibility — The student can run on-premise, on edge devices, or in air-gapped environments.

Cons

  • Capability ceiling — The student rarely exceeds the teacher on the distilled task; it typically achieves 85–95% of teacher quality.
  • License restrictions — Many model providers (OpenAI, Anthropic) prohibit using their outputs to train competing models — check terms of service.
  • Domain drift — If the teacher’s outputs don’t cover your domain well, the student inherits those gaps.
  • Still requires fine-tuning infrastructure — You need GPUs, training scripts, and evaluation pipelines.
  • Quality control — Teacher errors propagate to the student; there’s no human verification step by default.

Real-World Use Cases

  • Alpaca — Stanford’s Alpaca distilled GPT-3.5 into a Llama-7B model for $600, demonstrating that distillation can produce capable instruction-following models cheaply.
  • Vicuna — Vicuna-13B was trained on ShareGPT conversations (user-shared ChatGPT dialogues), achieving ~90% of ChatGPT quality.
  • Orca 2 — Microsoft Orca 2 distilled reasoning capabilities from GPT-4 into a 13B model, with structured explanation traces.
  • Production cost optimization — Companies distil GPT-4 quality into smaller models for high-volume, latency-sensitive endpoints (chatbots, autocomplete).

When to Use

Choose distillation when you need near-frontier quality at a fraction of the cost and you have access to a strong teacher model. It’s particularly effective for high-volume production workloads where inference cost dominates, or when you need to deploy on-premise. Always verify that teacher model terms of service permit this use.


5. Parameter-Efficient Fine-Tuning — PEFT (Training — Lightweight)

PEFT methods update only a small fraction of the model’s parameters, achieving domain adaptation at dramatically lower cost than full fine-tuning. The most popular technique is LoRA (Low-Rank Adaptation) and its quantized variant QLoRA.

Variants

| Technique | Parameters Updated | Data Needed | Compute |
| --- | --- | --- | --- |
| LoRA | Low-rank adapter matrices (~0.1–1% of total) | 1k–10k examples | Single GPU (A100/H100) |
| QLoRA | Same as LoRA, but base model quantized to 4-bit | 1k–10k examples | Single consumer GPU (RTX 4090) |
| DoRA | Decomposed weight + direction adapters | 1k–10k examples | Single GPU |
| Prefix Tuning | Learned prompt embeddings prepended to layers | 500–5k examples | Very low |
| IA3 | Learned rescaling vectors | 500–5k examples | Very low |

How It Works (LoRA)

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
    r=16,                              # rank of the low-rank matrices
    lora_alpha=32,                     # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable vs. total parameter counts — a small fraction (well under 1%) of the 8B total
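Conceptually, LoRA keeps the pretrained weight W frozen and learns a low-rank update, so the effective weight becomes W + (alpha/r)·B·A. A numpy sketch with toy dimensions, unrelated to any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 16, 32             # toy hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection, small random init
B = np.zeros((d, r))                 # trainable up-projection, zero init

def lora_forward(x):
    # Frozen base path plus scaled low-rank update; B = 0 at init,
    # so the adapted model starts out identical to the base model.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
print(np.allclose(lora_forward(x), x @ W.T))  # True at initialization
```

With realistic shapes (d in the thousands, r between 8 and 64), the 2·r·d adapter parameters are a tiny fraction of the d² base weights, which is where the efficiency comes from.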

Pros

  • Low compute cost — QLoRA can fine-tune a 70B model on a single 48GB GPU; LoRA fits 7–13B models on a single A100.
  • Fast training — Hours instead of days or weeks.
  • Minimal catastrophic forgetting — Because most weights are frozen, the model retains general capabilities.
  • Swappable adapters — Train multiple LoRA adapters for different tasks and hot-swap them on the same base model at inference time.
  • Open ecosystem — Hugging Face PEFT makes LoRA/QLoRA accessible with a few lines of code.

Cons

  • Capability ceiling — LoRA adapters can’t fundamentally change what the model knows; they adjust how it behaves. Deep factual knowledge still depends on the base model.
  • Hyperparameter sensitivity — Rank r, lora_alpha, target modules, and learning rate all significantly impact quality and require tuning.
  • Data quality matters — 1,000 high-quality examples beats 50,000 noisy ones; data curation is critical.
  • Evaluation complexity — You need domain-specific benchmarks to measure whether PEFT actually improved the model.
  • Adapter management — In production, managing multiple adapters, versioning, and A/B testing adds operational complexity.

Real-World Use Cases

  • Domain-specific chatbots — Companies fine-tune open-source models (Llama, Mistral) with LoRA on customer interaction data for on-brand responses.
  • Instruction tuning — Alpaca-LoRA demonstrated that LoRA can replicate instruction-following capabilities on consumer hardware.
  • Medical NLP — Researchers fine-tune clinical LLMs with LoRA on de-identified medical records for note summarization and coding.
  • Multilingual adaptation — LoRA adapters trained on language-specific corpora extend English-centric models to new languages efficiently.

When to Use

Choose PEFT when you need the model to learn domain-specific behaviour or output styles and you have moderate training data (1k–10k examples), but don’t have the budget or data for full fine-tuning. It’s the sweet spot for most production fine-tuning use cases.


6. Full Fine-Tuning (Training — Heavy)

Full fine-tuning updates every parameter in the model on your domain-specific dataset. It’s the most powerful form of adaptation — and the most expensive.

How It Works

Starting from a pre-trained checkpoint, you continue the standard training loop (forward pass, loss computation, backpropagation, weight update) on your labelled dataset. All layers are unfrozen and updated.

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# All parameters are trainable — no freezing, no adapters

training_args = TrainingArguments(
    output_dir="./finetuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="ds_config_zero3.json",  # distributed training
)
# `dataset` is your tokenized, labelled training set
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()

Pros

  • Maximum adaptation — The model can deeply internalize domain terminology, reasoning patterns, output formats, and factual knowledge.
  • Consistent, reliable outputs — Fine-tuned models produce outputs in the exact structure and tone you need without elaborate prompts.
  • Reduced prompt size — Behaviour is “baked in”; less instruction is needed at inference time, saving tokens and cost.
  • Full control — You own the model, its weights, and its behaviour. No API dependency.

Cons

  • Very expensive — Full fine-tuning of a 70B model requires multi-node GPU clusters (8×H100 or more) for days. Costs range from thousands to hundreds of thousands of dollars.
  • Massive data requirements — High-quality, labelled data at scale (10k–100k+ examples) is essential. Poor data quality leads to poor models.
  • Catastrophic forgetting — The model can lose general capabilities when overfitted to a narrow domain (Kirkpatrick et al., 2017).
  • Maintenance burden — When the base model is updated or your domain data evolves, you must retrain.
  • Evaluation is hard — You need robust, domain-specific benchmarks to know if fine-tuning helped — or hurt.

Real-World Use Cases

  • BloombergGPT — Bloomberg trained a 50B-parameter model on financial data for sentiment analysis, NER, and financial Q&A.
  • Med-PaLM 2 — Google fine-tuned PaLM 2 on medical datasets, achieving expert-level performance on medical licensing exams.
  • Code LLMsStarCoder, DeepSeek-Coder, and Codex are fine-tuned on massive code corpora.
  • Brand voice and compliance — Enterprises with strict output requirements fine-tune models to match internal style guides and regulatory standards.

When to Use

Choose full fine-tuning when you need the model to deeply change its knowledge or behaviour and you have substantial, high-quality data plus the compute budget to support it. It’s warranted when lighter methods (prompting, RAG, PEFT) leave a measurable performance gap.


7. Continued Pre-Training / Domain-Adaptive Pre-Training (Training — Deep)

Continued pre-training (CPT) resumes the unsupervised pre-training process on a large, domain-specific text corpus, updating all model weights. Unlike fine-tuning on task-specific (prompt, completion) pairs, CPT trains on raw text using the standard next-token prediction objective.

How It Works

  1. Collect a large domain corpus (millions to billions of tokens) — medical literature, legal filings, financial reports, codebases.
  2. Resume pre-training from an existing checkpoint using the causal language modelling objective.
  3. Optionally follow up with supervised fine-tuning or RLHF to make the model instruction-following.

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
domain_corpus = load_dataset("text", data_files="medical_corpus/*.txt")
# (tokenize and pack the raw text into fixed-length blocks before training)

training_args = TrainingArguments(
    output_dir="./domain-pretrained",
    num_train_epochs=1,            # usually 1–3 passes over the corpus
    per_device_train_batch_size=8,
    learning_rate=1e-5,            # lower LR than from-scratch pre-training
    bf16=True,
    deepspeed="ds_config_zero3.json",
)
trainer = Trainer(model=model, args=training_args, train_dataset=domain_corpus["train"])
trainer.train()
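CPT data pipelines usually concatenate tokenized documents and split the stream into fixed-length blocks ("packing") rather than padding each document. A sketch with small integer lists standing in for tokenizer output (the EOS id 0 is hypothetical):

```python
def pack(token_streams, block_size):
    """Concatenate tokenized documents (EOS-separated) and chunk into training blocks."""
    flat = [tok for doc in token_streams for tok in doc + [0]]  # 0 = hypothetical EOS id
    n_blocks = len(flat) // block_size
    # Trailing remainder shorter than block_size is dropped.
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack(docs, block_size=4)   # every block is exactly 4 tokens long
```

Packing keeps every position in every batch contributing to the loss, which matters when you are paying for billions of tokens of compute.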

Pros

  • Deep domain fluency — The model learns domain vocabulary, idioms, reasoning patterns, and factual knowledge at the weight level — not just surface-level pattern matching.
  • Unlabelled data is sufficient — You need raw domain text, not expensive labelled (prompt, completion) pairs.
  • Foundation for downstream tasks — A domain-pretrained model is a better starting point for subsequent fine-tuning than a generic model.
  • Better than RAG for deep reasoning — When the model needs to internalize domain logic (not just recall facts), CPT is more effective than retrieving documents at inference time.

Cons

  • Very expensive — Training on billions of tokens requires multi-node GPU clusters for days to weeks.
  • Data volume — You need millions to billions of tokens of domain text; many domains don’t have that much.
  • Catastrophic forgetting risk — Extended pre-training on a narrow domain can degrade general capabilities unless carefully managed (data mixing, learning rate scheduling).
  • Hard to evaluate — Measuring the quality of unsupervised pre-training requires downstream task benchmarks.
  • Long iteration cycles — Each experiment takes days, making rapid iteration impossible.

Real-World Use Cases

  • PMC-LLaMA — Continued pre-training of LLaMA on 4.8M PubMed Central biomedical papers, significantly improving medical reasoning.
  • SaulLM-7B — Continued pre-training on 30B tokens of legal text for legal NLP tasks.
  • Finance — Firms continue pre-training on proprietary trading data, earnings transcripts, and SEC filings to build domain-fluent models.
  • Code — Code Llama was produced by continuing pre-training of Llama 2 on 500B tokens of code.

When to Use

Choose continued pre-training when your domain has a large body of specialized text that is substantially different from the base model’s training data, and you need the model to develop deep fluency — not just retrieve facts. It’s the right choice for domains like medicine, law, and finance where the model must internalize domain-specific reasoning, not just look things up.


8. Pre-Training from Scratch (Training — Maximum)

Build a model from a randomly initialized state, training on trillions of tokens. This is how foundation models like GPT-4, Claude, Gemini, and Llama are created.

How It Works

  1. Define the model architecture (transformer variant, parameter count, context length).
  2. Curate a massive, diverse pre-training corpus (trillions of tokens from the web, books, code, etc.).
  3. Train using the next-token prediction objective on thousands of GPUs for weeks to months.
  4. Follow up with supervised fine-tuning and RLHF/DPO alignment.
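Before committing, a back-of-envelope compute estimate is worth running: dense transformer training costs roughly 6·N·D FLOPs for N parameters and D training tokens, a standard rule of thumb from the scaling-law literature. A sketch (the 400 TFLOP/s sustained throughput per GPU is an assumed figure, not a measured one):

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

flops = training_flops(7e9, 2e12)       # a 7B model on 2T tokens: ~8.4e22 FLOPs
gpu_days = flops / (400e12 * 86400)     # assuming ~400 TFLOP/s sustained per GPU
print(round(gpu_days))                  # on the order of a few thousand GPU-days
```

Multiplying by a cloud GPU-day price turns this directly into a budget line, which is usually the fastest way to rule the option in or out.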

Pros

  • Total control — You define the architecture, training data, safety properties, and capabilities from the ground up.
  • No inherited biases — You choose the training data, avoiding biases or limitations from someone else’s model.
  • Competitive advantage — A proprietary foundation model is a strategic asset that doesn’t depend on external providers.
  • Optimal architecture — You can design the model specifically for your use case (mixture-of-experts, long context, multi-modal).

Cons

  • Astronomical cost — GPT-4-class models cost $50–100M+ to train. Even smaller models (7B) cost $100k–$1M+ in compute.
  • Massive data requirements — You need trillions of curated, deduplicated, high-quality tokens.
  • Time — Months of calendar time, even with thousands of GPUs.
  • Team — Requires a dedicated team of ML researchers and infrastructure engineers.
  • Risk — A training run that fails or produces a bad model wastes enormous resources.
  • Rapidly diminishing advantage — Open-source models improve so fast that a custom model can become obsolete before it ships.

Real-World Use Cases

  • OpenAI (GPT-4) — Trained from scratch for general-purpose intelligence across text, code, and reasoning.
  • Anthropic (Claude) — Trained from scratch with a focus on safety and Constitutional AI alignment.
  • Meta (Llama 3) — Trained from scratch and released as open-source, creating an entire ecosystem.
  • Bloomberg (BloombergGPT) — A rare example of a domain-specific from-scratch model, trained on a mix of general and financial data.
  • Government/Defense — Organizations in regulated industries train from scratch to ensure data sovereignty and auditability.

When to Use

Pre-training from scratch is justified only when you cannot achieve your goals by building on an existing model — whether due to licensing restrictions, data sovereignty requirements, need for a novel architecture, or strategic reasons. For the vast majority of organizations, fine-tuning or augmenting an existing model is a better path.


9. RAFT — Retrieval-Augmented Fine-Tuning (Best of Both Worlds)

RAFT sits at the intersection of augmentation and training. You fine-tune the model specifically on how to use retrieved documents to answer questions, teaching it to distinguish relevant from irrelevant context.

How It Works

  1. Generate a training set of (question, retrieved documents, answer) triples.
  2. Include both oracle (relevant) and distractor (irrelevant) documents in the context.
  3. Fine-tune the model to produce chain-of-thought answers that cite the relevant documents while ignoring distractors.
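Steps 1 and 2 amount to mixing oracle and distractor documents into each training context. A miniature sketch; the clause texts and field names are placeholders:

```python
import random

def make_raft_example(question, oracle_docs, distractor_pool, n_distractors=3, answer=""):
    """Mix oracle and distractor documents, shuffled, into one training context."""
    context = list(oracle_docs) + random.sample(distractor_pool, n_distractors)
    random.shuffle(context)   # so the model can't learn a positional shortcut
    return {"question": question, "context": context, "answer": answer}

example = make_raft_example(
    "What is the notice period in clause 4?",
    oracle_docs=["Clause 4: either party may terminate with 30 days' notice."],
    distractor_pool=["Clause 7: governing law.", "Clause 2: definitions.",
                     "Clause 9: indemnity.", "Clause 11: severability."],
    answer="30 days, per Clause 4.",
)
```

The RAFT paper additionally drops the oracle document from a fraction of examples so the model also learns when the answer is not in the context.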

Pros

  • Best of both worlds — Combines domain adaptation from fine-tuning with freshness and auditability from RAG.
  • Robust to noisy retrieval — The model learns to ignore irrelevant retrieved documents.
  • Higher accuracy — RAFT outperforms both standalone RAG and standalone fine-tuning on domain Q&A benchmarks.
  • Citation quality — The model learns to cite specific passages, improving trustworthiness.

Cons

  • Double complexity — Requires both a retrieval pipeline AND a fine-tuning pipeline.
  • Data engineering overhead — Creating realistic (question, context, answer) triples with appropriate distractors is labour-intensive.
  • Double maintenance — You must keep both the retrieval index and the fine-tuned model up to date.
  • Higher barrier to entry — Requires expertise in both RAG systems and model training.

Real-World Use Cases

  • Enterprise document Q&A — Organizations where vanilla RAG produces too many errors on complex documents (contracts, technical specifications).
  • Regulatory compliance — Financial institutions that need high-accuracy, citation-backed answers over regulatory corpora.
  • Medical literature — Clinical decision support systems that must accurately synthesize evidence from retrieved studies.

When to Use

Choose RAFT when you’ve already implemented RAG but the model struggles with noisy retrieval or doesn’t reason well over retrieved documents. It’s the “advanced RAG” for teams with fine-tuning capability.


Head-to-Head Comparison

| Dimension | Augmentation (Prompt/RAG/Tools) | Training (Fine-Tune/Pre-Train) |
| --- | --- | --- |
| Time to production | Hours to weeks | Weeks to months |
| Compute cost | Low (inference only) | Medium to extreme |
| Data required | None to document corpus | Thousands to trillions of examples |
| Knowledge freshness | Real-time (RAG, tools) | Frozen at training time |
| Auditability | High (retrieved sources visible) | Low (knowledge is in weights) |
| Depth of adaptation | Shallow (surface behaviour) | Deep (internalized knowledge) |
| Risk of model degradation | None | Catastrophic forgetting possible |
| Maintenance | Update docs/tools | Retrain model |
| Vendor lock-in | Moderate (API-dependent) | Low (own the model) |
| Team expertise needed | ML engineering | ML research + infrastructure |

Decision Framework

Use this flowchart to decide where to start on the augmentation-to-training spectrum:

1. Is the base model already good enough?

Test with prompt engineering. If the model produces acceptable results with well-crafted prompts, stop here. Many teams over-engineer solutions when a good prompt would suffice.

2. Does the model need knowledge it doesn’t have?

  • Knowledge changes frequently or is private → RAG
  • Knowledge is static and you have labelled examples → Fine-Tuning (PEFT)
  • Both — and RAG alone isn’t accurate enough → RAFT

3. Does the model need to take actions or access live data?

  • Single tool calls → Tool Use / Function Calling
  • Complex multi-step workflows → Agentic Workflows (see the previous post)

4. Does the model need to deeply understand a specialized domain?

  • You have labelled (input, output) pairs → Full Fine-Tuning or PEFT
  • You have large volumes of unlabelled domain text → Continued Pre-Training
  • You need frontier-model quality but at lower cost → Knowledge Distillation

5. Do you need total control over the model?

  • Data sovereignty, novel architecture, or strategic reasons → Pre-Training from Scratch
  • Otherwise → Build on existing open-source or API models

6. Still not sure?

Start with augmentation. Measure. Add training only when measurements prove augmentation is insufficient.


Common Anti-Patterns

Avoid these frequently observed mistakes:

❌ “We need to fine-tune” (before trying prompting or RAG)

Fine-tuning is expensive and slow. Many teams jump to it before testing whether a well-designed RAG pipeline would solve the problem. As Chip Huyen notes, prompt engineering and retrieval should be your first line of attack.

❌ Using RAG when the model needs to learn reasoning

RAG provides facts, not skills. If the model needs to learn a new reasoning pattern (e.g., a specific diagnostic protocol, a proprietary scoring algorithm), fine-tuning is the right tool — RAG will just show it the protocol without teaching it to follow it.

❌ Fine-tuning for freshness

If the goal is keeping answers up to date, fine-tuning is the wrong approach — the moment your data changes, the model is stale. RAG handles freshness; fine-tuning handles depth.

❌ Pre-training from scratch instead of continued pre-training

Unless you have a specific architectural reason or data sovereignty need, continued pre-training on an existing open-source model is almost always more cost-effective than starting from zero.

❌ Neglecting evaluation

Every approach requires rigorous evaluation. Without domain-specific benchmarks, you can’t know whether your augmentation or training actually improved things — or made them worse.


Combining Approaches: The Production Stack

In practice, production AI systems layer multiple approaches. Here’s what a sophisticated enterprise AI assistant might look like:

  1. Base model — A continued-pretrained or fine-tuned model specialized for the domain.
  2. RAG layer — Real-time retrieval over internal docs for up-to-date knowledge.
  3. Tool calling — Integrations with internal APIs for live data and actions.
  4. Guardrails — Output validation, content filtering, and structured output enforcement.
  5. Memory — Persistent user context across sessions.
  6. Evaluation — Continuous monitoring with domain-specific benchmarks and human feedback loops.
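The layering reads naturally as function composition. A skeletal sketch in which every stage is a stub standing in for the real component:

```python
def retrieve(query):              # RAG layer (stub)
    return ["[Doc 1] relevant context"]

def call_tools(query):            # tool-calling layer (stub)
    return {"inventory": 42}

def generate(query, context, tool_results):   # domain-adapted base model (stub)
    return f"Answer to {query!r} using {len(context)} docs and {len(tool_results)} tool results."

def guardrail(text):              # output validation layer (stub)
    return text if len(text) < 10_000 else text[:10_000]

def answer(query):
    # Retrieval and tools augment the prompt; guardrails validate the output.
    return guardrail(generate(query, retrieve(query), call_tools(query)))

print(answer("How many units of SKU ABC-123 are in stock?"))
```

Because each layer has a narrow interface, you can swap the base model, the retriever, or the guardrails independently, which is exactly the flexibility the layered design buys you.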

The key insight is that augmentation and training are not mutually exclusive — they’re complementary layers in a well-designed system.


References & Further Reading

Foundational Papers

  1. RAG — Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS 2020.
  2. Chain-of-Thought — Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, NeurIPS 2022.
  3. Self-Consistency — Wang, X. et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, ICLR 2023.
  4. Tree of Thoughts — Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, NeurIPS 2023.
  5. LoRA — Hu, E.J. et al. “LoRA: Low-Rank Adaptation of Large Language Models”, ICLR 2022.
  6. QLoRA — Dettmers, T. et al. “QLoRA: Efficient Finetuning of Quantized LLMs”, NeurIPS 2023.
  7. RAFT — Zhang, T. et al. “RAFT: Adapting Language Model to Domain Specific RAG”, 2024.
  8. Knowledge Distillation — Hinton, G. et al. “Distilling the Knowledge in a Neural Network”, NeurIPS Workshop 2015.
  9. DoRA — Liu, S. et al. “DoRA: Weight-Decomposed Low-Rank Adaptation”, 2024.
  10. IA3 — Liu, H. et al. “Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning”, NeurIPS 2022.

Training & Alignment

  1. InstructGPT / RLHF — Ouyang, L. et al. “Training language models to follow instructions with human feedback”, NeurIPS 2022.
  2. DPO — Rafailov, R. et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”, NeurIPS 2023.
  3. Catastrophic Forgetting — Kirkpatrick, J. et al. “Overcoming catastrophic forgetting in neural networks”, PNAS 2017.
  4. Llama 3 — Dubey, A. et al. “The Llama 3 Herd of Models”, 2024.
  5. Code Llama — Rozière, B. et al. “Code Llama: Open Foundation Models for Code”, 2023.
  6. Reducing Hallucinations — Shuster, K. et al. “Retrieval Augmentation Reduces Hallucination in Conversation”, EMNLP 2021.

Domain-Specific Models

  1. BloombergGPT — Wu, S. et al. “BloombergGPT: A Large Language Model for Finance”, 2023.
  2. Med-PaLM 2 — Singhal, K. et al. “Towards Expert-Level Medical Question Answering with Large Language Models”, 2023.
  3. StarCoder — Li, R. et al. “StarCoder: May the Source Be with You!”, 2023.
  4. DeepSeek-Coder — Guo, D. et al. “DeepSeek-Coder: When the Large Language Model Meets Programming”, 2024.
  5. Codex — Chen, M. et al. “Evaluating Large Language Models Trained on Code”, 2021.
  6. PMC-LLaMA — Wu, C. et al. “PMC-LLaMA: Towards Building Open-source Language Models for Medicine”, 2023.
  7. SaulLM-7B — Colombo, P. et al. “SaulLM-7B: A pioneering Large Language Model for Law”, 2024.
  8. Orca 2 — Mitra, A. et al. “Orca 2: Teaching Small Language Models How to Reason”, 2023.

Books

  1. Chip Huyen — Designing Machine Learning Systems, O’Reilly, 2022. Covers production ML systems, evaluation, and deployment — directly relevant to deciding between augmentation and training.
  2. Chip Huyen — AI Engineering, O’Reilly, 2025. Comprehensive guide to building applications with foundation models — covers prompting, RAG, fine-tuning, and evaluation.
  3. Sebastian Raschka — Build a Large Language Model (From Scratch), Manning, 2024. Walks through pre-training, fine-tuning, and RLHF from first principles — essential for understanding the training side.
  4. Jay Alammar & Maarten Grootendorst — Hands-On Large Language Models, O’Reilly, 2024. Practical guide covering prompt engineering, RAG, fine-tuning, and multi-modal models with code examples.
  5. Cameron R. Wolfe — A Complete Guide to Fine-Tuning LLMs, Substack deep-dive series. Accessible introduction to LoRA, PEFT, and practical fine-tuning considerations.
  6. Sinan Ozdemir — Quick Start Guide to Large Language Models, Addison-Wesley, 2024. Covers the augmentation-to-training spectrum with practical examples and decision frameworks.

Tools & Platforms

  1. Hugging Face PEFT — Library for parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning, IA3).
  2. LangChain / LangGraph — Framework for building LLM applications with RAG, chains, agents, and tool use.
  3. Instructor — Library for structured LLM outputs using Pydantic models.
  4. Axolotl — Streamlined fine-tuning tool supporting LoRA, QLoRA, and full fine-tuning across multiple model architectures.
  5. Unsloth — 2–5× faster LoRA/QLoRA fine-tuning with 80% less memory.