How much data do you need to fine-tune an LLM?

It depends on the task. For simple format or style changes — like teaching a model to always respond in a specific JSON schema or to adopt a brand voice — dozens to a few hundred high-quality examples are often enough. For complex domain adaptation, such as teaching medical terminology or legal reasoning patterns, you typically need thousands to tens of thousands of examples. Quality consistently matters more than quantity: 200 carefully curated, representative examples routinely outperform 2,000 noisy ones.

What is the difference between fine-tuning and RAG?

Fine-tuning updates the model's weights so behavior is baked in, which is ideal for stable tasks where you want consistent style, format, or tone. Retrieval-Augmented Generation (RAG) keeps the model's weights unchanged and injects retrieved documents into the context at inference time, which is ideal for tasks where the underlying knowledge changes frequently (e.g., real-time product catalogs, live documentation). Fine-tuning is mostly an upfront training cost that recurs only when requirements change; RAG is an ongoing retrieval infrastructure cost. Many production systems combine both: a fine-tuned model paired with a retrieval layer for factual grounding.

Can fine-tuning make a weak base model much stronger?

No. Fine-tuning extracts and sharpens capabilities the base model already has — it does not inject fundamentally new reasoning ability. A 3B parameter model fine-tuned on medical data will not suddenly reason as well as a 70B base model. Fine-tuning optimizes for a narrow slice of behavior; it cannot compensate for a base model that lacks the latent capability you need. This is why choosing the right base model before fine-tuning is the most important decision in any fine-tuning project.

What is LoRA and why does it make fine-tuning affordable?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds small adapter matrices alongside the model's existing weight matrices rather than updating all weights. A 7B parameter model has roughly 7 billion weights; LoRA might train only 4-40 million adapter parameters, less than 1% of the total. This dramatically reduces GPU memory requirements and training time. A model that would need several 80GB A100s to fine-tune fully can often be fine-tuned with LoRA on a single consumer RTX 4090. QLoRA adds 4-bit quantization of the frozen base weights on top, reducing memory further while retaining most of the quality gains.

What Is Fine-Tuning an LLM? Complete Guide to SFT, LoRA, RLHF & More (2026)

Q: What is fine-tuning an LLM?

Fine-tuning is the process of taking a large language model that was already pre-trained on a massive general corpus and continuing to train it on a smaller, domain-specific or task-specific dataset. The model's weights are updated during fine-tuning, which distinguishes it from prompting (where no weights change) and from training from scratch (where you start with random weights). The result is a model that has internalized specific behaviors, styles, or domain knowledge without needing long system prompts on every inference call.

The Three Ways to Customize an LLM

When a pre-trained language model does not behave the way you need it to, you have three levers: prompting, fine-tuning, and training from scratch.

Prompting is the cheapest lever. You write a system message that describes the behavior you want. No weights change. Every inference call carries the full prompt. The model's underlying behavior is unchanged — you are steering a general-purpose model with instructions on each request.

Training from scratch is the most expensive lever by orders of magnitude. You start with randomly initialized weights and train on hundreds of billions to trillions of tokens. This is what labs like Anthropic, OpenAI, and Meta do to create base models. The compute cost runs into millions of dollars for frontier-scale models. Almost no application team has a reason to do this.

Fine-tuning sits between those extremes. You take an existing pre-trained model — one that already understands language, can follow instructions, and has broad world knowledge — and continue training it on a curated, task-specific dataset. The model's weights are updated, but you start from a rich initialization rather than random noise. The result is a model that has internalized the new behavior without you bearing the full cost of pretraining.

The distinction from prompting is important: after fine-tuning, the behavior is baked into the weights. You do not need to re-explain it on every call. This has downstream effects on latency, cost, and consistency.

For a deeper look at what those weights actually are and how parameter counts translate to model capability, see What are parameters in a large language model?

Why Fine-Tune? The Business Case

Five concrete reasons to choose fine-tuning over longer prompts:

1. Style and format consistency

If you need a model to always output structured JSON, respond in a specific brand voice, or follow a proprietary document format, fine-tuning achieves this more reliably than prompt instructions. Prompt instructions can be followed or ignored depending on context length and how the model weighs them against the rest of the conversation; fine-tuned behavior is part of the model's weights and is far less likely to drift across long conversations.

2. Proprietary domain knowledge

Pre-trained models are trained on public internet data up to a knowledge cutoff. They do not know your internal codebase conventions, your company's product taxonomy, your medical institution's clinical protocols, or your law firm's citation style. Fine-tuning on internal examples teaches the model those conventions and patterns, so it applies them without being reminded in every prompt. For discrete facts that must be recalled exactly, retrieval remains the more reliable tool.

3. Reduced prompt length

A fine-tuned model has instructions baked in. A prompt that previously required 800 tokens of system instructions might drop to 50 tokens after fine-tuning. At scale — millions of inference calls per day — this is a meaningful cost reduction.
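To make the scale of that saving concrete, here is a back-of-the-envelope sketch. The call volume and the per-token price are made-up illustrative values, not figures from any provider; only the 800-to-50 token reduction comes from the paragraph above.

```python
def daily_prompt_savings(tokens_before, tokens_after, calls_per_day,
                         usd_per_million_input_tokens):
    """Return (input tokens saved per day, dollars saved per day)."""
    tokens_saved = (tokens_before - tokens_after) * calls_per_day
    dollars_saved = tokens_saved / 1_000_000 * usd_per_million_input_tokens
    return tokens_saved, dollars_saved


# 800-token system prompt shrinks to 50 tokens after fine-tuning.
# 2M calls/day and $0.50 per million input tokens are hypothetical.
tokens, dollars = daily_prompt_savings(800, 50, 2_000_000, 0.50)
print(f"{tokens:,} tokens/day, ${dollars:,.2f}/day")
```

Plug in your own volume and pricing; the saving grows linearly with both.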

4. Latency improvement

Shorter prompts mean faster first-token latency, because the model has fewer input tokens to process before it can start generating. For latency-sensitive applications (voice agents, real-time coding assistants), even a small saving on every response matters.

5. Cost reduction vs frontier models

A fine-tuned 7B or 13B open-source model running on your own infrastructure can match or beat a raw call to a frontier API on narrow tasks, at a fraction of the per-token cost. If your task is well-defined and your fine-tuning dataset is good, a frontier API charges you for general reasoning you do not need.

Supervised Fine-Tuning (SFT): The Foundation

The most common and straightforward form of fine-tuning is supervised fine-tuning. The setup is simple: you create a dataset of input-output pairs that represent the behavior you want the model to learn, then train the model to predict the output given the input.

What the dataset looks like

The industry-standard format for SFT datasets is JSONL (JSON Lines), where each line is a training example. For instruction-following models, examples typically use a chat template:

jsonl

{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Inc. Always respond concisely and offer a follow-up action."}, {"role": "user", "content": "My order hasn't arrived yet."}, {"role": "assistant", "content": "I'm sorry to hear that. Can you share your order number? I'll check the status and, if needed, arrange a re-shipment within 24 hours."}]}
{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Inc. Always respond concisely and offer a follow-up action."}, {"role": "user", "content": "How do I update my billing address?"}, {"role": "assistant", "content": "Go to Account Settings > Billing > Edit Address. The change takes effect on your next invoice. Would you like me to walk you through each step?"}]}

Each line is a complete conversation. The model learns to produce the assistant turn given the preceding context.
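Malformed lines are a common source of failed or silently degraded training runs, so it can help to validate the file before training. The following is a minimal sketch using only the standard library; the function name and the specific checks are illustrative choices, not part of any fine-tuning toolkit.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_example(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    # The model learns the assistant turn, so each example should end on one.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not an assistant turn")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(validate_chat_example(good))
print(validate_chat_example(bad))
```

Running this over every line of the training file and failing fast on the first non-empty result is usually enough for a first pass.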

How many examples do you need?

This is the question every fine-tuning project starts with, and the honest answer is: it depends on task complexity.

Task type	Rough example count	Notes
Format / schema change	50–200	JSON output, specific template adherence
Tone / style change	100–500	Brand voice, formality level
Domain terminology	500–2,000	Medical, legal, financial jargon
Complex reasoning patterns	2,000–20,000+	Multi-step domain-specific logic
General task specialization	10,000–100,000	Broad domain adaptation

Quality dominates quantity. A dataset of 300 carefully reviewed, representative examples consistently outperforms 3,000 scraped, noisy ones. Before scaling your dataset, invest in dataset quality: remove duplicates, fix inconsistencies, and manually review a random sample.
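Duplicate removal is the easiest of those checks to automate. The sketch below drops examples whose messages are identical after lowercasing and collapsing whitespace; that normalization rule is an assumption chosen for illustration, and near-duplicate detection would need something stronger.

```python
import json


def dedupe_examples(lines):
    """Keep the first occurrence of each conversation, comparing normalized text."""
    seen = set()
    kept = []
    for line in lines:
        messages = json.loads(line)["messages"]
        # Normalize: lowercase and collapse runs of whitespace in each turn.
        key = tuple(
            (m["role"], " ".join(m["content"].lower().split())) for m in messages
        )
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


lines = [
    '{"messages": [{"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "Can you share your order number?"}]}',
    '{"messages": [{"role": "user", "content": "where is  my order?"}, {"role": "assistant", "content": "Can you share your order number?"}]}',
    '{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Use the Forgot Password link on the sign-in page."}]}',
]
unique = dedupe_examples(lines)
print(len(lines), "->", len(unique))
```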

The training loop

During SFT, the model runs its standard forward pass on each example and computes a cross-entropy loss between its predicted token probabilities and the target tokens in the training output. Gradients flow backward through the network and the optimizer updates the weights. The model is not generating text during training — it is being scored against the ground truth output token by token.
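As a toy illustration of that scoring step, the sketch below computes the mean cross-entropy over target tokens from already-computed probability distributions. The `loss_mask` argument, which excludes prompt tokens so only the assistant's tokens are scored, is a common practice assumed here rather than something the description above requires.

```python
import math


def sft_loss(predicted_probs, target_tokens, loss_mask):
    """Mean negative log-probability of the target token at each scored position."""
    total, count = 0.0, 0
    for probs, target, keep in zip(predicted_probs, target_tokens, loss_mask):
        if keep:
            total += -math.log(probs[target])
            count += 1
    return total / count


# Three positions over a tiny vocabulary. Position 0 belongs to the prompt
# (mask 0); positions 1 and 2 are assistant tokens that get scored.
predicted_probs = [
    {"hello": 0.7, "sure": 0.2, "thanks": 0.1},
    {"hello": 0.1, "sure": 0.5, "thanks": 0.4},
    {"hello": 0.25, "sure": 0.5, "thanks": 0.25},
]
target_tokens = ["hello", "sure", "thanks"]
loss_mask = [0, 1, 1]

loss = sft_loss(predicted_probs, target_tokens, loss_mask)
print(round(loss, 4))
```

The loss falls as the model assigns more probability to the ground-truth token at each scored position, which is exactly what the gradient update pushes toward.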

RLHF: From Correct to Preferred

Supervised fine-tuning teaches a model to produce outputs that match your examples. But matching examples is not the same as producing outputs that humans genuinely prefer — especially when there are multiple valid answers of differing quality.

Reinforcement Learning from Human Feedback (RLHF) addresses this. The process has three stages:

Stage 1: SFT Train an initial model using supervised fine-tuning on high-quality demonstrations, as described above.

Stage 2: Reward model training Collect human preference data: show raters pairs of model outputs for the same prompt and ask which they prefer. Train a separate reward model that learns to predict human preference scores for arbitrary model outputs.

Stage 3: RL optimization Use the reward model as a signal to optimize the SFT model via reinforcement learning — typically Proximal Policy Optimization (PPO). The model generates responses, the reward model scores them, and the RL update pushes the model toward higher-scoring responses. A KL-divergence penalty keeps the model from drifting too far from the SFT baseline.
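Stages 2 and 3 each hinge on one small formula, sketched below under common assumptions: a pairwise (Bradley-Terry style) loss for the reward model, and a reward that subtracts a KL-style penalty weighted by a coefficient `beta`. The function names and the `beta` value are illustrative, and real pipelines apply these per token or per batch on tensors rather than on single floats.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


def kl_shaped_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT reference."""
    return reward - beta * (logprob_policy - logprob_reference)


# Stage 2: the reward model is penalized when it ranks the rejected output higher.
print(pairwise_preference_loss(2.0, 0.0))   # preferred output scored higher: low loss
print(pairwise_preference_loss(0.0, 2.0))   # ranking is wrong: high loss

# Stage 3: the policy assigns a higher log-probability to this response than the
# SFT reference does, so part of the reward is taken back.
shaped = kl_shaped_reward(reward=1.0, logprob_policy=-5.0, logprob_reference=-7.0)
print(shaped)
```

A larger `beta` keeps the policy closer to the SFT baseline at the cost of slower reward improvement.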

The result is a model that does not merely reproduce training examples but produces outputs that score highly on human preference — more helpful, more truthful, less harmful. This is the training approach behind InstructGPT, ChatGPT, and most production-grade chat models.

For a deeper treatment of RLHF and how it connects to Constitutional AI and scalable oversight — including why human feedback alone cannot scale to frontier model complexity — see Scalable oversight: from human feedback to constitutions and "weak-to-strong" intuition.

LoRA and QLoRA: Fine-Tuning Without Full GPU Clusters

Full fine-tuning of a large language model updates every parameter in the network. A 7B parameter model in 16-bit precision occupies roughly 14GB of memory just to store the weights — before accounting for optimizer states, gradients, and activations, which typically multiply memory requirements by 4-8x. Full fine-tuning of a 7B model in practice requires roughly 80-100GB of GPU memory. That means multiple A100s.
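The arithmetic behind those figures can be sketched in a few lines. The 4x and 8x multipliers are the rough range quoted above, not measured values; actual usage depends on the optimizer, sequence length, and batch size.

```python
def weights_gb(num_params, bits_per_param):
    """Memory needed just to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7e9
weights_16bit = weights_gb(params_7b, 16)

# Rule-of-thumb range once optimizer states, gradients, and activations are added.
low_estimate = 4 * weights_16bit
high_estimate = 8 * weights_16bit

print(f"weights only: {weights_16bit:.0f} GB")
print(f"full fine-tuning estimate: {low_estimate:.0f}-{high_estimate:.0f} GB")
```

The 80-100GB figure sits inside that range, which is why a single 80GB card is typically not enough for full fine-tuning of a 7B model.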

LoRA (Low-Rank Adaptation) changes the math dramatically.

The core idea

Instead of updating the full weight matrix W (which might be 4096×4096 ≈ 16.8M parameters), LoRA freezes W and adds two small matrices: A (4096×r) and B (r×4096), where r is a small "rank" parameter, typically between 4 and 64. The effective weight update is A×B, which has the same shape as W but only r×(4096+4096) trainable parameters. At rank 16, that is 16 × 8,192 = 131,072 parameters instead of roughly 16.8M, under 1% of the original matrix. Only A and B are trained.

python

# Conceptually, LoRA changes:
# output = W @ input
# to:
# output = (W + A @ B) @ input
# where W is frozen and only A, B are trained

The rank hyperparameter r controls the expressiveness of the update. Higher rank means more capacity to capture changes — but also more parameters and more memory. For most fine-tuning tasks, r=8 to r=32 is a good starting point.
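To see how the adapter size scales with rank, the following sketch counts trainable parameters for a single 4096×4096 matrix, matching the example dimensions used above. The helper name is illustrative.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is d_out x rank, B is rank x d_in."""
    return rank * (d_in + d_out)


full_matrix = 4096 * 4096

for rank in (4, 8, 16, 32, 64):
    adapter = lora_param_count(4096, 4096, rank)
    print(f"r={rank:>2}: {adapter:>7,} params ({adapter / full_matrix:.2%} of the full matrix)")
```

Doubling the rank doubles the adapter size, so even r=64 stays at a small fraction of the frozen matrix.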

Why this matters in practice

Approach	7B model GPU memory	Hardware requirement
Full fine-tuning (bf16)	~80-100GB	2-4× A100 80GB
LoRA (r=16)	~16-20GB	1× A100 40GB or RTX 4090
QLoRA (4-bit + r=16)	~8-12GB	RTX 3090 / RTX 4090

QLoRA combines LoRA with 4-bit quantization of the frozen base model weights. The base model is loaded in NF4 (Normal Float 4) format, and LoRA adapters are trained in bf16. This roughly halves the memory footprint again, making 7B fine-tuning feasible on a single consumer GPU with 24GB VRAM.

The practical implication: you can fine-tune a capable open-source model for a specialized task on hardware that costs a few hundred dollars per month to rent, rather than thousands. This is why LoRA and QLoRA have become the default approach for fine-tuning in production engineering teams.

Knowledge Distillation: The Teacher-Student Model

Fine-tuning with LoRA optimizes an existing model for a specific task. Knowledge distillation is a different technique: it uses a larger, more capable teacher model to train a smaller student model to perform nearly as well on a target distribution.

The key insight is that a large model's output probability distributions carry more information than just the final answer. When a teacher model predicts the next token, its confidence scores across the vocabulary encode nuanced uncertainty — for instance, the teacher might assign 40% probability to "cat," 35% to "dog," and 25% to "animal" rather than just outputting "cat." A student trained to match these soft labels rather than just the hard ground-truth labels learns richer representations.
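The difference between hard and soft labels shows up directly in the loss. In this sketch, two hypothetical students are scored against both kinds of target using cross-entropy; the teacher distribution is the one from the example above, while the student distributions are invented for illustration.

```python
import math


def cross_entropy(target_probs, student_probs):
    """Cross-entropy of the student's distribution against a target distribution."""
    return -sum(
        p * math.log(student_probs[token])
        for token, p in target_probs.items()
        if p > 0
    )


teacher = {"cat": 0.40, "dog": 0.35, "animal": 0.25}      # soft labels
hard_label = {"cat": 1.0, "dog": 0.0, "animal": 0.0}      # ground truth only

student_a = {"cat": 0.90, "dog": 0.05, "animal": 0.05}    # overconfident in "cat"
student_b = {"cat": 0.45, "dog": 0.30, "animal": 0.25}    # mirrors the teacher's uncertainty

for name, student in (("A", student_a), ("B", student_b)):
    print(
        f"student {name}: hard-label loss {cross_entropy(hard_label, student):.3f}, "
        f"soft-label loss {cross_entropy(teacher, student):.3f}"
    )
```

The hard label rewards student A for being confident, while the soft labels reward student B for also learning which alternatives are plausible. That extra signal is what distillation transfers.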

Distillation from RL checkpoints

A recent and increasingly important variant — highlighted in the VibeThinker 3B paper — is distillation from reinforcement learning checkpoints. Here, the teacher is not just a large pre-trained model but a model that has already undergone expensive RL training to develop specific reasoning behaviors. The student is trained on the teacher's RL-refined outputs, absorbing the reasoning patterns without running the full RL process itself.

This is why VibeThinker 3B, a 3-billion-parameter model, can match Claude Opus 4.5 on specific coding benchmarks: it was distilled from a larger RL-trained teacher, then given its own RL instruct pass. The combination is remarkably sample-efficient for narrow task domains.

Distillation is especially valuable when:

You need a model small enough to run locally or on edge devices
Your task is well-defined and the teacher model performs it well
You want to avoid the cost of RL training on the student directly
Inference cost is a hard constraint (smaller model = faster + cheaper)

Fine-Tuning vs RAG vs Prompting: A Decision Matrix

This is the question practitioners get wrong most often. The answer is not "one is better" — the three approaches solve different problems and are often combined.

Dimension	Prompting	RAG	Fine-Tuning
Knowledge type	General (whatever base model knows)	Dynamic, retrieved from external store	Static, baked into weights
Data freshness	Real-time via prompt	Real-time via retrieval	Stale after training
Setup cost	Minimal	Medium (build retrieval pipeline)	High (dataset + training)
Inference cost	Higher (long prompts)	Medium (retrieval + shorter prompt)	Lower (short prompts)
Behavior consistency	Variable (prompt sensitive)	Variable	High (baked in)
Best for	Flexible tasks, unclear requirements	Changing data, document Q&A	Stable tasks, format/style/domain

When to choose each

Use prompting when: requirements are still evolving, you need to handle a wide variety of tasks, or the task involves general reasoning where the base model already performs adequately.

Use RAG when: the task requires factual information that changes frequently (product catalogs, documentation, news), or the knowledge base is too large to bake into weights, or you need citations and provenance for retrieved information.

Use fine-tuning when: you want consistent behavior across all invocations, the task has a specific format or style that prompts cannot reliably enforce, you have high inference volume and want shorter prompts to reduce costs, or you are working with proprietary domain knowledge that the base model genuinely lacks.

Combine fine-tuning + RAG when: you want consistent style and behavior (from fine-tuning) plus access to current factual information (from retrieval). A fine-tuned model with a retrieval layer is the architecture most production systems converge on.

For the broader question of whether to use open-source fine-tunable models vs closed API models, see Closed-source AI vs local open-source alternatives 2026.

Open-Source Fine-Tuning vs Closed API Fine-Tuning

The landscape for fine-tuning divides sharply between open-source models you train yourself and closed-model fine-tuning APIs offered by providers.

Open-source fine-tuning

Models like Meta Llama 3.3, Qwen 2.5, Mistral, and Gemma 3 can be fine-tuned on your own infrastructure or cloud compute. The workflow typically uses:

Hugging Face Transformers + PEFT library for LoRA/QLoRA
TRL (Transformer Reinforcement Learning) for SFT and RLHF
Axolotl as a higher-level orchestration layer for fine-tuning runs
Unsloth for significantly faster training with optimized CUDA kernels

You own the resulting weights. You can deploy anywhere, run locally, and fine-tune iteratively without per-call API costs. The tradeoff is infrastructure complexity and the cost of managing training compute.

python

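import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# Note: a 70B model needs roughly 35GB for its 4-bit weights alone.
# Swap in a smaller checkpoint to fit a single 24GB consumer GPU.
model_id = "meta-llama/Llama-3.3-70B-Instruct"

# QLoRA: load the frozen base model in 4-bit NF4 and compute in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for adapter training

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# "train.jsonl" is a placeholder path: chat-format JSONL as shown in the SFT section
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./fine-tuned-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
    ),
    train_dataset=train_dataset,
)
trainer.train()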

Closed-model fine-tuning APIs

OpenAI offered fine-tuning for GPT-3.5 and GPT-4 series models. The workflow was simpler — upload a JSONL file, call the fine-tuning API, get a fine-tuned model ID back — but the resulting weights were hosted by OpenAI. As covered in OpenAI Winds Down Fine-Tuning API, OpenAI announced in May 2026 that it was winding down its fine-tuning platform, giving customers until January 6, 2027 to create new training jobs.

Anthropic has taken a different position: there is no self-serve fine-tuning endpoint for Claude models in Anthropic's own API. The only publicly documented route has been fine-tuning Claude 3 Haiku through Amazon Bedrock. The rationale usually cited centers on safety: fine-tuning is a mechanism by which carefully trained alignment properties can be degraded, intentionally or not. For most teams, customizing Claude therefore means system prompts and long context rather than weight updates.

Open-source models are increasingly fine-tunable without the API costs of closed systems.

What Fine-Tuning Cannot Do

Understanding the limits of fine-tuning is as important as understanding what it enables.

1. Fine-tuning cannot reliably add new factual knowledge

This is the most common misconception. If you train a model on 1,000 examples containing facts that were not in the pretraining data, the model may appear to learn those facts during training — but knowledge injection through fine-tuning is unreliable. The model is better at interpolating between things it already knows than at genuinely memorizing new facts from fine-tuning data.

Catastrophic forgetting compounds this: as a model fine-tunes heavily on new data, it can degrade performance on knowledge it had before. This is why RAG is the correct tool for knowledge that changes or that was never in pretraining, while fine-tuning is the correct tool for behavior and style adaptation.

2. Fine-tuning cannot make a weak base model strong

Fine-tuning extracts and sharpens capabilities the base model already has latent. If the base model cannot reason through multi-step legal analysis, fine-tuning it on 10,000 legal documents will improve its legal vocabulary and formatting — but not its fundamental reasoning quality.

The most important decision in any fine-tuning project is: choose the right base model first. A well-fine-tuned Llama 3.3 70B will outperform a heavily fine-tuned 7B model on complex tasks. The base model's pretraining scale sets the ceiling; fine-tuning moves you closer to that ceiling on a specific task.

3. Fine-tuning can introduce bias if the dataset is biased

Your fine-tuning dataset is a direct lever on model behavior. If your dataset systematically underrepresents certain cases, overrepresents a particular perspective, or contains errors, the fine-tuned model will reflect those biases more strongly than the base model did. A biased reward model in an RLHF pipeline will produce a biased fine-tuned model.

Dataset curation — reviewing examples, ensuring diversity of cases, checking for label errors — is not optional overhead. It is where most fine-tuning quality problems originate.

4. Fine-tuning has ongoing maintenance cost

Unlike prompting, which you can iterate on daily, a fine-tuned model requires retraining to update. If your task requirements evolve, you need to rebuild the dataset and retrain. For tasks that change rapidly, this maintenance overhead may exceed the inference savings.

Practical Guide: Running Your First Fine-Tuning Job

Step 1: Define the task precisely

Write down, in one paragraph, exactly what behavior you want the fine-tuned model to have that the base model plus a good system prompt does not reliably produce. If you cannot write that paragraph, you are not ready to fine-tune.

Step 2: Prepare the dataset

Format: JSONL with conversation turns, as shown in the SFT section above.

Size: Start with 100-500 examples. You can always add more. Starting too large means slow iteration cycles.

Quality checks:

Manually review a random 10% sample before training
Remove examples where the assistant output is wrong, unclear, or inconsistent
Ensure the system prompt in your training data exactly matches what you will use at inference
Balance the dataset — if 90% of examples are about one subtopic and 10% about another, the model will underperform on the minority

Split: Reserve 10-20% of examples as a hold-out evaluation set. Never train on your eval set.
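A reproducible split is worth scripting so the same examples stay in the eval set across runs. This is a minimal sketch with a fixed seed; the function name and defaults are illustrative, and deduplication should happen before the split so near-identical examples do not leak across it.

```python
import random


def split_dataset(examples, eval_fraction=0.1, seed=42):
    """Shuffle deterministically, then return (train, eval) with no overlap."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for 200 training examples, identified by index.
train, eval_set = split_dataset(range(200), eval_fraction=0.15)
print(len(train), len(eval_set))
```

Keeping the seed fixed means later runs with more data can still be compared against earlier ones on a stable eval set, as long as new examples are added to the training side only.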

Step 3: Choose your training setup

For open-source models with QLoRA:

Hyperparameter	Typical starting value	Notes
Learning rate	2e-4 to 5e-4	Higher than full fine-tuning; LoRA adapters train faster
LoRA rank (r)	16	Increase to 32-64 for complex tasks
LoRA alpha	2× rank	Controls scaling of the adapter output
Batch size	4-8 per device	Increase with gradient accumulation if OOM
Epochs	2-4	More epochs = more overfitting risk on small datasets
Warmup ratio	0.03	Gradual LR warmup for stability
LR scheduler	cosine	Decays LR smoothly over training
Max sequence length	2048-4096	Match your inference context window
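Several of these settings interact: batch size and gradient accumulation determine how many optimizer steps a run has, and the warmup ratio and cosine scheduler are defined relative to that step count. The sketch below approximates the arithmetic; it is a simplified model of linear warmup followed by cosine decay, not the exact schedule any particular library implements, and the dataset size is a hypothetical value.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum_steps,
                      num_devices, epochs, warmup_ratio):
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))


# 500 examples (hypothetical), batch 4 with 4 accumulation steps on one GPU.
schedule = training_schedule(500, 4, 4, 1, epochs=3, warmup_ratio=0.03)
effective_batch, total_steps, warmup_steps = schedule
print(schedule)
for step in (0, warmup_steps, total_steps // 2, total_steps - 1):
    print(step, f"{lr_at_step(step, total_steps, warmup_steps, 2e-4):.2e}")
```

With a small dataset the whole run is under a hundred optimizer steps, which is one reason extra epochs overfit quickly.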

Step 4: Evaluate

Do not rely on training loss alone. Evaluate on your held-out set with the same metrics you care about in production:

Format adherence: Does the model consistently produce the expected output format?
Domain accuracy: Does a domain expert rate the outputs as correct?
Regression testing: Does the fine-tuned model still handle edge cases the base model handled well?
A/B comparison: Do raters prefer the fine-tuned model over the base model + system prompt?
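Of these checks, format adherence is the simplest to automate. The sketch below measures the share of held-out outputs that parse as JSON objects containing a set of required keys; the key names and sample outputs are hypothetical.

```python
import json


def format_adherence(outputs, required_keys):
    """Fraction of outputs that are JSON objects containing every required key."""
    passed = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            passed += 1
    return passed / len(outputs)


# Hypothetical model outputs collected on the held-out set.
outputs = [
    '{"intent": "refund", "reply": "I can help with that."}',
    '{"intent": "refund"}',
    'Sure! Here is the JSON you asked for: {"intent": "refund"}',
    '{"intent": "billing", "reply": "Go to Account Settings.", "confidence": 0.9}',
]
score = format_adherence(outputs, {"intent", "reply"})
print(f"format adherence: {score:.0%}")
```

Running the same function on the base model with its system prompt gives a direct baseline for the comparison.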

Step 5: Iterate

Fine-tuning is iterative. The first run rarely produces the best model. Common issues and fixes:

Problem	Likely cause	Fix
Model ignores system prompt	System prompt not in training data	Add system prompt to every training example
Outputs too short / too long	Training examples are too short / too long	Adjust training data length distribution
Model forgets base knowledge	Learning rate too high or too many epochs	Reduce LR, cap training at 3 epochs, use LoRA (freezes base weights)
Behavior inconsistent	Dataset too small or too noisy	Add examples, manually clean dataset
Format failures	Not enough format-critical examples	Add more examples that exercise the format

Decide whether fine-tuning is the right intervention

Before building a training pipeline, use the prompt engineering vs fine-tuning vs RAG decision tree. It separates instruction failures, missing knowledge, and stable behavior problems, with a worked support-system architecture. The accompanying open-weight vs closed-model framework covers whether weight-level control justifies self-hosting.

The 2026 Context: RL-Based Post-Training and the Fine-Tuning Landscape

The fine-tuning story in 2026 is increasingly about RL-based post-training, not just SFT.

The pattern that labs discovered — and that smaller teams are now replicating with open-source models — is that reinforcement learning on verifiable outcomes is dramatically more efficient than SFT for tasks with clear success criteria. Coding (does the code pass the tests?), math (is the answer correct?), and tool use (did the tool call succeed?) all have reward signals that can be computed automatically. This removes the need for expensive human preference data.

The implication for fine-tuning practitioners: for tasks where you have a verifiable outcome, consider building an RL fine-tuning pipeline rather than stopping at SFT. The TRL library supports GRPO (Group Relative Policy Optimization), which drops PPO's separate value model and therefore needs less memory and infrastructure for small-scale RL fine-tuning.
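The core of the group-relative idea fits in a few lines: sample several completions for one prompt, score each with an automatic check, and normalize the rewards within the group. The sketch below is a simplification that assumes an exact-match reward and population standard deviation; it omits the policy update itself, and the function names are illustrative.

```python
import statistics


def exact_match_reward(model_answer, expected_answer):
    """Verifiable reward: 1.0 if the answer matches after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == expected_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for a prompt whose correct answer is "42".
completions = ["42", " 42 ", "41", "forty-two"]
rewards = [exact_match_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 2) for a in advantages])
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are discouraged. No learned reward model or human rater is involved.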

The broader industry shift is also moving away from single large fine-tuned models toward specialized small models that are distilled from frontier RL checkpoints, as VibeThinker 3B illustrates. This is a different cost structure: instead of paying per token to a frontier API, you pay once for training a small model and then pay only for your own inference compute. For well-defined production tasks, these economics increasingly favor fine-tuning.

Meanwhile, OpenAI's wind-down of its fine-tuning API and Anthropic's lack of a first-party fine-tuning API are pushing enterprise teams toward open-source model fine-tuning, a trend that is accelerating the maturity of open-source fine-tuning tooling.

The Three Ways to Customize an LLM

When a pre-trained language model does not behave the way you need it to, you have three levers: prompting, fine-tuning, and training from scratch.

For a deeper look at what those weights actually are and how parameter counts translate to model capability, see What are parameters in a large language model?

Why Fine-Tune? The Business Case

Five concrete reasons to choose fine-tuning over longer prompts:

1. Style and format consistency

2. Proprietary domain knowledge

3. Reduced prompt length

4. Latency improvement

Shorter prompts mean faster first-token latency. For latency-sensitive applications (voice agents, real-time coding assistants), shaving 1-2 seconds off every response matters.

5. Cost reduction vs frontier models

Supervised Fine-Tuning (SFT): The Foundation

What the dataset looks like

The industry-standard format for SFT datasets is JSONL (JSON Lines), where each line is a training example. For instruction-following models, examples typically use a chat template:

jsonl

{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Inc. Always respond concisely and offer a follow-up action."}, {"role": "user", "content": "My order hasn't arrived yet."}, {"role": "assistant", "content": "I'm sorry to hear that. Can you share your order number? I'll check the status and, if needed, arrange a re-shipment within 24 hours."}]}
{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Inc. Always respond concisely and offer a follow-up action."}, {"role": "user", "content": "How do I update my billing address?"}, {"role": "assistant", "content": "Go to Account Settings > Billing > Edit Address. The change takes effect on your next invoice. Would you like me to walk you through each step?"}]}

Each line is a complete conversation. The model learns to produce the assistant turn given the preceding context.

How many examples do you need?

This is the question every fine-tuning project starts with, and the honest answer is: it depends on task complexity.

Task type	Rough example count	Notes
Format / schema change	50–200	JSON output, specific template adherence
Tone / style change	100–500	Brand voice, formality level
Domain terminology	500–2,000	Medical, legal, financial jargon
Complex reasoning patterns	2,000–20,000+	Multi-step domain-specific logic
General task specialization	10,000–100,000	Broad domain adaptation

The training loop

RLHF: From Correct to Preferred

Reinforcement Learning from Human Feedback (RLHF) addresses this. The process has three stages:

Stage 1: SFT Train an initial model using supervised fine-tuning on high-quality demonstrations, as described above.

LoRA and QLoRA: Fine-Tuning Without Full GPU Clusters

LoRA (Low-Rank Adaptation) changes the math dramatically.

The core idea

python

# Conceptually, LoRA changes:
# output = W @ input
# to:
# output = (W + A @ B) @ input
# where W is frozen and only A, B are trained

Why this matters in practice

Approach	7B model GPU memory	Hardware requirement
Full fine-tuning (bf16)	~80-100GB	2-4× A100 80GB
LoRA (r=16)	~16-20GB	1× A100 40GB or RTX 4090
QLoRA (4-bit + r=16)	~8-12GB	RTX 3090 / RTX 4090

Knowledge Distillation: The Teacher-Student Model

Distillation from RL checkpoints

Distillation is especially valuable when:

You need a model small enough to run locally or on edge devices
Your task is well-defined and the teacher model performs it well
You want to avoid the cost of RL training on the student directly
Inference cost is a hard constraint (smaller model = faster + cheaper)

Fine-Tuning vs RAG vs Prompting: A Decision Matrix

This is the question practitioners get wrong most often. The answer is not "one is better" — the three approaches solve different problems and are often combined.

Dimension	Prompting	RAG	Fine-Tuning
Knowledge type	General (whatever base model knows)	Dynamic, retrieved from external store	Static, baked into weights
Data freshness	Real-time via prompt	Real-time via retrieval	Stale after training
Setup cost	Minimal	Medium (build retrieval pipeline)	High (dataset + training)
Inference cost	Higher (long prompts)	Medium (retrieval + shorter prompt)	Lower (short prompts)
Behavior consistency	Variable (prompt sensitive)	Variable	High (baked in)
Best for	Flexible tasks, unclear requirements	Changing data, document Q&A	Stable tasks, format/style/domain

When to choose each

Use prompting when: requirements are still evolving, you need to handle a wide variety of tasks, or the task involves general reasoning where the base model already performs adequately.

For the broader question of whether to use open-source fine-tunable models vs closed API models, see Closed-source AI vs local open-source alternatives 2026.

Open-Source Fine-Tuning vs Closed API Fine-Tuning

The landscape for fine-tuning divides sharply between open-source models you train yourself and closed-model fine-tuning APIs offered by providers.

Open-source fine-tuning

Models like Meta Llama 3.3, Qwen 2.5, Mistral, and Gemma 3 can be fine-tuned on your own infrastructure or cloud compute. The workflow typically uses:

Hugging Face Transformers + PEFT library for LoRA/QLoRA
TRL (Transformer Reinforcement Learning) for SFT and RLHF
Axolotl as a higher-level orchestration layer for fine-tuning runs
Unsloth for significantly faster training with hand-optimized GPU kernels

python

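import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

# QLoRA: quantize the frozen base model to 4-bit NF4 via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)
model = prepare_model_for_kbit_training(model)  # ready the quantized model for training

lora_config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights are trainable

# Placeholder path: point this at your own JSONL training file
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./fine-tuned-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
    ),
    train_dataset=train_dataset,
)
trainer.train()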
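
To get a feel for how small the adapters are, the arithmetic can be sketched directly. Each adapted weight matrix of shape d_out × d_in gains two low-rank matrices with r × (d_in + d_out) parameters in total. The dimensions below (hidden size 4096, 32 layers, square q_proj and v_proj) are illustrative assumptions for a 7B-class model, not the dimensions of the 70B model loaded above, and the helper name is hypothetical.

```python
def lora_trainable_params(rank, layer_shapes, num_layers):
    """Count LoRA adapter parameters.

    layer_shapes holds one (d_in, d_out) pair per adapted matrix in a layer.
    Each adapter adds A (rank x d_in) and B (d_out x rank).
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers


# Assumed dimensions for a 7B-class model (illustrative only).
hidden = 4096
num_layers = 32
shapes = [(hidden, hidden), (hidden, hidden)]  # q_proj and v_proj

n = lora_trainable_params(rank=16, layer_shapes=shapes, num_layers=num_layers)
print(f"{n:,} trainable adapter parameters")  # 8,388,608 trainable adapter parameters
print(f"share of a 7B model: {n / 7e9:.2%}")
```

Under these assumptions the adapters come to roughly 8 million parameters, about a tenth of a percent of the base model. Raising the rank or adapting more projection matrices scales the count linearly.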

Closed-model fine-tuning APIs

Open-source models are increasingly fine-tunable without the API costs of closed systems.

What Fine-Tuning Cannot Do

Understanding the limits of fine-tuning is as important as understanding what it enables.

1. Fine-tuning cannot reliably add new factual knowledge

2. Fine-tuning cannot make a weak base model strong

3. Fine-tuning can introduce bias if the dataset is biased

Dataset curation — reviewing examples, ensuring diversity of cases, checking for label errors — is not optional overhead. It is where most fine-tuning quality problems originate.

4. Fine-tuning has ongoing maintenance cost

Practical Guide: Running Your First Fine-Tuning Job

Step 1: Define the task precisely

Step 2: Prepare the dataset

Format: JSONL with conversation turns, as shown in the SFT section above.

Size: Start with 100-500 examples. You can always add more. Starting too large means slow iteration cycles.

Quality checks:

Manually review a random 10% sample before training
Remove examples where the assistant output is wrong, unclear, or inconsistent
Ensure the system prompt in your training data exactly matches what you will use at inference
Balance the dataset — if 90% of examples are about one subtopic and 10% about another, the model will underperform on the minority

Split: Reserve 10-20% of examples as a hold-out evaluation set. Never train on your eval set.
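
The quality checks and the hold-out split above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a chat-style record with a "messages" list of role/content pairs; the field names, the example system prompt, and the helper names are all assumptions, so adapt them to whatever format your training data actually uses.

```python
import random

SYSTEM_PROMPT = "You are a support assistant."  # hypothetical inference-time prompt


def find_problems(example, system_prompt=SYSTEM_PROMPT):
    """Return a list of problems found in one chat-format example."""
    problems = []
    messages = example.get("messages", [])
    roles = [m.get("role") for m in messages]
    if (not messages or roles[0] != "system"
            or messages[0].get("content") != system_prompt):
        problems.append("system prompt missing or different from inference")
    if "assistant" not in roles:
        problems.append("no assistant turn")
    if any(not str(m.get("content", "")).strip() for m in messages):
        problems.append("empty message content")
    return problems


def train_eval_split(examples, eval_fraction=0.15, seed=0):
    """Shuffle deterministically and reserve a hold-out evaluation set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def make_example(question, answer):
    """Build one well-formed record (demo helper)."""
    return {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}


examples = [make_example(f"question {i}", f"answer {i}") for i in range(20)]
examples.append({"messages": [{"role": "user", "content": "hi"}]})  # deliberately broken

clean = [ex for ex in examples if not find_problems(ex)]
train, held_out = train_eval_split(clean)
print(len(clean), len(train), len(held_out))  # 20 17 3
```

Automated checks like these catch structural problems only. They do not replace the manual review of a random sample, which is where wrong or unclear answers get caught.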

Step 3: Choose your training setup

For open-source models with QLoRA:

Hyperparameter	Typical starting value	Notes
Learning rate	2e-4 to 5e-4	Higher than full fine-tuning; LoRA adapters train faster
LoRA rank (r)	16	Increase to 32-64 for complex tasks
LoRA alpha	2× rank	Controls scaling of the adapter output
Batch size	4-8 per device	Increase with gradient accumulation if OOM
Epochs	2-4	More epochs = more overfitting risk on small datasets
Warmup ratio	0.03	Gradual LR warmup for stability
LR scheduler	cosine	Decays LR smoothly over training
Max sequence length	2048-4096	Match your inference context window
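
A few of these values interact, and it can help to work the arithmetic before launching a run. The sketch below estimates the effective batch size, the number of optimizer steps, and the warmup length from the table's starting values. The function name and the 500-example dataset are assumptions for illustration, and a real trainer may round steps slightly differently.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum_steps,
                      num_devices, epochs, warmup_ratio):
    """Estimate effective batch size, total optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# 500 examples, batch size 4 with 4 accumulation steps on one GPU, 3 epochs.
print(training_schedule(500, 4, 4, 1, 3, 0.03))  # (16, 96, 3)

# LoRA scales the adapter output by alpha / r, so "alpha = 2 x rank" gives 2.0.
rank, alpha = 16, 32
print(alpha / rank)  # 2.0
```

With only 96 optimizer steps, a 0.03 warmup ratio amounts to about three steps, which is one reason small datasets are sensitive to the learning rate.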

Step 4: Evaluate

Do not rely on training loss alone. Evaluate on your held-out set with the same metrics you care about in production:

Format adherence: Does the model consistently produce the expected output format?
Domain accuracy: Does a domain expert rate the outputs as correct?
Regression testing: Does the fine-tuned model still handle edge cases the base model handled well?
A/B comparison: Do raters prefer the fine-tuned model over the base model + system prompt?
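
Format adherence is the easiest of these checks to automate. As a minimal sketch, assuming the expected format is a JSON object with a known set of keys (the key names and sample outputs here are made up for illustration), the metric is the fraction of held-out outputs that parse and contain every required key:

```python
import json


def format_adherence(outputs, required_keys):
    """Fraction of outputs that are JSON objects containing all required keys."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            ok += 1
    return ok / len(outputs)


sample_outputs = [
    '{"label": "billing", "score": 0.9}',  # valid
    'Sure! Here is the JSON you asked for',  # not JSON
    '{"label": "refund"}',                   # missing "score"
    '[1, 2]',                                # JSON, but not an object
]
print(format_adherence(sample_outputs, {"label", "score"}))  # 0.25
```

Running the same function on the base model with a system prompt gives a baseline for comparison with the fine-tuned model.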

Step 5: Iterate

Fine-tuning is iterative. The first run rarely produces the best model. Common issues and fixes:

Problem	Likely cause	Fix
Model ignores system prompt	System prompt not in training data	Add system prompt to every training example
Outputs too short / too long	Training examples are too short / too long	Adjust training data length distribution
Model forgets base knowledge	Learning rate too high or too many epochs	Reduce LR, cap training at 3 epochs, use LoRA (freezes base weights)
Behavior inconsistent	Dataset too small or too noisy	Add examples, manually clean dataset
Format failures	Not enough format-critical examples	Add more examples that exercise the format

Decide whether fine-tuning is the right intervention

The 2026 Context: RL-Based Post-Training and the Fine-Tuning Landscape

The fine-tuning story in 2026 is increasingly about RL-based post-training, not just SFT.

Related Reading

Related posts

What Is an Embedding? Plain-English Examples (2026)

Is AI Conscious? The Philosophy Behind the Question Everyone Is Afraid to Ask

The History of Artificial Intelligence: From Turing's 1950 Test to AGI in 2026
