The Three Ways to Customize an LLM
When a pre-trained language model does not behave the way you need it to, you have three levers: prompting, fine-tuning, and training from scratch.
Prompting is the cheapest lever. You write a system message that describes the behavior you want. No weights change. Every inference call carries the full prompt. The model's underlying behavior is unchanged — you are steering a general-purpose model with instructions on each request.
Training from scratch is the most expensive lever by orders of magnitude. You start with randomly initialized weights and train on hundreds of billions to trillions of tokens. This is what labs like Anthropic, OpenAI, and Meta do to create base models. The compute cost runs from millions to tens of millions of dollars or more for frontier-scale models. Almost no application team has a reason to do this.
Fine-tuning sits between those extremes. You take an existing pre-trained model — one that already understands language, can follow instructions, and has broad world knowledge — and continue training it on a curated, task-specific dataset. The model's weights are updated, but you start from a rich initialization rather than random noise. The result is a model that has internalized the new behavior without you bearing the full cost of pretraining.
The distinction from prompting is important: after fine-tuning, the behavior is baked into the weights. You do not need to re-explain it on every call. This has downstream effects on latency, cost, and consistency.
For a deeper look at what those weights actually are and how parameter counts translate to model capability, see What are parameters in a large language model?
Why Fine-Tune? The Business Case
Five concrete reasons to choose fine-tuning over longer prompts:
1. Style and format consistency
If you need a model to always output structured JSON, respond in a specific brand voice, or follow a proprietary document format, fine-tuning achieves this more reliably than prompt instructions. Adherence to prompt instructions can weaken as the context grows and the instructions compete with other content for the model's attention; fine-tuned behavior is part of the model's weights, so it is much less likely to drift over a long conversation.
2. Proprietary domain knowledge
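Pre-trained models are trained largely on public internet data up to a knowledge cutoff. They do not know your internal codebase conventions, your company's product taxonomy, your medical institution's clinical protocols, or your law firm's citation style. Fine-tuning on internal examples teaches the model those conventions, vocabulary, and patterns, so it applies them by default instead of relying on documents pasted into the context window. For discrete facts that must be recalled exactly, retrieval remains the safer tool, as discussed under What Fine-Tuning Cannot Do.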
3. Reduced prompt length
A fine-tuned model has instructions baked in. A prompt that previously required 800 tokens of system instructions might drop to 50 tokens after fine-tuning. At scale — millions of inference calls per day — this is a meaningful cost reduction.
4. Latency improvement
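Shorter prompts mean fewer tokens to process before generation starts, which lowers time-to-first-token. For latency-sensitive applications (voice agents, real-time coding assistants), even a modest saving on every response matters. The size of the gain depends on your hardware and serving stack, so measure it rather than assume it.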
5. Cost reduction vs frontier models
A fine-tuned 7B or 13B open-source model running on your own infrastructure can match or outperform a raw call to a frontier API on narrow tasks, at a fraction of the per-token cost. If your task is well-defined and your fine-tuning dataset is good, a frontier API means paying for general reasoning you do not need.
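To make the prompt-length argument from reason 3 concrete, here is a rough cost sketch. The 800-token and 50-token prompt sizes come from the text above; the call volume and the per-token price are hypothetical placeholders, so substitute your own figures.

```python
def daily_prompt_cost(prompt_tokens: int, calls_per_day: int,
                      usd_per_million_tokens: float) -> float:
    """Input-token cost per day attributable to a fixed system prompt."""
    return prompt_tokens * calls_per_day * usd_per_million_tokens / 1_000_000


# Hypothetical workload (assumptions, not figures from the article):
CALLS_PER_DAY = 2_000_000
USD_PER_MILLION_INPUT_TOKENS = 0.50

before = daily_prompt_cost(800, CALLS_PER_DAY, USD_PER_MILLION_INPUT_TOKENS)
after = daily_prompt_cost(50, CALLS_PER_DAY, USD_PER_MILLION_INPUT_TOKENS)
savings = before - after

print(f"before=${before:,.2f}/day  after=${after:,.2f}/day  saved=${savings:,.2f}/day")
```

The saving scales linearly with call volume, which is why this argument only becomes compelling at high traffic.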
Supervised Fine-Tuning (SFT): The Foundation
The most common and straightforward form of fine-tuning is supervised fine-tuning. The setup is simple: you create a dataset of input-output pairs that represent the behavior you want the model to learn, then train the model to predict the output given the input.
What the dataset looks like
The industry-standard format for SFT datasets is JSONL (JSON Lines), where each line is a training example. For instruction-following models, examples typically use a chat template:
{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Inc. Always respond concisely and offer a follow-up action."}, {"role": "user", "content": "My order hasn't arrived yet."}, {"role": "assistant", "content": "I'm sorry to hear that. Can you share your order number? I'll check the status and, if needed, arrange a re-shipment within 24 hours."}]}
{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Inc. Always respond concisely and offer a follow-up action."}, {"role": "user", "content": "How do I update my billing address?"}, {"role": "assistant", "content": "Go to Account Settings > Billing > Edit Address. The change takes effect on your next invoice. Would you like me to walk you through each step?"}]}
Each line is a complete conversation. The model learns to produce the assistant turn given the preceding context.
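Before training, it helps to check every line of the file against the structure shown above. The following is a minimal validation sketch using only the standard library; the function name `validate_example` and the specific checks are illustrative choices, not part of any library.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")

    # The model learns to produce the final assistant turn, so one must exist.
    last = messages[-1]
    if not (isinstance(last, dict) and last.get("role") == "assistant"):
        problems.append("last message is not an assistant turn")
    return problems
```

Running this over each line and printing the line number alongside any problems is usually enough to catch malformed exports before they waste a training run.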
How many examples do you need?
This is the question every fine-tuning project starts with, and the honest answer is: it depends on task complexity.
| Task type | Rough example count | Notes |
|---|---|---|
| Format / schema change | 50–200 | JSON output, specific template adherence |
| Tone / style change | 100–500 | Brand voice, formality level |
| Domain terminology | 500–2,000 | Medical, legal, financial jargon |
| Complex reasoning patterns | 2,000–20,000+ | Multi-step domain-specific logic |
| General task specialization | 10,000–100,000 | Broad domain adaptation |
Quality dominates quantity. A dataset of 300 carefully reviewed, representative examples consistently outperforms 3,000 scraped, noisy ones. Before scaling your dataset, invest in dataset quality: remove duplicates, fix inconsistencies, and manually review a random sample.
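As one example of the cleanup described above, here is a small sketch that removes duplicate conversations. It treats two examples as duplicates when their messages match after lowercasing and whitespace normalization; that definition is an assumption, and stricter or fuzzier matching may suit your data better.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def example_key(example: dict) -> str:
    """Stable fingerprint of a chat-format example."""
    joined = "\n".join(
        f"{m['role']}:{normalize(m['content'])}" for m in example["messages"]
    )
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct example, preserving order."""
    seen = set()
    unique = []
    for example in examples:
        key = example_key(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique
```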
The training loop
During SFT, the model runs its standard forward pass on each example and computes a cross-entropy loss between its predicted token probabilities and the target tokens in the training output. Gradients flow backward through the network and the optimizer updates the weights. The model is not generating text during training — it is being scored against the ground truth output token by token.
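The token-by-token scoring can be illustrated with a toy calculation. This sketch uses a three-token vocabulary and hand-written logits (both made up for illustration) and computes the mean cross-entropy over the target tokens; real training does the same thing over the full vocabulary with a tensor library.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(logits_per_position: list[list[float]],
                           target_ids: list[int]) -> float:
    """Mean negative log-probability assigned to the ground-truth tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example: two positions, vocabulary of three tokens.
logits = [[2.0, 0.0, 0.0],   # model favors token 0, and the target is token 0
          [0.0, 0.0, 0.0]]   # model is undecided, and the target is token 2
targets = [0, 2]
loss = sequence_cross_entropy(logits, targets)
print(f"mean cross-entropy: {loss:.4f}")
```

The first position contributes a small loss because the model already favors the correct token; the second contributes a larger one because the model spreads its probability evenly.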
RLHF: From Correct to Preferred
Supervised fine-tuning teaches a model to produce outputs that match your examples. But matching examples is not the same as producing outputs that humans genuinely prefer — especially when there are multiple valid answers of differing quality.
Reinforcement Learning from Human Feedback (RLHF) addresses this. The process has three stages:
Stage 1: SFT. Train an initial model using supervised fine-tuning on high-quality demonstrations, as described above.
Stage 2: Reward model training. Collect human preference data: show raters pairs of model outputs for the same prompt and ask which they prefer. Train a separate reward model that learns to predict human preference scores for arbitrary model outputs.
Stage 3: RL optimization. Use the reward model as a signal to optimize the SFT model via reinforcement learning, typically Proximal Policy Optimization (PPO). The model generates responses, the reward model scores them, and the RL update pushes the model toward higher-scoring responses. A KL-divergence penalty keeps the model from drifting too far from the SFT baseline.
The result is a model that does not merely reproduce training examples but produces outputs that score highly on human preference — more helpful, more truthful, less harmful. This is the training approach behind InstructGPT, ChatGPT, and most production-grade chat models.
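Two pieces of this pipeline reduce to short formulas. The sketch below shows a common formulation of each: a pairwise (Bradley-Terry style) loss for the reward model in Stage 2, and a KL-penalized reward for Stage 3. The function names and the `beta` value are illustrative assumptions, and production implementations differ in detail.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


def kl_shaped_reward(reward: float, logprob_policy: float,
                     logprob_reference: float, beta: float = 0.1) -> float:
    """Reward minus a penalty for assigning more probability than the SFT reference does.

    (logprob_policy - logprob_reference) is a per-sample estimate of the KL term.
    """
    return reward - beta * (logprob_policy - logprob_reference)


# The reward model is penalized more when it ranks the pair the wrong way round.
print(pairwise_preference_loss(2.0, 0.0))  # correct ordering, lower loss
print(pairwise_preference_loss(0.0, 2.0))  # wrong ordering, higher loss

# A response the policy likes much more than the reference model gets its reward reduced.
print(kl_shaped_reward(reward=1.0, logprob_policy=-5.0, logprob_reference=-7.0))
```

Raising `beta` keeps the policy closer to the SFT baseline at the cost of slower reward improvement.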
For a deeper treatment of RLHF and how it connects to Constitutional AI and scalable oversight — including why human feedback alone cannot scale to frontier model complexity — see Scalable oversight: from human feedback to constitutions and "weak-to-strong" intuition.
LoRA and QLoRA: Fine-Tuning Without Full GPU Clusters
Full fine-tuning of a large language model updates every parameter in the network. A 7B parameter model in 16-bit precision occupies roughly 14GB of memory just to store the weights — before accounting for optimizer states, gradients, and activations, which typically multiply memory requirements by 4-8x. Full fine-tuning of a 7B model in practice requires roughly 80-100GB of GPU memory. That means multiple A100s.
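The arithmetic behind those figures is simple enough to script. This sketch computes weight storage and then applies the 4-8x multiplier quoted above; it treats 1 GB as 10^9 bytes and is a back-of-the-envelope estimate, not a profiler.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(7e9, 2)   # 7B parameters at 16-bit precision
low_estimate = weights_gb * 4           # lower end of the 4-8x training multiplier
high_estimate = weights_gb * 8          # upper end

print(f"weights: {weights_gb:.0f} GB, training estimate: "
      f"{low_estimate:.0f}-{high_estimate:.0f} GB")
```

The 80-100GB figure quoted above sits inside this range; where a given run lands depends on the optimizer, sequence length, and batch size.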
LoRA (Low-Rank Adaptation) changes the math dramatically.
The core idea
Instead of updating the full weight matrix W (which might be 4096×4096 ≈ 16.8M parameters), LoRA freezes W and adds two small matrices: A (4096×r) and B (r×4096), where r is a small "rank" parameter, typically 4, 8, 16, or 64. The effective weight update is the product A×B, which has the same shape as W but is defined by only r×(4096+4096) trainable parameters. At rank 16, that's 16 × 8192 = 131,072 parameters instead of 16.8M, a 128× reduction for this matrix. Only A and B are trained.
# Conceptually, LoRA changes:
# output = W @ input
# to:
# output = (W + A @ B) @ input
# where W is frozen and only A, B are trained
The rank hyperparameter r controls the expressiveness of the update. Higher rank means more capacity to capture changes — but also more parameters and more memory. For most fine-tuning tasks, r=8 to r=32 is a good starting point.
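The parameter arithmetic and the forward pass can both be checked with a toy example. This sketch uses plain Python lists and a 4×4 matrix at rank 1 (sizes chosen only to keep the numbers readable) and follows this article's convention of A as d×r and B as r×d. It also shows that applying the adapter separately gives the same output as merging it into W.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in A (d_out x r) plus B (r x d_in)."""
    return r * (d_out + d_in)


print(lora_param_count(4096, 4096, 16), "vs", 4096 * 4096)

# Toy case: frozen 4x4 identity W, rank-1 adapter.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
A = [[1.0], [0.0], [0.0], [0.0]]   # 4 x 1
B = [[0.0, 2.0, 0.0, 0.0]]         # 1 x 4
x = [1.0, 1.0, 1.0, 1.0]

# Merged form: (W + A @ B) @ x
delta = matmul(A, B)
merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
y_merged = matvec(merged, x)

# Adapter form: W @ x + A @ (B @ x), which never builds the full 4x4 update
y_adapter = [wx + ax for wx, ax in zip(matvec(W, x), matvec(A, matvec(B, x)))]

print(y_merged, y_adapter)
```

The adapter form is what makes training cheap; the merged form is what lets you fold the adapter into the base weights for deployment with no extra inference cost.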
Why this matters in practice
| Approach | 7B model GPU memory | Hardware requirement |
|---|---|---|
| Full fine-tuning (bf16) | ~80-100GB | 2-4× A100 80GB |
| LoRA (r=16) | ~16-20GB | 1× A100 40GB or RTX 4090 |
| QLoRA (4-bit + r=16) | ~8-12GB | RTX 3090 / RTX 4090 |
QLoRA combines LoRA with 4-bit quantization of the frozen base model weights. The base model is loaded in NF4 (Normal Float 4) format, and LoRA adapters are trained in bf16. This roughly halves the memory footprint again, making 7B fine-tuning feasible on a single consumer GPU with 24GB VRAM.
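A quick calculation shows where the savings come from. The sketch below assumes a Llama-style 7B layout (32 layers, hidden size 4096) with rank-16 adapters on two square projection matrices per layer; those architecture numbers are assumptions for illustration. It counts only weight storage and ignores quantization constants, activations, and optimizer state.

```python
# Assumed architecture (illustrative, roughly a Llama-style 7B model):
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2   # e.g. the query and value projections
RANK = 16
BASE_PARAMS = 7e9

# Each adapted matrix gets A (hidden x r) and B (r x hidden).
adapter_params = (NUM_LAYERS * ADAPTED_MATRICES_PER_LAYER
                  * RANK * (HIDDEN_SIZE + HIDDEN_SIZE))

base_bf16_gb = BASE_PARAMS * 2 / 1e9      # 16-bit base weights (plain LoRA)
base_4bit_gb = BASE_PARAMS * 0.5 / 1e9    # 4-bit base weights (QLoRA)
adapter_bf16_mb = adapter_params * 2 / 1e6

print(f"trainable adapter params: {adapter_params:,}")
print(f"base weights: {base_bf16_gb:.1f} GB in bf16, {base_4bit_gb:.1f} GB in 4-bit")
print(f"adapter weights: {adapter_bf16_mb:.1f} MB in bf16")
```

Under these assumptions the adapters are a rounding error next to the base weights, so quantizing the frozen base is where nearly all of the additional saving comes from.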
The practical implication: you can fine-tune a capable open-source model for a specialized task on hardware that costs a few hundred dollars per month to rent, rather than thousands. This is why LoRA and QLoRA have become the default approach for fine-tuning in production engineering teams.
Knowledge Distillation: The Teacher-Student Model
Fine-tuning with LoRA optimizes an existing model for a specific task. Knowledge distillation is a different technique: it uses a larger, more capable teacher model to train a smaller student model to perform nearly as well on a target distribution.
The key insight is that a large model's output probability distributions carry more information than just the final answer. When a teacher model predicts the next token, its confidence scores across the vocabulary encode nuanced uncertainty — for instance, the teacher might assign 40% probability to "cat," 35% to "dog," and 25% to "animal" rather than just outputting "cat." A student trained to match these soft labels rather than just the hard ground-truth labels learns richer representations.
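The soft-label idea can be written as a short loss function. This sketch follows the classic formulation (KL divergence between temperature-softened teacher and student distributions, scaled by the temperature squared) and reuses the 40/35/25 example from the paragraph above. The student logits and the temperature are made-up values.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """Soft-label loss: how far the student's distribution is from the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, q_student)


# Teacher distribution from the text: cat 40%, dog 35%, animal 25%.
teacher_logits = [math.log(0.40), math.log(0.35), math.log(0.25)]
overconfident_student = [5.0, 0.0, 0.0]   # puts almost everything on "cat"

print(distillation_loss(teacher_logits, overconfident_student))
print(distillation_loss(teacher_logits, teacher_logits))
```

A student trained only on the hard label "cat" would never be penalized for the overconfident distribution; the soft-label loss is what pushes it to keep "dog" and "animal" plausible.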
Distillation from RL checkpoints
A recent and increasingly important variant — highlighted in the VibeThinker 3B paper — is distillation from reinforcement learning checkpoints. Here, the teacher is not just a large pre-trained model but a model that has already undergone expensive RL training to develop specific reasoning behaviors. The student is trained on the teacher's RL-refined outputs, absorbing the reasoning patterns without running the full RL process itself.
This is why VibeThinker 3B, a 3-billion-parameter model, can match Claude Opus 4.5 on specific coding benchmarks: it was distilled from a larger RL-trained teacher, then given its own RL instruct pass. The combination is remarkably sample-efficient for narrow task domains.
Distillation is especially valuable when:
- You need a model small enough to run locally or on edge devices
- Your task is well-defined and the teacher model performs it well
- You want to avoid the cost of RL training on the student directly
- Inference cost is a hard constraint (smaller model = faster + cheaper)
Fine-Tuning vs RAG vs Prompting: A Decision Matrix
This is the question practitioners get wrong most often. The answer is not "one is better" — the three approaches solve different problems and are often combined.
| Dimension | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Knowledge type | General (whatever base model knows) | Dynamic, retrieved from external store | Static, baked into weights |
| Data freshness | Real-time via prompt | Real-time via retrieval | Stale after training |
| Setup cost | Minimal | Medium (build retrieval pipeline) | High (dataset + training) |
| Inference cost | Higher (long prompts) | Medium (retrieval + shorter prompt) | Lower (short prompts) |
| Behavior consistency | Variable (prompt sensitive) | Variable | High (baked in) |
| Best for | Flexible tasks, unclear requirements | Changing data, document Q&A | Stable tasks, format/style/domain |
When to choose each
Use prompting when: requirements are still evolving, you need to handle a wide variety of tasks, or the task involves general reasoning where the base model already performs adequately.
Use RAG when: the task requires factual information that changes frequently (product catalogs, documentation, news), or the knowledge base is too large to bake into weights, or you need citations and provenance for retrieved information.
Use fine-tuning when: you want consistent behavior across all invocations, the task has a specific format or style that prompts cannot reliably enforce, you have high inference volume and want shorter prompts to reduce costs, or you are working with proprietary domain knowledge that the base model genuinely lacks.
Combine fine-tuning + RAG when: you want consistent style and behavior (from fine-tuning) plus access to current factual information (from retrieval). A fine-tuned model with a retrieval layer is the architecture many production systems converge on.
For the broader question of whether to use open-source fine-tunable models vs closed API models, see Closed-source AI vs local open-source alternatives 2026.
Open-Source Fine-Tuning vs Closed API Fine-Tuning
The landscape for fine-tuning divides sharply between open-source models you train yourself and closed-model fine-tuning APIs offered by providers.
Open-source fine-tuning
Models like Meta Llama 3.3, Qwen 2.5, Mistral, and Gemma 3 can be fine-tuned on your own infrastructure or cloud compute. The workflow typically uses:
- Hugging Face Transformers + PEFT library for LoRA/QLoRA
- TRL (Transformer Reinforcement Learning) for SFT and RLHF
- Axolotl as a higher-level orchestration layer for fine-tuning runs
- Unsloth for significantly faster training with optimized CUDA kernels
You own the resulting weights. You can deploy anywhere, run locally, and fine-tune iteratively without per-call API costs. The tradeoff is infrastructure complexity and the cost of managing training compute.
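import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

# QLoRA: load the frozen base model in 4-bit NF4 and compute in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# "train.jsonl" is a placeholder path for a chat-format JSONL file
# like the one shown in the SFT section
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./fine-tuned-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
    ),
    train_dataset=train_dataset,
)
trainer.train()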
Closed-model fine-tuning APIs
OpenAI offered fine-tuning for GPT-3.5 and GPT-4 series models. The workflow was simpler — upload a JSONL file, call the fine-tuning API, get a fine-tuned model ID back — but the resulting weights were hosted by OpenAI. As covered in OpenAI Winds Down Fine-Tuning API, OpenAI announced in May 2026 that it was winding down its fine-tuning platform, giving customers until January 6, 2027 to create new training jobs.
Anthropic has taken a different position: it does not offer a general self-serve fine-tuning API for Claude models. The main exception has been fine-tuning of Claude 3 Haiku through Amazon Bedrock. The rationale usually given centers on safety: fine-tuning is a mechanism by which carefully trained alignment properties can be degraded, intentionally or not, and offering it safely at scale requires tooling that upholds the company's safety standards. The company's approach to customization is instead through system prompts, long context, and its Models API.
What Fine-Tuning Cannot Do
Understanding the limits of fine-tuning is as important as understanding what it enables.
1. Fine-tuning cannot reliably add new factual knowledge
This is the most common misconception. If you train a model on 1,000 examples containing facts that were not in the pretraining data, the model may appear to learn those facts during training — but knowledge injection through fine-tuning is unreliable. The model is better at interpolating between things it already knows than at genuinely memorizing new facts from fine-tuning data.
Catastrophic forgetting compounds this: as a model fine-tunes heavily on new data, it can degrade performance on knowledge it had before. This is why RAG is the correct tool for knowledge that changes or that was never in pretraining, while fine-tuning is the correct tool for behavior and style adaptation.
2. Fine-tuning cannot make a weak base model strong
Fine-tuning extracts and sharpens capabilities the base model already has latent. If the base model cannot reason through multi-step legal analysis, fine-tuning it on 10,000 legal documents will improve its legal vocabulary and formatting — but not its fundamental reasoning quality.
The most important decision in any fine-tuning project is: choose the right base model first. A well-fine-tuned Llama 3.3 70B will outperform a heavily fine-tuned 7B model on complex tasks. The base model's pretraining scale sets the ceiling; fine-tuning moves you closer to that ceiling on a specific task.
3. Fine-tuning can introduce bias if the dataset is biased
Your fine-tuning dataset is a direct lever on model behavior. If your dataset systematically underrepresents certain cases, overrepresents a particular perspective, or contains errors, the fine-tuned model will reflect those biases more strongly than the base model did. A biased reward model in an RLHF pipeline will produce a biased fine-tuned model.
Dataset curation — reviewing examples, ensuring diversity of cases, checking for label errors — is not optional overhead. It is where most fine-tuning quality problems originate.
4. Fine-tuning has ongoing maintenance cost
Unlike prompting, which you can iterate on daily, a fine-tuned model requires retraining to update. If your task requirements evolve, you need to rebuild the dataset and retrain. For tasks that change rapidly, this maintenance overhead may exceed the inference savings.
Practical Guide: Running Your First Fine-Tuning Job
Step 1: Define the task precisely
Write down, in one paragraph, exactly what behavior you want the fine-tuned model to have that the base model plus a good system prompt does not reliably produce. If you cannot write that paragraph, you are not ready to fine-tune.
Step 2: Prepare the dataset
Format: JSONL with conversation turns, as shown in the SFT section above.
Size: Start with 100-500 examples. You can always add more. Starting too large means slow iteration cycles.
Quality checks:
- Manually review a random 10% sample before training
- Remove examples where the assistant output is wrong, unclear, or inconsistent
- Ensure the system prompt in your training data exactly matches what you will use at inference
- Balance the dataset — if 90% of examples are about one subtopic and 10% about another, the model will underperform on the minority
Split: Reserve 10-20% of examples as a hold-out evaluation set. Never train on your eval set.
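A reproducible split is worth scripting so the same examples stay held out across runs. This is a minimal sketch; the function name, default fraction, and seed are arbitrary choices.

```python
import random


def train_eval_split(examples: list, eval_fraction: float = 0.1, seed: int = 42):
    """Shuffle deterministically, then hold out a fraction for evaluation."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed gives the same split every run
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


examples = list(range(200))                 # stand-in for 200 loaded JSONL records
train_set, eval_set = train_eval_split(examples, eval_fraction=0.15)
print(len(train_set), len(eval_set))
```

Deduplicate before splitting; otherwise near-identical examples can land on both sides and inflate your evaluation scores.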
Step 3: Choose your training setup
For open-source models with QLoRA:
| Hyperparameter | Typical starting value | Notes |
|---|---|---|
| Learning rate | 2e-4 to 5e-4 | Higher than full fine-tuning; LoRA adapters train faster |
| LoRA rank (r) | 16 | Increase to 32-64 for complex tasks |
| LoRA alpha | 2× rank | Controls scaling of the adapter output |
| Batch size | 4-8 per device | If you hit OOM, lower this and raise gradient accumulation to keep the effective batch size |
| Epochs | 2-4 | More epochs = more overfitting risk on small datasets |
| Warmup ratio | 0.03 | Gradual LR warmup for stability |
| LR scheduler | cosine | Decays LR smoothly over training |
| Max sequence length | 2048-4096 | Long enough for your longest prompt plus response; longer sequences cost more memory |
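Several of these values interact, so it can help to see them together. The sketch below computes an effective batch size, an approximate step count, and a warmup-plus-cosine learning-rate curve. It is written in the spirit of common trainer schedules but is not a copy of any library's implementation, and the dataset size is a made-up example.

```python
import math


def effective_batch_size(per_device: int, grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device * grad_accum_steps * num_devices


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Approximate number of weight updates over the whole run."""
    return math.ceil(num_examples / batch_size) * epochs


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-4,
               warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(per_device=4, grad_accum_steps=4)
steps = total_optimizer_steps(num_examples=500, batch_size=batch, epochs=3)
print(f"effective batch {batch}, about {steps} optimizer steps")
for s in (0, 3, steps // 2, steps):
    print(s, f"{lr_at_step(s, steps):.2e}")
```

With a few hundred examples the whole run is only on the order of a hundred updates, which is one reason small datasets overfit quickly when epochs are increased.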
Step 4: Evaluate
Do not rely on training loss alone. Evaluate on your held-out set with the same metrics you care about in production:
- Format adherence: Does the model consistently produce the expected output format?
- Domain accuracy: Does a domain expert rate the outputs as correct?
- Regression testing: Does the fine-tuned model still handle edge cases the base model handled well?
- A/B comparison: Do raters prefer the fine-tuned model over the base model + system prompt?
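Format adherence is the easiest of these checks to automate. As a sketch, assuming the expected format is a JSON object with a known set of keys (the key names here are hypothetical), the metric is the share of held-out outputs that parse and contain every required key:

```python
import json


def format_adherence_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that are JSON objects containing all required keys."""
    if not outputs:
        return 0.0
    passed = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue                      # not valid JSON at all
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            passed += 1
    return passed / len(outputs)


# Hypothetical model outputs on four held-out prompts.
outputs = [
    '{"intent": "refund", "priority": "high"}',
    '{"intent": "billing"}',                          # missing a key
    'Sure! {"intent": "refund", "priority": "low"}',  # chatty preamble breaks parsing
    '{"intent": "shipping", "priority": "low"}',
]
rate = format_adherence_rate(outputs, {"intent", "priority"})
print(f"format adherence: {rate:.0%}")
```

Computing the same number for the base model with your best system prompt gives you the baseline the fine-tuned model has to beat.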
Step 5: Iterate
Fine-tuning is iterative. The first run rarely produces the best model. Common issues and fixes:
| Problem | Likely cause | Fix |
|---|---|---|
| Model ignores system prompt | System prompt not in training data | Add system prompt to every training example |
| Outputs too short / too long | Training examples are too short / too long | Adjust training data length distribution |
| Model forgets base knowledge | Learning rate too high or too many epochs | Reduce LR, cap training at 3 epochs, use LoRA (freezes base weights) |
| Behavior inconsistent | Dataset too small or too noisy | Add examples, manually clean dataset |
| Format failures | Not enough format-critical examples | Add more examples that exercise the format |
The 2026 Context: RL-Based Post-Training and the Fine-Tuning Landscape
The fine-tuning story in 2026 is increasingly about RL-based post-training, not just SFT.
The pattern that labs discovered — and that smaller teams are now replicating with open-source models — is that reinforcement learning on verifiable outcomes is dramatically more efficient than SFT for tasks with clear success criteria. Coding (does the code pass the tests?), math (is the answer correct?), and tool use (did the tool call succeed?) all have reward signals that can be computed automatically. This removes the need for expensive human preference data.
The implication for fine-tuning practitioners: for tasks where you have a verifiable outcome, consider building an RL fine-tuning pipeline rather than stopping at SFT. The TRL library supports GRPO (Group Relative Policy Optimization), which scores each sampled response against the other responses in its group instead of training a separate value model. That makes it lighter on infrastructure than PPO and often easier to run for small-scale RL fine-tuning.
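The core of the group-relative idea fits in a few lines. This sketch pairs a verifiable exact-match reward with per-group advantage normalization. It illustrates the concept only and is not TRL's implementation; the use of population standard deviation and the `eps` value are assumptions.

```python
import statistics


def exact_match_reward(model_answer: str, expected: str) -> float:
    """Verifiable reward: 1.0 if the answer matches, else 0.0. No human rater needed."""
    return 1.0 if model_answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sample relative to the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt whose correct answer is "42".
samples = ["42", "41", " 42 ", "forty-two"]
rewards = [exact_match_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, adv in zip(samples, rewards, advantages):
    print(f"{sample!r:>12}  reward={reward}  advantage={adv:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative, so the update raises the probability of the former. When every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.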
The broader industry shift is also moving away from single large fine-tuned models toward specialized small models that are distilled from frontier RL checkpoints, as VibeThinker 3B illustrates. This is a different cost structure: instead of paying per token to a frontier API, you pay up front to train a small model and then pay only for your own serving compute. For well-defined production tasks, these economics increasingly favor fine-tuning.
Meanwhile, OpenAI's wind-down of its fine-tuning API and Anthropic's absence from the fine-tuning market are pushing enterprise teams toward open-source model fine-tuning — a trend that is accelerating the maturity of open-source fine-tuning tooling.
Related Reading
- Scalable oversight: RLHF, Constitutional AI, and weak-to-strong generalization
- VibeThinker 3B: Opus 4.5 performance at 3B parameters via RL distillation
- What are parameters in a large language model?
- Closed-source AI vs local open-source alternatives 2026
- OpenAI winds down fine-tuning API: GPT-5.5 pricing and what to do next