In April 2025, Meta released Llama 4, and the open-source AI landscape changed materially. The Scout model fits on a single high-end GPU when quantized to 4-bit, accepts up to 10 million tokens of context, and approaches the performance of closed models that charge several dollars per million tokens. The Maverick variant, a 400B-total-parameter Mixture of Experts model, outperformed GPT-4o on several major benchmarks at launch, according to Meta's published results.
This guide gives you a complete picture of the Llama 4 family: how the models work, how to run them, how to fine-tune them, and when open weights are the right choice over a closed API.
Why Llama Matters
Meta's decision to release Llama's weights publicly is one of the most consequential choices in AI's recent history. Previous open-weight models were smaller and less capable than frontier closed models. Llama 4 changed that calculation.
Three things make Llama strategically important:
It decouples capability from API dependency. When you use GPT-4o or Claude, you depend on OpenAI or Anthropic for uptime, pricing, content policy, and data handling. With Llama 4's weights, you own the model. You can run it in an air-gapped environment, fine-tune it on proprietary data, and scale inference without per-token costs.
It changes the economics of AI at scale. At low to moderate volume, paying $0.30–$10 per million tokens is fine. At very high volume (millions of requests per day) the economics shift. A company processing 1 billion tokens per day at $1/M pays $1,000 per day, or roughly $30,000 per month. Running equivalent Llama 4 inference on owned hardware can cost a fraction of that once infrastructure is amortized.
It enables customization that closed APIs cannot match. For narrow, well-defined tasks, prompt engineering rarely matches what fine-tuning on 50,000 domain-specific examples achieves. With open weights, you control the entire fine-tuning process and keep the resulting model.
Meta's open-source bet is also strategic: the more the AI ecosystem builds on Llama, the harder it is for OpenAI or Anthropic to achieve a monoculture. Meta doesn't sell AI APIs — it sells advertising, and a healthy AI ecosystem that keeps competitors from dominating cloud AI serves Meta's interests.
The Llama 4 Model Family
Llama 4 Scout
Scout is the model most people can realistically run. It uses a Mixture of Experts architecture with 17 billion active parameters across 16 experts (109 billion total parameters) and processes a 10 million token context window — the longest context window of any open-weight model at launch.
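Scout was designed to fit on a single Nvidia H100 (80GB) GPU with Int4-quantized weights. At 16-bit precision, its 109 billion parameters occupy roughly 218GB and need several GPUs. Even at 4-bit (for example Q4_K_M), the weights take roughly 55–65GB, so a consumer 24GB card such as the RTX 4090 can only run Scout by offloading most of the model to system RAM, at a substantial cost in speed.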
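The 10M token context is useful for software engineering tasks: in principle, you can load an entire large repository as context without chunking. For research, you can load thousands of papers. No other open-weight model came close to this context length at launch. In practice, the usable context depends on how much memory you can spare for the KV cache, which grows linearly with sequence length.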
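To see why very long contexts are expensive to serve, here is a rough KV-cache estimate. It is a sketch only: the layer count, KV-head count, and head dimension are illustrative assumptions for a Scout-sized model rather than confirmed values, and it assumes plain full attention with a 16-bit cache in every layer, so treat the result as an upper bound.

```python
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Upper-bound KV-cache size for one sequence, assuming full attention in every layer.

    The default architecture numbers are assumptions for illustration.
    """
    # Each token stores one key vector and one value vector per layer and KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


for tokens in (32_768, 1_000_000, 10_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>10,} tokens -> ~{gib:,.0f} GiB of KV cache")
```

Under these assumptions the cache is about 6 GiB at 32,768 tokens and well over 1 TiB at 10 million tokens, which is why long-context deployments usually cap the maximum sequence length.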
Llama 4 Maverick
Maverick is the high-performance tier. It also has 17 billion active parameters but runs across 128 experts (400 billion total parameters). This larger expert pool means more specialized knowledge capacity without proportionally higher inference compute.
Maverick is natively multimodal: it accepts both text and images as input. At launch, Meta reported scores above GPT-4o and Gemini 2.0 Flash on several benchmarks, including math and image understanding. Maverick requires multi-GPU infrastructure (typically 4–8 H100s) for comfortable inference.
Llama 4 Behemoth
Behemoth is not a deployable model in the traditional sense — it's Meta's teacher model, used to generate synthetic training data and distill knowledge into Scout and Maverick. With 288 billion active parameters and approximately 2 trillion total parameters across 16 experts, it's the largest model Meta has built.
As of mid-2026, Behemoth is still in preview/internal training. Meta has shared early benchmark numbers showing it outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM reasoning tasks. Whether and when Meta will release Behemoth's weights publicly is unclear — it may remain an internal teacher-only model.
Model Comparison Table
| Model | Active Params | Total Params | Experts | Context Window | Hardware Minimum | Multimodal |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 109B | 16 | 10M tokens | 1× H100 (80GB, 4-bit) | Yes |
| Llama 4 Maverick | 17B | 400B | 128 | 1M tokens | 4× H100 (4-bit) | Yes |
| Llama 4 Behemoth | 288B | ~2T | 16 | TBD | Not public | Yes |
Mixture of Experts — What It Means for Llama 4
MoE is the architectural innovation that makes Llama 4's efficiency possible. Understanding it helps you make sense of what "17B active / 109B total" actually means.
How MoE Works
A standard dense model like GPT-3 (175B parameters) activates all 175B parameters for every single token it processes. That's enormously expensive.
A Mixture of Experts model replaces some dense layers with a set of parallel "expert" feed-forward networks plus a learned routing function. For each token, the router selects a small number of experts (commonly one to four out of 16 or 128) to process that token. All other routed experts sit idle for that step. In Llama 4, according to Meta's description, each token in an MoE layer goes to a shared expert plus one routed expert.
The result: you get the knowledge capacity of a 109B or 400B parameter model (because those parameters exist and were trained) at the compute cost of roughly a 17B model (because only 17B activates per token). Memory still needs to hold all the weights, but compute per forward pass is much lower.
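The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not Llama 4's implementation: the experts are simple functions, the router logits are random stand-ins for a learned router, and the softmax-then-renormalize weighting is one common choice among several.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=1):
    """Return [(expert_index, weight)] for the top_k experts, weights renormalized to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(x, experts, router_logits, top_k=1):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, top_k))


# 16 toy "experts": each one just scales its input by a different factor.
experts = [lambda x, k=k: x * (k + 1) for k in range(16)]
random.seed(0)
logits = [random.gauss(0, 1) for _ in experts]  # stand-in for a learned router's output

selected = route_token(logits, top_k=2)
print("experts used:", [i for i, _ in selected], "of", len(experts))
print("layer output:", moe_layer(1.0, experts, logits, top_k=2))
```

Only the selected experts are evaluated for a given token, which is where the compute savings come from; all 16 still have to be held in memory.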
Why "17B Active" Matters in Practice
Inference speed: Throughput scales with active parameters, not total. Scout generates tokens at roughly 17B-model speed, not 109B-model speed.
Compute cost: FLOPs per forward pass are roughly proportional to active parameters. Per token, Scout needs about as much compute as a 17B dense model.
Memory requirements: You still need to store all 109B parameters in memory (VRAM or RAM). This is why Scout needs roughly 218GB for its weights at 16-bit precision, and still about 55GB at 4-bit, despite computing with only 17B parameters per token.
Quality from scale: The 128-expert pool in Maverick gives the router room to learn specialized experts. Routing happens per token and per layer, so tokens from a Python debugging prompt may activate different experts than tokens from a passage of French literature, even within the same batch.
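The split between memory and compute is simple arithmetic. The sketch below uses the parameter counts from this guide; the "2 FLOPs per active parameter per token" figure is a common rule of thumb, not a measured number, and the memory estimate covers weights only (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(total_params_b, bits_per_weight):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def flops_per_token(active_params_b):
    """Rule of thumb: about 2 FLOPs per active parameter per generated token."""
    return 2 * active_params_b * 1e9


# (active parameters, total parameters), both in billions
models = {"Scout": (17, 109), "Maverick": (17, 400)}

for name, (active_b, total_b) in models.items():
    for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
        print(f"{name:9s} {label:7s} ~{weight_memory_gb(total_b, bits):6.1f} GB of weights")
    print(f"{name:9s} ~{flops_per_token(active_b):.1e} FLOPs per token")
```

Both models land on the same per-token compute because they share 17B active parameters, while their memory footprints differ by almost 4×.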
Open Weights vs API Models — The Real Tradeoffs
Most teams default to API models because the path to "first working prototype" is shortest. But open weights change the calculus at scale or in constrained environments.
What Open Weights Give You
No per-token cost after infrastructure. Once you have a GPU server running Llama 4, inference is as cheap as electricity and amortized hardware.
Complete data privacy. Your prompts and completions never leave your infrastructure. This matters for healthcare (HIPAA), legal (privilege), finance (confidential data), and government use cases.
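Full customization. You can fine-tune Llama 4 on your company's internal documents, support tickets, codebase, or domain-specific corpus. Some closed providers offer hosted fine-tuning, but you never receive the weights, and you are limited to the models and methods they expose.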
No rate limits. Your throughput is bounded by hardware, not API quotas.
Reproducibility. Hosted models can change or be retired on the provider's schedule. If Claude Fable 5 changes in October, your production prompts may break. Pinned Llama 4 weights, run with the same serving stack and sampling settings, keep behaving the same way.
What You Give Up
Model quality ceiling. Claude Fable 5 and GPT-5.6 Sol remain ahead of Llama 4 Maverick on complex reasoning, nuanced instruction following, and agentic tasks. The gap is narrowing but real.
Operational burden. You become responsible for uptime, scaling, monitoring, and updates.
Upfront cost. GPU infrastructure is expensive. Cloud H100s cost $2.50–$4.00/hour. If your volume is low, API models are cheaper.
When Open Weights Win
| Scenario | Recommendation |
|---|---|
| Processing <1M tokens/day | API model (lower cost) |
| Processing >100M tokens/day | Self-hosted Llama (lower marginal cost) |
| Healthcare / legal / classified data | Self-hosted (data never leaves) |
| No internet access environment | Self-hosted (only option) |
| Need custom fine-tuning | Self-hosted Llama |
| Need best reasoning quality | Claude Fable 5 or GPT-5.6 |
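The volume thresholds in the table can be sanity-checked with a rough break-even calculation. The figures below are assumptions for illustration (4 rented GPUs at $3.00/hour, a blended API price of $1.00 per million tokens), and the sketch ignores engineering time and whether the cluster can actually sustain the required throughput.

```python
def api_cost_per_day(tokens_per_day, price_per_million):
    """Daily API bill at a flat per-token price."""
    return tokens_per_day / 1_000_000 * price_per_million


def self_host_cost_per_day(gpus, price_per_gpu_hour):
    """Daily GPU bill, assuming the GPUs are rented and kept running 24 hours a day."""
    return gpus * price_per_gpu_hour * 24


def break_even_tokens_per_day(gpus, price_per_gpu_hour, price_per_million):
    """Daily volume above which the fixed GPU bill is cheaper than per-token pricing."""
    return self_host_cost_per_day(gpus, price_per_gpu_hour) / price_per_million * 1_000_000


print(api_cost_per_day(1_000_000_000, 1.00))     # 1B tokens/day at $1/M
print(self_host_cost_per_day(4, 3.00))           # 4 GPUs at $3.00/hour
print(break_even_tokens_per_day(4, 3.00, 1.00))  # tokens/day where the two bills match
```

With these inputs, the API bill at 1 billion tokens per day is $1,000, the GPU bill is $288 per day, and the break-even point is 288 million tokens per day. Cheaper hosted prices push the break-even volume higher.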
Running Llama 4 Locally
Hardware Requirements
| Setup | Approx. Memory for Weights | Recommended GPU |
|---|---|---|
| Scout 4-bit (Q4_K_M) | ~55–65GB | 1× H100 80GB (24GB cards only with heavy CPU offload) |
| Scout 16-bit | ~220GB | 4× H100 or 4× A100 80GB |
| Maverick 4-bit (Q4_K_M) | ~200–240GB | 4× H100 or 4× A100 80GB |
| Maverick 16-bit | ~800GB | 8× H200 or multi-node H100 |
Ollama — The Easiest Path
Ollama is a tool that handles model download, quantization selection, and local serving behind a simple CLI and REST API. It's the fastest path from zero to running Llama 4.
```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 4 Scout
ollama pull llama4:scout

# Start an interactive session
ollama run llama4:scout
```
Ollama also exposes its own REST API on port 11434:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Explain the Mixture of Experts architecture in three paragraphs.",
  "stream": false
}'
```
In addition, Ollama serves an OpenAI-compatible endpoint under /v1, so OpenAI-compatible Python clients work with it:
```python
from openai import OpenAI

# Point the OpenAI client at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but not validated
)

response = client.chat.completions.create(
    model="llama4:scout",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the key differences between Llama 4 Scout and Maverick?"},
    ],
)
print(response.choices[0].message.content)
```
LM Studio — GUI for Local Models
If you prefer a graphical interface, LM Studio provides a desktop app (macOS, Windows, Linux) that handles downloading models from Hugging Face, selecting quantization levels, and running a local chat interface or API server. It's the easiest path for non-developers who want to run Llama 4 locally.
vLLM — Production Serving
For production workloads, vLLM is the standard for high-throughput inference. It implements PagedAttention (efficient KV cache management) and continuous batching (serving multiple requests simultaneously):
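```bash
pip install vllm

# Serve Llama 4 Scout with vLLM.
# The 16-bit weights (~220GB) do not fit on one 80GB GPU, so this shards them
# across 4 GPUs; adjust --tensor-parallel-size to match your hardware.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --port 8000
```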
vLLM exposes an OpenAI-compatible API at http://localhost:8000/v1, so code written for OpenAI's SDK works once you change the base URL and model name.
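Because both local servers speak the same chat-completions format, switching between them mostly means switching the base URL. The sketch below builds the request with the standard library only; the helper names (`BACKENDS`, `build_chat_request`, `send`) are hypothetical, and the base URLs are the ones used in this guide's examples.

```python
import json
import urllib.request

# Base URLs taken from the examples in this guide.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(backend, model, user_message, api_key="not-needed"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("vllm", "meta-llama/Llama-4-Scout-17B-16E-Instruct", "Hello")
print(req.full_url)
```

Calling `send(req)` against a running server returns the reply text. Only the backend key and model name change between Ollama and vLLM.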
Downloading Model Weights
All official Llama 4 weights are hosted on Hugging Face at meta-llama/:
```python
from huggingface_hub import snapshot_download

# Download Scout Instruct weights
snapshot_download(
    repo_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    local_dir="./models/llama4-scout",
    ignore_patterns=["*.bin"],  # Only download safetensors format
)
```
The repositories are gated: you'll need to accept Meta's license terms on the Hugging Face model page (one-time, free) and be logged in with a Hugging Face access token before the download begins.
Llama 4 Benchmarks — Honest Comparison
At launch, Meta published benchmark numbers comparing Llama 4 to closed models. Here's a more grounded picture.
Where Llama 4 Holds Up
| Benchmark | Llama 4 Scout | Llama 4 Maverick | GPT-4o | Claude Fable 5 |
|---|---|---|---|---|
| MMLU (General Knowledge) | 79.6 | 85.5 | 88.7 | 89.4 |
| MATH (Math Reasoning) | 74.3 | 84.8 | 76.6 | 88.2 |
| HumanEval (Coding) | 72.1 | 82.3 | 90.2 | 91.7 |
| MMMU (Multimodal) | 69.4 | 80.5 | 77.2 | 79.8 |
Note: Benchmarks are sensitive to evaluation methodology. Treat these as directional, not precise.
Key observations:
- Maverick is slightly behind GPT-4o on general knowledge (MMLU) and ahead of it on multimodal tasks (MMMU)
- Both Llama 4 models trail Claude Fable 5 and GPT-5.6 Sol on coding and reasoning
- Scout is notably behind on HumanEval — it's not the first choice for complex code generation
- On MATH, Maverick beats GPT-4o by about 8 points, a notable result for an open-weight model
Where Llama 4 Still Trails
Complex multi-step reasoning, following nuanced multi-constraint instructions, and agentic tool-use tasks show the largest gaps versus Claude Fable 5 and GPT-5.6. These are areas where training methodology and RLHF alignment seem to matter more than raw parameter count.
If your use case depends heavily on precise instruction following (e.g., structured output generation, complex legal drafting), closed models still have an edge.
Llama 4 via API (Without Self-Hosting)
If you want Llama 4's open-weight economics without managing your own GPU infrastructure, several providers host Llama 4 as an API:
| Provider | Llama 4 Scout Input | Llama 4 Scout Output | Notes |
|---|---|---|---|
| Together AI | $0.18/M | $0.59/M | Fast inference, good reliability |
| Fireworks AI | $0.20/M | $0.70/M | Excellent latency, serverless |
| Groq | $0.11/M | $0.34/M | Fastest inference available |
| AWS Bedrock | $0.25/M | $0.85/M | Enterprise compliance, AWS native |
Together AI, Fireworks AI, and Groq expose OpenAI-compatible APIs, so migration is straightforward:
```python
import os

from openai import OpenAI

# Using Groq's API for fast Llama 4 inference
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],  # Read the key from the environment instead of hard-coding it
)

response = client.chat.completions.create(
    # Model IDs differ between providers; check the provider's model list
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {"role": "user", "content": "Summarize the key advantages of MoE architecture."}
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```
Fine-Tuning Llama 4 on Your Own Data
Fine-tuning is where open weights provide a genuine capability advantage over closed APIs. You can adapt Llama 4 to your specific domain, style, or task format in ways that prompting alone cannot achieve.
LoRA and QLoRA — The Efficient Path
Full fine-tuning of Scout means updating all 109B parameters, not just the 17B active per token, which requires enormous GPU memory and training time. LoRA (Low-Rank Adaptation) and QLoRA (quantized LoRA) make this practical:
LoRA: Freezes the original model weights and trains small adapter matrices that learn the delta between the base model and your target behavior. Total trainable parameters are typically <1% of the model.
QLoRA: Extends LoRA by also quantizing the frozen base model to 4-bit, reducing memory requirements to the point where fine-tuning Llama 4 Scout becomes feasible on a single A100 80GB, provided sequence lengths and batch sizes stay small.
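The parameter savings follow from the shapes involved. For a frozen weight matrix W of shape d_out × d_in, LoRA learns B (d_out × r) and A (r × d_in) and uses W + (alpha / r) · BA, so the adapter has r · (d_out + d_in) trainable parameters. The sketch below works this out for one square projection; the hidden size is a hypothetical value chosen for illustration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """LoRA adds B (d_out x rank) and A (rank x d_in) next to a frozen d_out x d_in weight."""
    return rank * (d_out + d_in)


d = 5120   # hypothetical hidden size, for illustration only
rank = 16

full = d * d
adapter = lora_trainable_params(d, d, rank)
print(f"frozen weight: {full:,} params")
print(f"LoRA adapter:  {adapter:,} params ({adapter / full:.2%} of the layer)")
```

At rank 16 the adapter is well under 1% of the layer it modifies, which matches the "<1% trainable parameters" figure above.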
What You Need
- Dataset: Minimum ~500 high-quality instruction-response pairs; 5,000–50,000 for meaningful improvements
- Compute: For QLoRA on Scout — 1× A100 80GB; 2–4 hours for 1,000 steps
- Tools: Hugging Face transformers, peft, and trl libraries; or Unsloth (optimized LoRA training, 2–3× faster)
Example Fine-Tuning Setup (Unsloth)
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Scout with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # LoRA rank: higher = more capacity but more params
    lora_alpha=16,  # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)

# Load your dataset: JSONL with one "text" field per example, already in instruction format
dataset = load_dataset("json", data_files="my_training_data.jsonl")

# Note: tokenizer=, dataset_text_field=, and max_seq_length= are accepted here by
# older TRL releases; newer releases expect these settings in SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=8192,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size of 8
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="./llama4-scout-finetuned",
        bf16=True,  # A100/H100 support bfloat16; use fp16=True on older GPUs
    ),
)
trainer.train()

# Save the LoRA adapter and the tokenizer alongside it
model.save_pretrained("./llama4-scout-finetuned")
tokenizer.save_pretrained("./llama4-scout-finetuned")
```
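The training script above expects a JSONL file in which every line has a single "text" field. A minimal sketch of producing that file is shown below; the prompt template and the example records are hypothetical, and for an instruct model the tokenizer's own chat template is usually the better choice for formatting.

```python
import json


def to_training_text(instruction, response):
    """Join one example into a single string (hypothetical template, for illustration)."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


examples = [
    {"instruction": "Summarize: The server returned HTTP 503 twice before recovering.",
     "response": "The server was briefly unavailable and then recovered."},
    {"instruction": "Classify the sentiment: 'Setup took five minutes and just worked.'",
     "response": "positive"},
]

# One JSON object per line, each with the "text" field the trainer reads.
with open("my_training_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {"text": to_training_text(ex["instruction"], ex["response"])}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Whatever template you choose, use the same one at inference time so the model sees prompts in the format it was trained on.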
When Fine-Tuning Makes Sense
Fine-tuning pays off when:
- Your domain uses specialized terminology the base model handles poorly
- You need consistent output in a specific structured format (JSON schema, XML tags, etc.)
- You have 1,000+ high-quality examples of the exact behavior you want
- You've confirmed that few-shot prompting doesn't achieve acceptable quality
Do not fine-tune just to "add knowledge." RAG (retrieval-augmented generation) is more efficient for adding facts. Fine-tune to change behavior, format, or style.
The Llama Ecosystem
Hugging Face
Hugging Face is the primary hub for Llama 4 weights, derivatives, and community fine-tunes. The model hub hosts everything from the official Meta releases to thousands of community fine-tunes, quantized versions (GGUF for llama.cpp-based runtimes such as Ollama, AWQ and GPTQ for GPU serving), and multimodal adaptations.
Key repositories to know:
- meta-llama/Llama-4-Scout-17B-16E-Instruct: official Scout instruct weights
- meta-llama/Llama-4-Maverick-17B-128E-Instruct: official Maverick instruct weights
- unsloth/Llama-4-Scout-17B-16E-Instruct-bnb-4bit: pre-quantized Scout for fast QLoRA training
Meta AI Studio
Meta AI Studio is Meta's own fine-tuning and deployment platform, accessible through meta.ai. It lets you fine-tune Llama models on custom datasets through a web interface without writing training code, and it is positioned for non-developers who want customization without infrastructure management.
Community Derivatives
The open-weight nature of Llama means the community builds on it aggressively. By mid-2026, thousands of Llama 4 derivatives exist on Hugging Face for specific domains: medical transcription, legal contract analysis, code generation in specific frameworks, customer service in multiple languages. Before fine-tuning from scratch, check whether a community fine-tune for your domain already exists.
When to Choose Llama Over Closed Models
Choose Llama 4 when:
- Data privacy is non-negotiable (healthcare, legal, intelligence, financial confidential)
- You're processing >50M tokens/day and compute costs matter
- You need custom fine-tuning on proprietary data
- You're operating in an air-gapped or offline environment
- You want to avoid vendor lock-in for a strategic capability
- You have the engineering team to manage inference infrastructure
Choose Claude Fable 5 or GPT-5.6 when:
- You need best-in-class reasoning for complex agentic tasks
- Your volume is low to moderate (API cost is acceptable)
- Fast time-to-production matters more than cost optimization
- You need the latest model capabilities with minimal engineering overhead
- Your use case depends on precise instruction following at the frontier
Getting Started Today
The fastest path to running Llama 4:
```bash
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull Scout (a quantized download in the tens of GB)
ollama pull llama4:scout

# 3. Chat with it
ollama run llama4:scout "What can you help me with today?"
```
For production workloads: evaluate Together AI or Groq for hosted inference, then migrate to self-hosted vLLM when your volume justifies the infrastructure investment.
For fine-tuning: start with Unsloth on a rented A100 (Lambda Labs, Vast.ai, or RunPod) rather than buying hardware until you've validated that fine-tuning improves your specific use case.
Llama 4 is the strongest evidence yet that open-weight models can compete with — and in some dimensions surpass — closed frontier models. Scout's 10M context window and Maverick's benchmark performance represent a step change from previous open-weight generations. If you're building AI products and haven't evaluated Llama 4, now is the time.