What are the Llama 4 model variants?

Llama 4 comes in three models: Scout (17B active parameters, 109B total, 10M token context, runs on a single H100), Maverick (17B active, 400B total MoE, strong multimodal performance), and Behemoth (288B active, ~2T total, still in preview/training, used as a teacher model for Scout and Maverick).

What does "open weight" mean for Llama 4?

Open weight means Meta releases the trained model weights publicly so you can download and run the model yourself, on your own hardware or cloud infrastructure. It differs from fully open source (where training code and data are also shared) and from a closed API (where you never see the weights). You control the model, but the Llama 4 Community License and Meta's Acceptable Use Policy still restrict certain uses.

Can I run Llama 4 on a consumer GPU?

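Only with compromises. Scout has 109B total parameters, so even at 4-bit quantization the weights take roughly 55GB or more. That fits on a single 80GB H100 but not in the 24GB of an RTX 4090, where part of the model has to be offloaded to system RAM at a significant speed cost. At 16-bit precision Scout's weights need roughly 220GB, which means several 80GB GPUs. Maverick requires a multi-GPU server. Behemoth is not publicly available and is not practical for consumer hardware.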

Is Llama 4 free to use commercially?

Llama 4 is available under Meta's Llama 4 Community License, which allows commercial use for most businesses. However, companies with more than 700 million monthly active users must request a separate license from Meta. Read the license at llama.com before building a commercial product.

How does Mixture of Experts make Llama 4 efficient?

MoE models contain many specialized "expert" sub-networks but only activate a small subset for each token. Llama 4 Scout has 16 experts but activates just a few per token, so the compute per token matches a ~17B model even though total parameters are 109B. This means you get the knowledge capacity of a much larger model at the inference cost of a smaller one.
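The routing idea can be sketched in a few lines of plain Python. This is a toy illustration, not Llama 4's actual router: the random scores stand in for a learned gating layer, and `top_k=1` is an assumed value chosen for the example.

```python
import math
import random


def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=1):
    """Pick the top_k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


NUM_EXPERTS = 16  # Scout's expert count, as stated above

# Stand-in for the scores a learned router would produce for one token
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(scores, top_k=1)  # top_k=1 is an assumption
active_fraction = len(selected) / NUM_EXPERTS

print("experts used for this token:", selected)
print("fraction of experts active:", active_fraction)
```

The fraction of experts used is smaller than the fraction of parameters used (17B of 109B), because attention layers and other shared components run for every token regardless of which expert is selected.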

What is fine-tuning and when should I do it with Llama 4?

Fine-tuning adapts a pretrained model to your specific task, style, or domain by training it on your own data. Use fine-tuning when the base model consistently misses domain-specific terminology, when you need a specific output format it can't learn from prompting alone, or when you need to embed proprietary knowledge. LoRA/QLoRA makes this practical without full-model training costs.
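To see why LoRA is so much cheaper than full fine-tuning, it helps to count trainable parameters for a single weight matrix. The sketch below uses a hypothetical 4096 x 4096 projection (not a figure taken from Llama 4's architecture) and a rank of 16.

```python
def full_params(d_in, d_out):
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """LoRA trains two small matrices instead: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


D_IN = D_OUT = 4096  # hypothetical layer size, for illustration only
RANK = 16

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={RANK}):    {lora:,} trainable parameters")
print(f"LoRA trains {ratio:.2%} of the matrix")
```

QLoRA applies the same adapters on top of a base model stored in 4-bit form, which reduces the memory needed to hold the frozen weights as well.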

What is the difference between self-hosting Llama and using it via API?

Self-hosting means you run the model on your own hardware (or rented cloud GPUs): full control, no per-token cost, complete data privacy, and the ability to fine-tune. Using Llama via an API (Together AI, Groq, Fireworks) is easier to start with, but you pay per token and have less control over your data. Self-hosting makes economic sense at high volume and when data privacy is non-negotiable.

How does Llama 4 compare to Claude Fable 5 and GPT-5.6?

On key benchmarks, Llama 4 Maverick is competitive with GPT-4o and Gemini 2.0 Flash. Claude Fable 5 and GPT-5.6 Sol outperform Llama 4 on complex reasoning, instruction following, and agentic tasks — but they cost money per token and you cannot customize them. For cost-sensitive production workloads where you can self-host, Llama 4 closes the gap substantially compared to previous open-weight generations.

Meta Llama 4 Complete Guide 2026 — Scout, Maverick, Local Setup, and Benchmarks | explainx.ai Blog


In April 2025, Meta released Llama 4, and the open-weight AI landscape changed materially. The Scout model fits on a single high-end GPU, processes 10 million tokens of context, and approaches the performance of models that are available only through paid, per-token APIs. The Maverick variant, a 400B-total-parameter Mixture of Experts model, outperformed GPT-4o on several major benchmarks at launch.

This guide gives you a complete picture of the Llama 4 family: how the models work, how to run them, how to fine-tune them, and when open weights are the right choice over a closed API.

Why Llama Matters

Meta's decision to release Llama's weights publicly is one of the most consequential choices in AI's recent history. Previous open-weight models were smaller and less capable than frontier closed models. Llama 4 changed that calculation.

Three things make Llama strategically important:

It decouples capability from API dependency. When you use GPT-4o or Claude, you depend on OpenAI or Anthropic for uptime, pricing, content policy, and data handling. With Llama 4's weights, you own the model. You can run it in an air-gapped environment, fine-tune it on proprietary data, and scale inference without per-token costs.

It changes the economics of AI at scale. At low to moderate volume, paying $0.30–$10 per million tokens is fine. At very high volume (millions of requests per day) the economics shift. A company processing 1 billion tokens per day at $1/M pays $1,000 per day, or about $30,000 per month. Running equivalent Llama 4 inference on owned hardware can cost a fraction of that once infrastructure is amortized.

It enables customization that closed APIs cannot match. No amount of prompt engineering matches what fine-tuning on 50,000 domain-specific examples achieves. Llama 4's open weights make fine-tuning accessible.
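The volume argument in the second point comes down to simple arithmetic. A minimal sketch follows. The 1 billion tokens per day and $1 per million tokens are the figures used above, while the monthly infrastructure cost is a made-up placeholder that you would replace with your own amortized hardware, power, and staffing numbers.

```python
def api_cost_per_day(tokens_per_day, price_per_million):
    """Daily API spend in dollars at a flat per-million-token price."""
    return tokens_per_day / 1_000_000 * price_per_million


def break_even_tokens_per_day(monthly_infra_cost, price_per_million, days=30):
    """Daily token volume at which API spend equals a fixed monthly infra cost."""
    return monthly_infra_cost / days / price_per_million * 1_000_000


daily = api_cost_per_day(1_000_000_000, 1.00)
monthly = daily * 30

# Hypothetical all-in monthly cost of self-hosting; not a figure from this guide
HYPOTHETICAL_MONTHLY_INFRA = 12_000.0
threshold = break_even_tokens_per_day(HYPOTHETICAL_MONTHLY_INFRA, 1.00)

print(f"API cost: ${daily:,.0f}/day, ${monthly:,.0f}/month")
print(f"Break-even volume: {threshold:,.0f} tokens/day")
```

Below the break-even volume the API is cheaper. Above it, every additional token widens the gap in favor of self-hosting, provided the hardware can actually sustain that throughput.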

Meta's open-source bet is also strategic: the more the AI ecosystem builds on Llama, the harder it is for OpenAI or Anthropic to achieve a monoculture. Meta doesn't sell AI APIs — it sells advertising, and a healthy AI ecosystem that keeps competitors from dominating cloud AI serves Meta's interests.

The Llama 4 Model Family

Llama 4 Scout

Scout is the model most people can realistically run. It uses a Mixture of Experts architecture with 17 billion active parameters across 16 experts (109 billion total parameters) and processes a 10 million token context window — the longest context window of any open-weight model at launch.

Scout was designed to fit on a single Nvidia H100 GPU with Int4-quantized weights. On consumer-grade 24GB GPUs like the RTX 4090, a 4-bit build (Q4_K_M) does not fit entirely in VRAM, so part of the model must be offloaded to system RAM, with a noticeable speed penalty.

The 10M token context is genuinely useful for software engineering tasks — you can load entire large repositories as context without chunking. For research, you can load thousands of papers. No other open-weight model comes close to this context length.

Model	Active Params	Total Params	Experts	Context Window	Hardware Minimum	Multimodal
Llama 4 Scout	17B	109B	16	10M tokens	1× H100 (80GB)	Yes
Llama 4 Maverick	17B	400B	128	1M tokens	4× H100	Yes
Llama 4 Behemoth	288B	~2T	16	TBD	Not public	Yes

Scenario	Recommendation
Processing <1M tokens/day	API model (lower cost)
Processing >100M tokens/day	Self-hosted Llama (lower marginal cost)
Healthcare / legal / classified data	Self-hosted (data never leaves)
No internet access environment	Self-hosted (only option)
Need custom fine-tuning	Self-hosted Llama
Need best reasoning quality	Claude Fable 5 or GPT-5.6

Setup	Approx. Weight Memory (weights only)	Example Hardware
Scout Q4_K_M (4-bit)	~55GB+	1× H100 80GB (24GB cards such as the RTX 4090 need CPU offload)
Scout 16-bit	~220GB	4× H100 80GB
Maverick Q4_K_M (4-bit)	~200GB+	4× H100 80GB
Maverick 16-bit	~800GB	More than one 8× H100 node
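A quick way to sanity-check memory figures like these is to multiply the total parameter count by the bytes stored per weight. The sketch below does that for the parameter counts quoted in this guide. It counts weights only and ignores KV cache, activations, and quantization overhead, so real requirements are higher. It uses total rather than active parameters, on the assumption that every expert has to be held in memory even though only some run per token.

```python
def weight_memory_gb(total_params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Total parameter counts as quoted in this guide
MODELS = {"Scout": 109, "Maverick": 400}
PRECISIONS = {"4-bit": 4, "16-bit": 16}

estimates = {
    (model, label): weight_memory_gb(params, bits)
    for model, params in MODELS.items()
    for label, bits in PRECISIONS.items()
}

for (model, label), gb in estimates.items():
    print(f"{model:9s} {label:7s} ~{gb:,.1f} GB of weights")
```

Comparing each estimate with the combined VRAM of your GPUs, and leaving headroom for the KV cache, gives a first-pass answer to whether a configuration can work at all.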

python

from openai import OpenAI

# Point the OpenAI client at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not validated
)

response = client.chat.completions.create(
    model="llama4:scout",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the key differences between Llama 4 Scout and Maverick?"}
    ]
)

print(response.choices[0].message.content)

bash

pip install vllm

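# Serve Llama 4 Scout with vLLM.
# The 16-bit weights (~220GB) do not fit on one 80GB GPU, so set
# --tensor-parallel-size to the number of GPUs you are sharding across.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --port 8000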

python

from huggingface_hub import snapshot_download

# Download Scout Instruct weights
snapshot_download(
    repo_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    local_dir="./models/llama4-scout",
    ignore_patterns=["*.bin"]  # Only download safetensors format
)

Benchmark	Llama 4 Scout	Llama 4 Maverick	GPT-4o	Claude Fable 5
MMLU (General Knowledge)	79.6	85.5	88.7	89.4
MATH (Math Reasoning)	74.3	84.8	76.6	88.2
HumanEval (Coding)	72.1	82.3	90.2	91.7
MMMU (Multimodal)	69.4	80.5	77.2	79.8

Provider	Llama 4 Scout Input	Llama 4 Scout Output	Notes
Together AI	$0.18/M	$0.59/M	Fast inference, good reliability
Fireworks AI	$0.20/M	$0.70/M	Excellent latency, serverless
Groq	$0.11/M	$0.34/M	Fastest inference available
AWS Bedrock	$0.25/M	$0.85/M	Enterprise compliance, AWS native
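Because input and output tokens are priced differently, the cheapest provider for a given workload depends on the mix. The sketch below applies the prices from the table above to a hypothetical monthly workload of 100M input tokens and 20M output tokens. The workload numbers are placeholders, and the prices are a snapshot that should be checked against each provider's current pricing.

```python
# (input $/M tokens, output $/M tokens), copied from the table above
PRICES = {
    "Together AI": (0.18, 0.59),
    "Fireworks AI": (0.20, 0.70),
    "Groq": (0.11, 0.34),
    "AWS Bedrock": (0.25, 0.85),
}


def monthly_cost(input_millions, output_millions, prices):
    """Total monthly spend in dollars for a given token mix."""
    input_price, output_price = prices
    return input_millions * input_price + output_millions * output_price


# Hypothetical workload: 100M input and 20M output tokens per month
INPUT_M, OUTPUT_M = 100, 20

costs = {name: monthly_cost(INPUT_M, OUTPUT_M, p) for name, p in PRICES.items()}
cheapest = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:13s} ${cost:,.2f}/month")
print("cheapest for this mix:", cheapest)
```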

python

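import os

from openai import OpenAI

# Using Groq's OpenAI-compatible API for fast Llama 4 inference
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],  # Read the key from the environment instead of hardcoding it
)

response = client.chat.completions.create(
    # Model IDs change over time; confirm the current one in Groq's model list
    model="meta-llama/llama-4-scout-17b-16e-instruct",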
    messages=[
        {"role": "user", "content": "Summarize the key advantages of MoE architecture."}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)

python

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Scout with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,             # LoRA rank — higher = more capacity but more params
    lora_alpha=16,    # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)

# Load your dataset: JSONL, one object per line, with each full training example in a "text" field
dataset = load_dataset("json", data_files="my_training_data.jsonl")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=8192,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="./llama4-scout-finetuned",
        bf16=True,  # Llama 4 weights are bfloat16; needs an Ampere-or-newer GPU (A100, H100, RTX 30/40 series)
    ),
)

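trainer.train()

# Save the LoRA adapter weights and the tokenizer (this is not a merged full model)
model.save_pretrained("./llama4-scout-finetuned")
tokenizer.save_pretrained("./llama4-scout-finetuned")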
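The training script above reads `my_training_data.jsonl` and expects each record to carry the full example in a `text` field. One way to produce such a file is sketched below. The prompt template and the two sample pairs are hypothetical placeholders, not a format prescribed by Meta or Unsloth.

```python
import json


def to_text(instruction, response):
    """Join one instruction/response pair into a single training string.

    The "### Instruction / ### Response" layout is an illustrative template only.
    """
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


def write_jsonl(path, pairs):
    """Write one JSON object per line, each with a single "text" field."""
    with open(path, "w", encoding="utf-8") as f:
        for instruction, response in pairs:
            record = {"text": to_text(instruction, response)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder examples; replace with your own domain data
pairs = [
    ("Summarize the ticket in one sentence.", "Customer cannot reset their password."),
    ("Classify the sentiment: 'Great support!'", "positive"),
]

write_jsonl("my_training_data.jsonl", pairs)
```

For an instruct model it is usually better to render examples with the tokenizer's own chat template (for example via `tokenizer.apply_chat_template`) so the training text matches what the model sees at inference time.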

bash

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull Scout (downloads ~50GB quantized)
ollama pull llama4:scout

# 3. Chat with it
ollama run llama4:scout "What can you help me with today?"

Meta Llama 4: The Complete Open-Source AI Model Guide 2026

Why Llama Matters

The Llama 4 Model Family

Llama 4 Scout

Llama 4 Maverick

Llama 4 Behemoth

Model Comparison Table

Mixture of Experts — What It Means for Llama 4

How MoE Works

Why "17B Active" Matters in Practice

Open Weights vs API Models — The Real Tradeoffs

What Open Weights Give You

What You Give Up

When Open Weights Win

Running Llama 4 Locally

Hardware Requirements

Ollama — The Easiest Path

LM Studio — GUI for Local Models

vLLM — Production Serving

Downloading Model Weights

Llama 4 Benchmarks — Honest Comparison

Where Llama 4 Holds Up

Where Llama 4 Still Trails

Llama 4 via API (Without Self-Hosting)

Fine-Tuning Llama 4 on Your Own Data

LoRA and QLoRA — The Efficient Path

What You Need

Example Fine-Tuning Setup (Unsloth)

When Fine-Tuning Makes Sense

The Llama Ecosystem

Hugging Face

Meta AI Studio

Community Derivatives

When to Choose Llama Over Closed Models

Getting Started Today
