
What are parameters in a large language model? Billions, MoE, and what 2026 model cards really say

Model parameters are the learned numbers inside a neural net—roughly, how big the model is. Here is a clear picture of total vs active parameters, why frontier APIs often hide counts, and a table of top models with public figures (Meta Llama 4) next to the undisclosed front tier.

4 min read · ExplainX Team

LLM basics · Model parameters · MoE · Meta Llama · AI research · Anthropic · OpenAI



Parameters (often billions of parameters, or B / bn) are the usual shorthand for the size of a neural language model in learned weights. They are not the same as tokens (text units) and not the same as context length (how much text fits in one request), but all three get compared when people discuss GPT-class, Claude-class, Gemini-class, and open-weight models like Llama.

This article hews to what vendors publish in 2026 and to one open line with full public tables: Meta’s Llama 4 on the official model card. Frontier APIs often list behavior and limits without a single headline parameter count; we cover why below.


What “parameters” means in one paragraph

A transformer-style LLM is a stack of layers that transform vectors representing tokens. Pretraining and fine-tuning adjust the entries of large weight matrices (and related biases) so the model improves at next-token prediction, tool use, or multimodal tasks—depending on the architecture.

Parameter count is how many of those scalar weights sit in the shipped checkpoint. Some model cards also break out separate totals for a tokenizer, vision tower, or audio encoder—read the specific card for the SKU you run.
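To make "count of scalar weights" concrete, here is a minimal sketch that tallies the weight matrices of a simplified transformer. The formulas (four attention projections, two feedforward matrices, one embedding table) are a standard back-of-envelope, and the dimensions below are illustrative, not taken from any vendor's card; biases, norms, and output heads are deliberately omitted.

```python
# Rough parameter count for a simplified transformer; dimensions are
# illustrative. Biases, layer norms, and the LM head are omitted.
def block_params(d_model: int, d_ff: int) -> int:
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff      # up- and down-projection matrices
    return attn + ffn

def model_params(n_layers: int, d_model: int, d_ff: int, vocab: int) -> int:
    embed = vocab * d_model       # token embedding table
    return embed + n_layers * block_params(d_model, d_ff)

# Toy 12-layer model with GPT-2-small-like dimensions:
print(model_params(12, 768, 3072, 50257) / 1e6)  # ≈ 123.5 (millions)
```

Even this crude tally lands near GPT-2 small's widely cited ~124M, which is why "multiply the matrix shapes" remains a useful sanity check when reading a model card.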


Why people still talk in “billions”

  1. Capacity — With similar data and training, a larger weight budget can represent richer patterns; in practice, data and recipe still dominate outcomes.
  2. Serving cost — More weights (especially active per forward pass) tend to mean more FLOPs and memory at inference, though quantization and hardware matter.
  3. MoE (mixture of experts) — A model can have a huge total while routing each token through only a subset of “expert” blocks, so active width is the better first-order handle on per-step compute.

Scaling laws in research usually relate loss to compute, data, and size together; a headline “B” count is one line in a larger system.
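One well-known form of such a law is the Chinchilla-style fit from Hoffmann et al., where loss depends jointly on parameters N and training tokens D. The sketch below uses the constants published in that paper as illustrative defaults; treat them as a shape for intuition, not a predictor for any current model.

```python
# Chinchilla-style scaling law: L(N, D) = E + A / N^alpha + B / D^beta.
# Defaults are the fitted constants from Hoffmann et al. (2022),
# used here purely for illustration.
def loss(N: float, D: float,
         E: float = 1.69, A: float = 406.4, B: float = 410.7,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    return E + A / N**alpha + B / D**beta

# Doubling parameters alone moves loss less than scaling data with it:
print(loss(70e9, 1.4e12))   # a dense 70B model on 1.4T tokens
print(loss(140e9, 1.4e12))  # 2x the parameters, same data
print(loss(140e9, 2.8e12))  # 2x parameters and 2x data
```

The point of the exercise: in this functional form, N is only one of two levers, which is exactly why a headline "B" count underdetermines quality.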


Total vs “active” parameters: MoE in plain terms

In a dense model, “70B parameters” generally means on the order of 70B weights on the main path of each token (implementation details aside).

In an MoE design, many parallel feedforward experts exist, but a router sends each token to one or a few of them. Cards often list total parameters (all experts) and activated parameters (roughly what runs for a typical forward pass).
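The routing idea can be shown in a few lines. This is a toy top-k router over made-up scores, not any vendor's implementation: real routers are small learned layers trained jointly with the experts, and the 16-expert count below is only an echo of the Scout-style notation.

```python
import math
import random

# Toy MoE router: each token activates only the top-k experts by
# softmax score. Expert count and k are illustrative.
def route(token_logits: list[float], k: int = 2) -> list[int]:
    m = max(token_logits)
    exp = [math.exp(x - m) for x in token_logits]  # stable softmax
    total = sum(exp)
    probs = [e / total for e in exp]
    # indices of the k highest-probability experts
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(16)]  # pretend: 16 experts
print(route(logits, k=1))  # only 1 of 16 expert FFNs runs for this token
```

With top-1 routing over 16 experts, roughly 1/16 of the expert weights participate per token, which is why "activated" can sit far below "total" on an MoE card.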

Meta Llama 4 (from the model card table, April 2025 release; confirm on GitHub for updates):

| Model | Activated (per card) | Total (per card) | Context length (per card) |
|---|---|---|---|
| Llama 4 Scout (17B × 16E) | 17B | 109B | 10M tokens |
| Llama 4 Maverick (17B × 128E) | 17B | 400B | 1M tokens |

E denotes expert count in Meta’s notation. Scout and Maverick share the same activated width in this table but differ in total size and in context length by design. Always re-read the model card for the exact checkpoint you deploy.
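The practical difference between total and activated shows up in memory. A back-of-envelope, using the card figures above and the standard bytes-per-weight for common precisions (2 for bf16/fp16, 1 for int8, 0.5 for 4-bit); actual serving memory also needs KV cache, activations, and framework overhead, which this deliberately ignores.

```python
# Weight-only memory estimate: parameters (in billions) times
# bytes per parameter. Ignores KV cache and runtime overhead.
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param  # 1e9 params * B bytes = B GB

# Llama 4 Scout per its card: 109B total, 17B activated.
print(weight_gb(109, 2.0))  # 218.0 GB to *store* all experts in bf16
print(weight_gb(17, 2.0))   # 34.0 GB of weights touched per token, roughly
print(weight_gb(109, 0.5))  # 54.5 GB on disk at 4-bit quantization
```

So the 109B total governs what you download and hold, while the 17B activated is the better handle on per-step compute, which is the asymmetry the "total vs activated" columns encode.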


Frontier API models: strong specs, often no public parameter line

  • OpenAI documents GPT-5.4 and lists model behavior, context, and API model pages—without a public total parameter count in the same way open releases do.
  • Anthropic publishes Claude Opus 4.7 and a models overview with context, pricing, and features—not a “N billion parameters” headline.
  • Google DeepMind lists Gemini 3.1 Pro capabilities, modalities, and context—again, typically without a full parameter count in the consumer-facing card.

If you see a billion-scale number for a closed model in a third-party post, treat it as analysis or speculation unless the vendor or a vetted system card states it.


How to use parameter counts in practice

  • Open-weight models (Llama, others): the model card, license, and memory notes tell you whether a run fits your GPUs. The total parameter count drives download and storage size, while the activated count and your quantization choice drive runtime memory and speed.
  • APIs: use vendor docs for latency, context window, tools, and $/M tokens (tokens explainer, Caveman economics).
  • Benchmarks: treat headline size as weak evidence without measurements on your task and data.
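For the API case, per-request cost is simple arithmetic on token counts, and worth sketching because it depends on usage, not on parameters at all. The prices below are placeholders, not any vendor's real rates.

```python
# Toy API cost estimate. Prices are hypothetical placeholders;
# check the vendor's pricing page for real $/M-token rates.
def request_cost(tokens_in: int, tokens_out: int,
                 usd_per_m_in: float, usd_per_m_out: float) -> float:
    return (tokens_in / 1e6) * usd_per_m_in + (tokens_out / 1e6) * usd_per_m_out

# A 100k-token prompt with a 2k-token answer at a made-up $3 in / $15 out:
print(round(request_cost(100_000, 2_000, 3.0, 15.0), 3))  # 0.33
```

Note that nothing in this calculation references parameter count: for closed APIs, context window, price, and latency are the numbers that actually constrain you.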


Parameter and architecture details change with each release. Prefer the model card and API docs of the specific checkpoint you use.
