What is knowledge distillation in AI?

Knowledge distillation is a technique where a smaller "student" model is trained to replicate the outputs and behavior of a larger "teacher" model. Rather than learning from raw data alone, the student learns from the teacher's soft probability distributions — which contain richer signal than hard labels. The result is a model that is faster and cheaper to run while retaining much of the teacher's capability.

Is distillation legal?

It depends on the terms of service of the model being distilled and whether the distillation involves unauthorized access. Distilling from publicly available open-weight models is generally legal, subject to each model's license. Distilling from a closed API by systematically querying it to collect training data is explicitly prohibited by most major AI providers' terms of service, and Anthropic has pursued legal action against Alibaba, alleging that it did exactly this with Claude.

How did Alibaba distill Claude Fable 5?

According to Anthropic's legal complaint, operators linked to Alibaba's Qwen team created nearly 25,000 fraudulent accounts and used them to run 28.8 million Claude API interactions. These interactions were used as training data to improve Qwen models — effectively extracting Claude's knowledge through its API without permission.

What is the difference between distillation and fine-tuning?

Fine-tuning adapts a pre-trained model to a specific task using new labeled data. Distillation transfers general capability from a larger model to a smaller one using the larger model's outputs as training signal. Fine-tuning changes what a model knows; distillation changes how efficiently it knows it.

Can you distill Claude or GPT-4 legally?

Not via their APIs. Both Anthropic and OpenAI explicitly prohibit using API outputs to train competing models. Open-weight models like Llama 4 or DeepSeek V4 can be distilled within the limits of their respective licenses. Permissive licenses (Apache 2.0, MIT) allow commercial distillation; others (like some Meta Llama licenses) restrict commercial use above certain scales.

What Is AI Distillation? Knowledge Transfer, Model | explainx.ai Blog

TL;DR

Distillation is the technique of training a small, efficient "student" model to replicate the behavior of a large, expensive "teacher" model. It's one of the foundational techniques that has made modern AI practical: many of the small models you run locally are at least partially a product of distillation. In 2026, it's also at the centre of one of the most significant legal disputes in AI. Anthropic accused Alibaba of systematically distilling Claude Fable 5 using nearly 25,000 fake accounts and 28.8 million unauthorized API calls.

The Core Idea: Teacher and Student

The original knowledge distillation paper — Hinton, Vinyals, and Dean, 2015 — introduced a deceptively simple idea: you can learn more from a model's probability distribution than from its final answer.

When a classifier labels an image of a dog, the hard answer is just "dog." But the model's full probability output across all classes might look like: dog 92%, wolf 5%, cat 2%, other 1%. That soft distribution encodes something the hard label doesn't — the model knows a dog is more similar to a wolf than to a cat. A student trained on these soft probabilities learns more efficiently than one trained on raw labels alone.

Geoffrey Hinton called this "dark knowledge" — the information embedded in the wrong answers that reveals how the model internally represents the world.
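To make the soft-label idea concrete, here is a minimal sketch in plain Python. The logits are hypothetical, picked so that a temperature of 1 reproduces the dog/wolf/cat percentages from the example; the temperature parameter is the softening knob from the Hinton et al. paper, and the value 4 is just an illustrative choice.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["dog", "wolf", "cat", "other"]
# Hypothetical teacher logits (an assumption for illustration):
# log-probabilities, so T=1 gives back 92% / 5% / 2% / 1%.
teacher_logits = [math.log(0.92), math.log(0.05), math.log(0.02), math.log(0.01)]

# The hard label keeps only the winner and throws the rest away.
hard_label = classes[teacher_logits.index(max(teacher_logits))]

# The soft targets keep the full ranking of "wrong" answers.
soft_t1 = softmax(teacher_logits, temperature=1.0)
soft_t4 = softmax(teacher_logits, temperature=4.0)

print("hard label:", hard_label)
for name, p1, p4 in zip(classes, soft_t1, soft_t4):
    print(f"{name:>5}: T=1 {p1:.3f}   T=4 {p4:.3f}")
```

Raising the temperature shrinks the gap between "dog" and everything else while preserving the order wolf > cat > other, which is exactly the similarity structure a student can learn from.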

How Distillation Works in Practice

Step 1: Pick a teacher

The teacher is a large model such as GPT-4, Claude Fable 5, or Llama 4 70B. It's highly capable but expensive to run.

Step 2: Generate outputs from the teacher

You feed inputs through the teacher and collect its outputs. For a language model with white-box access, this means collecting the token-level probability distributions (computed from the logits) across thousands or millions of prompts. With API-only access, you typically only get the generated text. The richer and more diverse the prompt distribution, the more of the teacher's knowledge you can transfer.
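As a rough sketch of what the collection step produces, the snippet below builds one record per prompt containing the teacher's soft targets. Everything here is a toy: `toy_teacher_logits`, the three-word vocabulary, and the prompts are hypothetical stand-ins, not any real model or API, and a real language-model teacher would emit one distribution per generated token rather than one per prompt.

```python
import json
import math

VOCAB = ["yes", "no", "maybe"]  # toy vocabulary, purely illustrative

def toy_teacher_logits(prompt):
    """Hypothetical stand-in for a real teacher's forward pass.

    Returns one deterministic logit vector derived from the prompt text.
    """
    seed = sum(ord(ch) for ch in prompt)
    return [((seed * (i + 3)) % 7) - 3.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def collect_teacher_outputs(prompts):
    """Build one training record per prompt: the input plus the teacher's soft targets."""
    records = []
    for prompt in prompts:
        probs = softmax(toy_teacher_logits(prompt))
        records.append({"prompt": prompt, "teacher_probs": dict(zip(VOCAB, probs))})
    return records

prompts = ["Is water wet?", "Will it rain tomorrow?", "Is 7 a prime number?"]
dataset = collect_teacher_outputs(prompts)
for record in dataset:
    print(json.dumps(record))  # one JSON line per record, ready to store
```

The shape of the record is the point: each training example pairs an input with a full distribution instead of a single label.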

Step 3: Train the student

The student is a smaller architecture — maybe 7B parameters instead of 70B. You train it to minimize the difference between its own output distributions and the teacher's. The loss function combines:

Distillation loss — how closely the student's soft outputs match the teacher's

Task loss — how well the student performs on ground truth labels (optional, depending on the approach)
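One common way to combine the two terms can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the weighting parameter `alpha`, and the default temperature of 2 are choices made for this example, and the toy math does not guard against extreme logits. The T² scaling on the soft term follows the suggestion in Hinton et al. (2015).

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

def task_loss(student_logits, true_index):
    """Cross-entropy against the ground-truth class at temperature 1."""
    return -math.log(softmax(student_logits)[true_index])

def combined_loss(student_logits, teacher_logits, true_index,
                  temperature=2.0, alpha=0.5):
    """Weighted sum of the soft (teacher) term and the hard (label) term."""
    # T^2 keeps the soft-target gradients on a comparable scale as T changes.
    soft = (temperature ** 2) * distillation_loss(
        student_logits, teacher_logits, temperature)
    hard = task_loss(student_logits, true_index)
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical logits over four classes; class 0 is the ground truth.
teacher_logits = [4.0, 1.5, 0.5, -1.0]
close_student = [3.5, 1.2, 0.6, -0.8]
far_student = [0.0, 0.0, 3.0, 0.0]

print("close student:", combined_loss(close_student, teacher_logits, 0))
print("far student:  ", combined_loss(far_student, teacher_logits, 0))
```

Setting `alpha=1.0` drops the task loss entirely, which corresponds to the "optional" case where the student learns from the teacher alone.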

Step 4: Evaluate and iterate

The student won't be as capable as the teacher on everything, but it will be much faster and cheaper to run. The goal is maximizing capability retention per unit of compute cost.
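"Capability retention per unit of compute cost" can be turned into a simple ratio for comparing candidate students. The sketch below is one possible way to define it, not a standard metric, and every number in it is a placeholder for illustration rather than a measurement of any real model.

```python
def capability_retention(student_score, teacher_score):
    """Share of the teacher's benchmark score that the student keeps."""
    return student_score / teacher_score

def relative_cost(student_cost, teacher_cost):
    """Student's cost as a fraction of the teacher's, in the same unit."""
    return student_cost / teacher_cost

def retention_per_cost(student_score, teacher_score, student_cost, teacher_cost):
    """Retention divided by relative cost; higher means a better trade-off."""
    return (capability_retention(student_score, teacher_score)
            / relative_cost(student_cost, teacher_cost))

# Placeholder numbers for illustration only, not real benchmark results.
teacher = {"score": 0.80, "cost": 10.0}
candidates = {
    "student_small": {"score": 0.60, "cost": 0.5},
    "student_medium": {"score": 0.72, "cost": 1.0},
}

for name, s in candidates.items():
    kept = capability_retention(s["score"], teacher["score"])
    ratio = retention_per_cost(s["score"], teacher["score"],
                               s["cost"], teacher["cost"])
    print(f"{name}: keeps {kept:.0%} of teacher score, "
          f"retention per relative cost {ratio:.1f}")
```

A ratio like this tends to favor the smallest student, so in practice you would likely pair it with a minimum acceptable retention before picking a winner.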

Model	Teacher	Parameters	Distillation Type	License
DeepSeek-R1-Distill-Llama-70B	DeepSeek R1	70B	Chain-of-thought	MIT
DeepSeek-R1-Distill-Qwen-7B	DeepSeek R1	7B	Chain-of-thought	MIT
Phi-4 (Microsoft)	GPT-4 synthetic data	14B	Response distillation	MIT
DistilBERT	BERT-base	66M	Feature + logit	Apache 2.0
Llama 3.1-8B	Llama 3.1-70B (partially)	8B	Mixed	Meta license

What Is AI Distillation? Knowledge Transfer, Model Compression, and the Fable 5 Controversy

A Brief History of Distillation

2006 — Model compression

2015 — "Distilling the Knowledge in a Neural Network" (Hinton et al.)

2019 — DistilBERT (Hugging Face)

2020–2022 — GPT distillation at scale

2023 — Alpaca and the era of instruction distillation

2024 — Phi models (Microsoft)

2025 — DeepSeek R1 and reasoning distillation

2026 — Anthropic sues Alibaba over Fable 5 distillation

Types of Distillation

Response distillation (black-box)

Logit distillation (white-box)

Feature distillation

Chain-of-thought distillation

Speculative decoding (a different kind of distillation use)

Why Distillation Is So Powerful

Capability compression

Data efficiency

Reasoning transfer

Cost collapse

The Legal and Ethical Dimension

Scenario 1: Distilling your own open-weight model (clearly fine)

Scenario 2: Distilling from open-weight models with restrictive licenses (depends)

Scenario 3: Distilling from closed APIs (prohibited)

Distillation and the Open-Source Debate

Fable 5 in Context: What Was Being Distilled

Practical Distillation: What's Actually Available Today

Bottom Line