Proxy-KD (Proxy-based Knowledge Distillation) is a method from Chen et al. (arXiv:2401.07013, revised November 2024) that distills capability from closed API models like GPT-4 into smaller open models. It inserts a white-box proxy LLM between the black-box teacher and the student: the proxy is first aligned to the teacher's outputs (including via Direct Preference Optimization, DPO), then produces token-level distributions the student can learn from. This approximates white-box logit distillation without access to the teacher's weights.

Why is Proxy-KD trending after the Fable 5 distillation scandal?

Hacker News resurfaced the paper in June 2026 after Anthropic accused Alibaba-linked operators of running ~25,000 fraudulent accounts and 28.8 million Claude API exchanges (April 22–June 5) to extract frontier capabilities for Qwen. Commenters noted the irony: Alibaba co-authored research on improving black-box distillation while Anthropic alleges industrial-scale extraction from Claude Fable-class models.

Who wrote the Proxy-KD paper?

Hongzhan Chen, Runjun Chen, Yuqi Yi, Xiaojun Quan (Sun Yat-sen University), Chenliang Li, Ming Yan, and Ji Zhang (Alibaba Group). The paper was submitted January 13, 2024 (v1) and last revised November 9, 2024 (v2). PDF: https://arxiv.org/pdf/2401.07013

How is Proxy-KD different from Alpaca-style distillation?

Alpaca-style methods fine-tune a student on hard text outputs from a black-box teacher — response distillation only. Proxy-KD adds an intermediate proxy that learns soft token distributions aligned to the teacher, then distills those distributions to the student with sample-level weighting. The authors report Proxy-KD beating both plain black-box KD and traditional white-box KD on their benchmarks.
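To make the contrast concrete, here is a minimal sketch of the two training signals at a single token position. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the logits are toy lists rather than model outputs, and the sample weights are plain numbers supplied by the caller (how the paper computes them is not covered in this post).

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, target_token):
    """Alpaca-style signal: cross-entropy against the one token the teacher emitted."""
    probs = softmax(student_logits)
    return -math.log(probs[target_token])

def soft_kl_loss(proxy_logits, student_logits):
    """Soft signal: KL(proxy || student) over the whole toy vocabulary."""
    p = softmax(proxy_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_distill_loss(samples):
    """Weighted average of per-sample KL losses.

    samples: list of (weight, proxy_logits, student_logits) tuples.
    The weights are hypothetical inputs here, e.g. a score for how much
    the proxy's distribution on that sample should be trusted.
    """
    total_weight = sum(w for w, _, _ in samples)
    return sum(w * soft_kl_loss(p, s) for w, p, s in samples) / total_weight

# Usage: a sample where the student already matches the proxy contributes
# zero loss; a mismatched sample with a larger weight dominates the average.
samples = [
    (1.0, [2.0, 0.0], [2.0, 0.0]),  # student agrees with proxy
    (3.0, [2.0, 0.0], [0.0, 2.0]),  # student disagrees, higher weight
]
loss = weighted_distill_loss(samples)
```

The hard-label loss only sees which token was chosen, while the KL loss also penalizes the student for mis-ranking the tokens that were not chosen. That extra information is what the proxy is meant to supply.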

Is API distillation legal?

Research publication is legal; industrial extraction at scale against API terms of service is not. Anthropic and OpenAI explicitly prohibit using API outputs to train competing models. Anthropic's June 10, 2026 Senate Banking letter framed the Alibaba-linked campaign as the largest distillation attack it has detected — a policy and enforcement issue separate from the underlying ML technique.

What is Pre-trained Distillation from Turc et al. 2019?

Turc, Chang, Lee, and Toutanova (arXiv:1908.08962) showed that pre-training compact models before distilling from a larger fine-tuned teacher beats many compression tricks — and that pre-training plus distillation compound even on the same data. HN commenters cited it alongside Proxy-KD as foundational reading on why small models + teacher outputs punch above their weight.

Proxy-KD: black-box LLM distillation explained | explainx.ai Blog

A January 2024 research paper is back in circulation on Hacker News — not because it is new, but because it describes exactly the capability-extraction playbook at the center of Anthropic's Fable 5 distillation scandal.

The paper: Knowledge Distillation of Black-Box Large Language Models (Chen et al., arXiv:2401.07013, v2 revised November 9, 2024). The method: Proxy-KD. The authors: researchers from Sun Yat-sen University and Alibaba Group.

Anthropic accused Alibaba-linked operators of running ~25,000 fraudulent accounts and 28.8 million Claude exchanges to distill frontier capabilities into Qwen, so the HN thread's subtext is obvious. One commenter put it plainly: "My best guess is this is a reference to the recent accusations from Anthropic of Chinese labs distilling on their models."

This post explains what Proxy-KD actually does, how it differs from the crude extraction Anthropic described, and why the Fable 5 export ban sits downstream of the same economic force.

TL;DR — questions people are asking

Type	Teacher access	What the student learns	Examples
White-box KD	Weights + logits + hidden states	Full distributions, features, attention	DistilBERT, self-distillation within open models
Black-box KD	API outputs only	Hard completions, rationales	Alpaca (GPT-3.5), Vicuna, Orca
Proxy-KD	API outputs → aligned proxy → distributions	Soft distributions approximating closed teacher	Chen et al. 2024
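The difference between "full distributions" and "hard completions" in the table can be shown with a toy example. The sketch below uses the temperature-scaled softmax from Hinton et al. (arXiv:1503.02531) to produce a soft target and compares it with the one-hot target that a black-box API effectively exposes. The function names and the three-token vocabulary are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Soft target: higher temperature flattens the distribution,
    exposing the relative ranking of non-argmax tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target(logits):
    """Hard target: a one-hot vector for the single most likely token,
    which is all a text-only API response reveals."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]

teacher_logits = [2.0, 1.0, 0.0]  # toy logits over a 3-token vocabulary
soft_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)
soft_t4 = softmax_with_temperature(teacher_logits, temperature=4.0)
one_hot = hard_target(teacher_logits)
```

With white-box access the student can train against `soft_t1` or `soft_t4`. With API access alone it only ever sees something like `one_hot`, one sampled token at a time.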

Black-box teacher (GPT-4, Claude API)
        ↓  outputs + preference pairs
   Proxy model (white-box LLM)
        ↓  aligned soft distributions + sample weights
   Student model (smaller open LLM)
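The "preference pairs" arrow in the diagram refers to DPO-style alignment of the proxy. As a rough sketch, the standard DPO objective for one preference pair looks like the function below. The function and parameter names are hypothetical, the log-probabilities are toy numbers, and how Proxy-KD builds its pairs (for example, which response counts as "chosen") is an assumption not confirmed by this post.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy, here the proxy) or a frozen
    reference copy. beta controls how far the policy may drift from the
    reference.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Usage: the loss falls below log(2) once the proxy favors the chosen
# response more strongly than the reference model does.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

Minimizing this loss pushes the proxy to assign relatively more probability to the preferred response than the reference does, which is one way to pull a white-box model toward a teacher whose logits are hidden.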

Era	Method	Teacher	What transferred
2019	Pre-trained Distillation (Turc et al.)	Fine-tuned BERT-large	Task knowledge into mini-BERT after pre-training
2023	Alpaca	GPT-3.5 API	52K instruction examples → LLaMA-7B
2023	Vicuna / Orca	GPT-4 / ChatGPT API	Conversations + reasoning traces
2024	Proxy-KD	GPT-4 API (black-box)	Distributions via aligned proxy
2025	DeepSeek R1 distill	Open R1 traces	Reasoning behavior into 1.5B–70B students
2026	Alleged Qwen extraction	Claude API (Fable-era)	28.8M exchanges, agentic + coding (Anthropic claim)

Dimension	Proxy-KD (research)	Alleged Claude extraction (Anthropic)
Scale	Benchmark datasets	28.8 million exchanges
Access	API calls under research budget	~25,000 fraudulent accounts, evasion infrastructure
Goal	Publish method beating baselines	Train production rival models (Qwen)
Legal frame	Academic citation	ToS violation + lawmaker briefing
Technique	Proxy alignment + weighted KL	Unknown — may be simpler response copying at volume
Authors	Alibaba + university	Operators linked to Alibaba Qwen (Anthropic claim)

Paper	arXiv	Why read it
Proxy-KD	2401.07013	Black-box LLM distillation via aligned proxy + DPO
Pre-trained Distillation	1908.08962	Pre-train student first; compound gains (HN recommended)
Distilling the Knowledge in a Neural Network	1503.02531	Hinton original — dark knowledge, temperature softmax
DistilBERT	1910.01108	Production distillation milestone

Proxy-KD: How Black-Box LLM Distillation Works — and Why It Matters After the Fable 5 Extraction Scandal

TL;DR — questions people are asking

Black-box vs white-box distillation

How Proxy-KD works (arXiv:2401.07013)

Stage 1 — Proxy alignment

Stage 2 — Sample-level weighted distillation

Stage 3 — Student training

Reported results

The Alpaca lineage — and where Proxy-KD fits

Why HN resurfaced this now

Proxy-KD vs what Anthropic alleged

What defenders and rivals actually distill

Implications for developers and policy

If you build on closed APIs

If you build open models

If you follow the Fable ban

Key papers to read

Bottom line
