Proxy-KD: How Black-Box LLM Distillation Works, and Why It Matters After the Fable 5 Extraction Scandal
Alibaba-linked researchers published Proxy-KD (arXiv:2401.07013) to distill GPT-4-class models without weight access. After Anthropic accused Qwen-linked operators of running 28.8M Claude exchanges, HN resurfaced the paper. Full breakdown.
A January 2024 research paper is back in circulation on Hacker News, not because it is new, but because it describes exactly the capability-extraction playbook at the center of Anthropic's Fable 5 distillation scandal.
Given that Anthropic accused Alibaba-linked operators of running ~25,000 fraudulent accounts and 28.8 million Claude exchanges to distill frontier capabilities into Qwen, the HN thread's subtext is obvious. One commenter put it plainly: "My best guess is this is a reference to the recent accusations from Anthropic of Chinese labs distilling on their models."
This post explains what Proxy-KD actually does, how it differs from the crude extraction Anthropic described, and why the Fable 5 export ban sits downstream of the same economic force.
TL;DR: questions people are asking
| Question | Answer |
| --- | --- |
| What problem does Proxy-KD solve? | Closed models (GPT-4, Claude) are API-only: you get text, not logits. White-box distillation needs probability distributions. Proxy-KD bridges the gap with an intermediate open model. |
| How does it work? | (1) Collect black-box teacher outputs. (2) Align a white-box proxy to the teacher via SFT + DPO preference optimization. (3) Proxy generates soft distributions. (4) Student learns from weighted KL loss + hard labels. |
| Did Alibaba invent industrial Claude distillation? | The paper is legitimate research. Anthropic's allegation is about 25,000 fake accounts and 28.8M API calls: scale and ToS violation, not the existence of the technique. |
| Why does this connect to Fable 5? | Anthropic's June 10 Senate letter warned distillation could reach Mythos Preview-level capability. The June 12 export ban followed 48 hours later. |
| Is Proxy-KD what Qwen used? | Unknown publicly. Proxy-KD is one published method; alleged operators may have used simpler response-only distillation at industrial scale. The paper shows Alibaba researchers studied the problem formally. |
| Related classic paper? | Turc et al. 2019, "Well-Read Students Learn Better": pre-train the student first, then distill; the effects compound. |
Black-box vs white-box distillation
If you read explainx.ai's distillation primer, you already know the Hinton framing: students learn more from a teacher's soft probability distribution than from hard labels alone, the signal Geoffrey Hinton called "dark knowledge."
The problem in 2024-2026: the best teachers are closed.
| Type | Teacher access | What the student learns | Examples |
| --- | --- | --- | --- |
| White-box KD | Weights + logits + hidden states | Full distributions, features, attention | DistilBERT, self-distillation within open models |
| Black-box KD | API outputs only | Hard completions, rationales | Alpaca (GPT-3.5), Vicuna, Orca |
| Proxy-KD | API outputs → aligned proxy → distributions | Soft distributions approximating closed teacher | Chen et al. 2024 |
White-box distillation is more sample-efficient per example. Black-box distillation is all most labs can do with GPT-4 or Claude, unless they run 28.8 million exchanges and treat volume as a substitute for logits.
Proxy-KD tries to get white-box efficiency from black-box access.
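To make the gap concrete, here is a minimal sketch of what a student sees under each regime. The three-word vocabulary, the logits, and the temperature are invented for illustration; a real teacher emits a distribution over tens of thousands of tokens.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the result."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: a rough measure of signal per target."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits from a white-box teacher.
vocab = ["cat", "dog", "car"]
teacher_logits = [4.0, 3.0, 0.5]

# White-box KD: the student sees the full (temperature-softened) distribution.
soft_target = softmax(teacher_logits, temperature=2.0)

# Black-box KD: the API returns only the chosen token, i.e. a one-hot label.
hard_index = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
hard_target = [1.0 if i == hard_index else 0.0 for i in range(len(vocab))]

print("soft:", [round(p, 3) for p in soft_target], "bits:", round(entropy_bits(soft_target), 3))
print("hard:", hard_target, "bits:", entropy_bits(hard_target))
```

The soft target still says that "dog" is a far more plausible alternative to "cat" than "car" is. The one-hot label discards that ranking, which is the information a black-box student has to recover through sheer volume.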
How Proxy-KD works (arXiv:2401.07013)
The paper's architecture has three stages:
Black-box teacher (GPT-4, Claude API)
↓ outputs + preference pairs
Proxy model (white-box LLM)
↓ aligned soft distributions + sample weights
Student model (smaller open LLM)
Stage 1: Proxy alignment
The proxy is an open-weight LLM (white-box). It is trained on:
Hard labels β text outputs from the black-box teacher
DPO preference optimization β chosen/rejected pairs where the teacher's output is the "winning" response
The goal: make the proxy's behavior and token-level distributions track the closed teacher as closely as possible without ever seeing the teacher's weights.
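As a rough illustration of the preference step, the sketch below implements the standard DPO loss for a single preference pair. It is not taken from the paper: the function names, the `beta` value, and the idea that the rejected response is a weaker non-teacher output are assumptions made for the example. Inputs are summed log-probabilities of each full response under the proxy being trained and under a frozen reference copy of it.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    chosen   = the black-box teacher's response (the "winning" side)
    rejected = a weaker response (assumption for this sketch)
    beta     = hypothetical strength of the pull toward the reference model
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    return -log_sigmoid(margin)

# Before any training the proxy equals its reference, so the margin is zero.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After some training the proxy assigns more mass to the teacher's response.
later = dpo_loss(-9.0, -17.0, -12.0, -15.0)
print(round(start, 4), round(later, 4))
```

The loss falls as the proxy raises the likelihood of the teacher's response relative to the rejected one, which is the sense in which the teacher's output "wins" each pair.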
Stage 2: Sample-level weighted distillation
Not all proxy outputs align equally well with the teacher. Proxy-KD assigns a per-sample weight reflecting alignment quality. The student concentrates learning on well-aligned distributions and down-weights proxy outputs that drift from the teacher.
Loss combines:
KL divergence between student and proxy distributions (weighted)
NLL loss on teacher hard labels (standard supervised fine-tuning)
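The two terms above can be sketched for a single token position as follows. This is a simplified reading, not the paper's exact objective: the KL direction (proxy to student), the mixing coefficient `alpha`, and the way `weight` is supplied are all assumptions for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over one token position; p is the proxy, q the student."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def nll(student_probs, label_index, eps=1e-12):
    """Negative log-likelihood of the teacher's hard label under the student."""
    return -math.log(student_probs[label_index] + eps)

def weighted_distill_loss(student_probs, proxy_probs, teacher_label,
                          weight, alpha=0.5):
    """Weighted KL to the proxy plus NLL on the teacher's token.

    weight: per-sample alignment weight in [0, 1] (hypothetical input here)
    alpha:  hypothetical mix between the two terms
    """
    soft_term = weight * kl_divergence(proxy_probs, student_probs)
    hard_term = nll(student_probs, teacher_label)
    return alpha * soft_term + (1 - alpha) * hard_term

student = [0.7, 0.2, 0.1]
proxy = [0.6, 0.3, 0.1]
trusted = weighted_distill_loss(student, proxy, teacher_label=0, weight=1.0)
distrusted = weighted_distill_loss(student, proxy, teacher_label=0, weight=0.0)
print(round(trusted, 4), round(distrusted, 4))
```

With `weight=0.0` the loss reduces to supervised fine-tuning on the teacher's text, so a badly aligned proxy sample contributes nothing beyond the hard label.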
Stage 3: Student training
The student trains as if it had white-box access β learning from dense distributions the proxy synthesizes, not just final text strings.
Reported results
Chen et al. claim Proxy-KD outperforms both naive black-box KD (Alpaca-style output copying) and traditional white-box KD where the white-box teacher is smaller than the black-box target. Key finding: proxy alignment quality dominates; a poorly aligned proxy hurts distillation more than a model-size gap does.
Turc et al.'s insight, cited on HN as essential companion reading, still holds: pre-train the student, then distill. Pre-training and distillation have a compound effect even on the same data. Modern Chinese labs apply both: strong base models (Qwen, GLM) plus frontier teacher outputs.
Fable 5 + Mythos 5 export ban (June 12): global suspension citing national security
Zhipu matching Mythos on security benchmarks (June 28): capability gap closing outside US export control
The thread's tension mirrors the industry debate:
Technical view: Black-box distillation is published science. Proxy-KD is a better version of what Alpaca did. Scale + pre-training + efficient architectures (DeepSeek economics) explain Chinese catch-up, not magic.
Policy view: Industrial extraction via fake account farms violates ToS, evades billing, and accelerates rivals in capability areas Washington treats as export-controlled (Fable status Day 18).
Market view: HN commenters debated whether developers should route spend to cheaper Chinese APIs to compress US AI margins, the same China playbook argument in research form.
Anthropic's framing in the Senate letter: distillation at this scale is not academic reproduction. It is "the largest known distillation attack," targeting agentic reasoning, software engineering, and long-horizon tasks, the same capability classes Fable and Mythos represent.
Proxy-KD vs what Anthropic alleged
An important distinction: conflating them leads to bad policy and bad engineering decisions.
Unknown; may be simpler response copying at volume
Authors
Alibaba + university
Operators linked to Alibaba Qwen (Anthropic claim)
Proxy-KD explains why black-box distillation is hard and how to do it better. The scandal is about industrializing extraction against a closed frontier API, and doing it while Fable and Mythos were still live, before Anthropic's bot-detection and July 8 ID verification layer.
The Mythos detection irony: export controls target foreign-national access to cyber-capable models, while 25,000 bot accounts reportedly ran millions of exchanges undetected pre-ban.
What defenders and rivals actually distill
Not all distillation is equal. Production teams tier by what they extract:
Tier 1: Instruction following (cheap, common)
Alpaca-style: collect (prompt, completion) pairs. Transfers chat behavior, not frontier reasoning. Legal risk: high on closed APIs.
Tier 2: Chain-of-thought and tool traces (medium cost)
Orca / DeepSeek R1 pattern: collect reasoning steps, tool calls, multi-turn trajectories. Transfers how the model thinks. This is what Anthropic's letter emphasizes for agentic and coding workloads.
Tier 3: Proxy-KD-style distribution matching (research-grade)
Requires proxy alignment infrastructure. Better capability retention per sample, but still needs diverse, high-quality teacher queries.
Tier 4: White-box self-distillation (legal on open weights)
DeepSeek, Meta, Mistral distill within their own open models. Unambiguously permitted under permissive licenses.
The alleged Alibaba campaign, per Anthropic, targeted Tier 2+ at Tier 1 scale: millions of traces on Fable-class models.
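The difference between a Tier 1 and a Tier 2 training record is easiest to see as data. The sketch below is illustrative only: the class names, field names, and step kinds are hypothetical and do not correspond to any specific published dataset schema.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class InstructionPair:
    """Tier 1 record: only the prompt and the final text."""
    prompt: str
    completion: str

@dataclass
class TraceStep:
    """One step of a Tier 2 trajectory."""
    kind: str      # hypothetical kinds: "thought", "tool_call", "tool_result", "answer"
    content: str

@dataclass
class ReasoningTrace:
    """Tier 2 record: the whole trajectory, not just the last message."""
    prompt: str
    steps: list = field(default_factory=list)

    def final_answer(self):
        # Walk backwards so the most recent answer wins.
        for step in reversed(self.steps):
            if step.kind == "answer":
                return step.content
        return None

def to_jsonl_line(record):
    """Serialize a record as one JSON line, a common fine-tuning format."""
    return json.dumps(asdict(record), ensure_ascii=False)

pair = InstructionPair(prompt="What is 17 * 3?", completion="51")
trace = ReasoningTrace(
    prompt="What is 17 * 3?",
    steps=[
        TraceStep("thought", "Split 17 into 10 + 7, multiply each by 3."),
        TraceStep("tool_call", "calculator(17 * 3)"),
        TraceStep("tool_result", "51"),
        TraceStep("answer", "51"),
    ],
)
print(to_jsonl_line(pair))
print(to_jsonl_line(trace))
```

Both records end in the same answer, but only the second carries the intermediate steps that transfer how the teacher arrived at it.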
Implications for developers and policy
If you build on closed APIs
Assume your outputs can and will be distilled if they are valuable. Terms of service are enforcement, not physics. Proxy-KD shows the technique improves every year.
Practical hedges:
Rate limiting and anomaly detection on account farms (Anthropic's July 8 KYC push)
Open-weight release of previous generations to commoditize the teacher role (Meta Llama pattern)
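As a minimal sketch of the first hedge, the limiter below counts requests in a sliding time window. Everything here is hypothetical, including the class name and the thresholds, and it says nothing about how Anthropic's own detection works. One design assumption is worth stating: account farms spread load across many accounts, so the example keys the limiter on a shared signal (an invented payment fingerprint) rather than on the account ID.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per key within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of allowed requests

    def allow(self, key, now):
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_requests:
            return False
        events.append(now)
        return True

# Three accounts share one (hypothetical) payment fingerprint.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0)
requests = [
    ("acct-1", "card-fp-A", 0.0),
    ("acct-2", "card-fp-A", 1.0),
    ("acct-3", "card-fp-A", 2.0),   # third request on the same fingerprint
    ("acct-9", "card-fp-B", 2.0),   # unrelated customer
]
decisions = [limiter.allow(fingerprint, ts) for _, fingerprint, ts in requests]
print(decisions)
```

Per-account limits alone would pass all four requests. Grouping by a shared signal is what makes the third one stand out, though a determined operator can rotate that signal too, which is why this is a hedge rather than a fix.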
If you build open models
Proxy-KD is a blueprint for legal distillation from API teachers you pay for at research scale. The compound recipe from Turc et al. still applies:
Pre-train a capable student base (Qwen, GLM, Llama)
Collect teacher outputs (legally, at scale you can afford)
Align a proxy if you need distribution-level transfer
Distill with sample weighting
Evaluate on your tasks, not leaderboard marketing
If you follow the Fable ban
Distillation and export control are the same war viewed from different angles:
Anthropic wants to stop capability extraction via fake accounts
Commerce wants to stop capability export via API access to foreign nationals
Chinese labs want frontier capability without $285B US training spend (Stanford AI Index)
Proxy-KD is the research-side documentation of item three. The Senate Banking letter is the policy-side response to item one. The June 12 directive is the response to item two.
Proxy-KD is not the scandal; it is the instruction manual.
A 2024 paper from Alibaba-affiliated researchers described how to distill closed frontier models more efficiently than Alpaca ever could, using a proxy to approximate the logit-level signal that black-box APIs hide. Hacker News resurfaced it in June 2026 because Anthropic accused operators linked to the same ecosystem of running industrial-scale Claude extraction two days before Fable 5 went offline globally.
The technique is real, published, and improving. The allegation is about scale, fraud, and ToS, not about whether distillation works. It works. That is exactly why export controls, ID verification, and open-weight alternatives are all happening at once.
If you are choosing models today: understand that every closed API call is potentially a training example for someone else's student, unless you self-host or route through tiers you control.
Paper details and benchmark claims reflect arXiv:2401.07013 v2 (November 9, 2024). Anthropic allegations reflect June 2026 Senate Banking letter reporting. Verify live policy and API terms before production decisions.