Model Pruning: Compressing LLMs
When to Use This Skill
Use Model Pruning when you need to:
- Reduce model size by 40-60% with <1% accuracy loss
- Accelerate inference using hardware-friendly sparsity (2-4× speedup)
- Deploy on constrained hardware (mobile, edge devices)
- Compress without retraining using one-shot methods
- Enable efficient serving with reduced memory footprint
Key Techniques: Wanda (weights × activations), SparseGPT (second-order), structured pruning, N:M sparsity
Papers: Wanda ICLR 2024 (arXiv 2306.11695), SparseGPT (arXiv 2301.00774)
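The Wanda score for a weight W_ij is |W_ij| × ||X_j||_2: its magnitude scaled by the L2 norm of its input feature across calibration tokens, with weights ranked within each output row. A minimal sketch of the metric on toy tensors (shapes and names are illustrative, not from the papers):
import torch

# Toy shapes: a linear layer with 4 inputs, 3 outputs, and 8 calibration tokens
W = torch.randn(3, 4)   # (out_features, in_features)
X = torch.randn(8, 4)   # (tokens, in_features) calibration activations

# Wanda importance: |W_ij| * ||X_j||_2 (per-input-feature activation norm)
score = W.abs() * X.norm(p=2, dim=0)   # broadcasts to (3, 4)

# Prune 50% per output row: weights compete only within their own row
k = W.shape[1] // 2
_, drop = torch.topk(score, k, dim=1, largest=False)
mask = torch.ones_like(W, dtype=torch.bool).scatter_(1, drop, False)
W_pruned = W * mask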
Installation
# Wanda implementation
git clone https://github.com/locuslab/wanda
cd wanda
pip install -r requirements.txt
cd ..
# Optional: SparseGPT
git clone https://github.com/IST-DASLab/sparsegpt
cd sparsegpt
pip install -e .
# Dependencies
pip install torch transformers accelerate
Quick Start
Wanda Pruning (One-Shot, No Retraining)
Source: ICLR 2024 (arXiv 2306.11695)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Calibration data (small dataset for activation statistics)
calib_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming the world.",
    "Artificial intelligence powers modern applications.",
]
# Wanda pruning function
def wanda_prune(model, tokenizer, calib_data, sparsity=0.5):
    """
    Wanda: prune by |weight| × input activation norm.

    Args:
        sparsity: fraction of weights to prune (0.5 = 50%)
    """
    # 1. Collect activation statistics, accumulated over all calibration samples
    activations = {}

    def hook_fn(name):
        def hook(module, inputs, output):
            # Accumulate squared activations per input feature
            # (yields the ||X_j||_2 norm used by the Wanda score)
            x = inputs[0].detach().float()
            sq = x.pow(2).reshape(-1, x.shape[-1]).sum(dim=0)
            activations[name] = activations.get(name, 0) + sq
        return hook

    # Register hooks for all linear layers
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    # Run calibration data
    model.eval()
    with torch.no_grad():
        for text in calib_data:
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            model(**inputs)

    # Remove hooks
    for hook in hooks:
        hook.remove()

    # 2. Prune weights based on |weight| × activation norm
    # (the paper ranks weights within each output row; a per-layer
    # threshold is used here for simplicity)
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in activations:
            W = module.weight.data
            act = activations[name].sqrt().to(W.device)   # (in_features,)
            # Importance: |weight| × input activation norm
            importance = W.abs().float() * act.unsqueeze(0)
            # Threshold at the sparsity quantile; kthvalue avoids
            # torch.quantile's element-count limit on large layers
            k = max(1, int(sparsity * importance.numel()))
            threshold = torch.kthvalue(importance.flatten(), k).values
            # Create mask and prune
            mask = importance >= threshold
            W *= mask.to(W.dtype)
    return model

# Apply Wanda pruning (50% sparsity, one-shot, no retraining)
pruned_model = wanda_prune(model, tokenizer, calib_data, sparsity=0.5)

# Save
pruned_model.save_pretrained("./llama-2-7b-wanda-50")
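A quick sanity check (plain PyTorch, not part of the Wanda repo) that the target sparsity was actually reached:
def report_sparsity(model):
    """Print the fraction of zeroed weights across all Linear layers."""
    total, zeros = 0, 0
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            total += w.numel()
            zeros += (w == 0).sum().item()
    print(f"overall sparsity: {zeros / total:.2%}")

report_sparsity(pruned_model)  # expect ~50% after sparsity=0.5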
SparseGPT (Second-Order Pruning)
Source: arXiv 2301.00774
# Illustrative wrapper around the IST-DASLab/sparsegpt code; the repo's
# actual API prunes layer by layer via SparseGPT.add_batch() and
# SparseGPT.fasterprune() inside its model scripts.
from sparsegpt import SparseGPT

# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Initialize SparseGPT
pruner = SparseGPT(model)

# Calibration data (placeholder loader, ~128 samples)
calib_data = load_calibration_data()

# Prune (one-shot, layer-wise weight reconstruction)
pruned_model = pruner.prune(
    calib_data=calib_data,
    sparsity=0.5,    # 50% sparsity
    prunen=0,        # unstructured (0) or N:M structured
    prunem=0,
    percdamp=0.01,   # damping for the Hessian inverse
)

# Result: near-lossless pruning at 50% sparsity
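The same prunen/prunem parameters express N:M sparsity; per the repo's fasterprune() signature, prunen=2 and prunem=4 request 2:4 structured sparsity. A sketch against the illustrative wrapper above:
# 2:4 semi-structured sparsity (keep 2 of every 4 weights), runnable on
# NVIDIA sparse tensor cores
pruned_24 = pruner.prune(
    calib_data=calib_data,
    sparsity=0.5,
    prunen=2,
    prunem=4,
    percdamp=0.01,
)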
N:M Structured Pruning (Hardware-Friendly)
import torch.nn.functional as F

def nm_prune(weight, n=2, m=4):
    """
    N:M pruning: keep the N largest-magnitude weights in every group of
    M consecutive weights. Example: 2:4 keeps 2 out of every 4 weights.
    Compatible with NVIDIA sparse tensor cores (2:4, 4:8).
    """
    # Reshape weight into groups of M
    shape = weight.shape
    weight_flat = weight.flatten()
    # Pad to a multiple of M
    pad_size = (m - weight_flat.numel() % m) % m
    weight_padded = F.pad(weight_flat, (0, pad_size))
    # Reshape into (num_groups, m)
    weight_grouped = weight_padded.reshape(-1, m)
    # Find the top-N magnitudes in each group
    _, indices = torch.topk(weight_grouped.abs(), n, dim=-1)
    # Create mask
    mask = torch.zeros_like(weight_grouped)
    mask.scatter_(1, indices, 1.0)
    # Apply mask
    weight_pruned = weight_grouped * mask
    # Drop padding and restore the original shape
    weight_pruned = weight_pruned.flatten()[:weight_flat.numel()]
    return weight_pruned.reshape(shape)

# Apply 2:4 sparsity (NVIDIA hardware) to every linear layer
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.weight.data = nm_prune(module.weight.data, n=2, m=4)
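Masking alone does not speed anything up; for actual tensor-core acceleration the 2:4 weights must be stored in a compressed format. A minimal sketch, assuming PyTorch ≥ 2.1 with CUDA and fp16 (uses torch.sparse.to_sparse_semi_structured; the standalone layer and its dimensions are illustrative):
from torch.sparse import to_sparse_semi_structured

# Illustrative standalone layer; fp16 dims should be multiples of 64
linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()
# Enforce a 2:4 pattern first (nm_prune from above), then compress
linear.weight = torch.nn.Parameter(
    to_sparse_semi_structured(nm_prune(linear.weight.data, n=2, m=4))
)
x = torch.rand(8, 4096).half().cuda()
y = linear(x)  # matmul dispatches to 2:4 sparse tensor-core kernels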