AWQ (Activation-aware Weight Quantization)
4-bit weight quantization that uses activation statistics to identify and protect salient weights, giving up to ~3x faster inference than FP16 with minimal accuracy loss.
When to use AWQ
Use AWQ when:
- Need 4-bit quantization with <5% accuracy loss
- Deploying instruction-tuned or chat models (AWQ is less prone than GPTQ to overfitting its calibration set)
- Want ~2.5-3x inference speedup over FP16
- Using vLLM for production serving
- Have Ampere or newer GPUs (A100, H100, RTX 40xx) for Marlin kernel support
Use GPTQ instead when:
- Need maximum ecosystem compatibility (more tools support GPTQ)
- Working with ExLlamaV2 backend specifically
- Have older GPUs without Marlin support
Use bitsandbytes instead when:
- Need zero calibration overhead (quantize on-the-fly)
- Want to fine-tune with QLoRA
- Prefer simpler integration
Quick start
Installation
pip install autoawq              # base install
pip install "autoawq[kernels]"   # optional: prebuilt kernels
pip install "autoawq[cpu]"       # optional: CPU support
Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+
Load pre-quantized model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True,  # fuse modules for faster inference
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quantize your own model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
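quant_config = {
    "zero_point": True,    # asymmetric quantization: store a zero point per group
    "q_group_size": 128,   # number of weights that share one scale and zero point
    "w_bit": 4,            # bits per quantized weight
    "version": "GEMM",     # kernel backend (see "Kernel backends")
}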
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")
Timing: ~10-15 min for 7B, ~1 hour for 70B models.
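To make the `quant_config` fields concrete, here is a minimal pure-Python sketch of asymmetric, group-wise round-to-nearest quantization, which is roughly what `w_bit`, `q_group_size`, and `zero_point` control. The helper names (`quantize_group`, `dequantize_group`, `quantize`) are hypothetical and not part of AutoAWQ; the library additionally applies activation-aware scaling and packs the integer codes, neither of which is shown here.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one weight group.

    Hypothetical helper for illustration; not AutoAWQ's implementation.
    Returns (integer codes, scale, zero point).
    """
    q_max = 2 ** w_bit - 1                        # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    scale = span / q_max if span > 0 else 1.0     # guard against constant groups
    # The zero point is the integer code that represents 0.0; clamp it so it
    # fits in the same w_bit range as the weights.
    zero = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero) * scale for q in codes]


def quantize(weights, q_group_size=128, w_bit=4):
    """Split a flat weight list into groups and quantize each one separately."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]


weights = [-0.8, -0.3, 0.0, 0.1, 0.4, 0.7]
codes, scale, zero = quantize_group(weights)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

A smaller `q_group_size` gives each group its own tighter range and so lowers the rounding error, at the cost of storing more scales and zero points.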
AWQ vs GPTQ vs bitsandbytes
| Feature | AWQ | GPTQ | bitsandbytes |
|---|---|---|---|
| Speedup (4-bit) | ~2.5-3x | ~2x | ~1.5x |
| Accuracy loss | <5% | ~5-10% | ~5-15% |
| Calibration | Minimal (128-1K tokens) | More extensive | None |
| Overfitting risk | Low | Higher | N/A |
| Best for | Production inference | GPU inference | Easy integration |
| vLLM support | Native | Yes | Limited |
Key insight: not all weights matter equally. AWQ uses activation magnitudes to find the ~1% of weight channels that matter most and scales them up before quantization instead of keeping them in FP16, which reduces quantization error without mixed-precision overhead.
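The idea can be sketched in plain Python under some simplifying assumptions: salient input channels are taken to be those with the largest mean absolute activation, and a fixed scale of 2.0 stands in for the per-channel scales that AWQ searches for on calibration data. All function names here are hypothetical. The sketch only shows the mechanics, namely that multiplying a weight column by `s` and dividing the matching input by `s` leaves the layer output unchanged, so the salient column can occupy a larger share of the quantization range.

```python
def mean_abs_activation(activations):
    """Average |activation| per input channel over all calibration rows."""
    num_rows = len(activations)
    num_channels = len(activations[0])
    return [sum(abs(row[c]) for row in activations) / num_rows
            for c in range(num_channels)]


def salient_channels(activations, fraction=0.01):
    """Indices of the input channels with the largest average activation."""
    magnitudes = mean_abs_activation(activations)
    keep = max(1, round(len(magnitudes) * fraction))
    ranked = sorted(range(len(magnitudes)), key=lambda c: magnitudes[c], reverse=True)
    return sorted(ranked[:keep])


def apply_channel_scales(weight, x, scales):
    """Scale weight columns up and the matching inputs down by the same factor."""
    scaled_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
    scaled_x = [v / s for v, s in zip(x, scales)]
    return scaled_weight, scaled_x


def matvec(weight, x):
    """Multiply a [out][in] weight matrix by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Toy calibration activations: channel 2 is consistently large.
activations = [[0.1, -0.2, 9.0, 0.3], [0.2, 0.1, -8.5, -0.1], [-0.1, 0.3, 10.0, 0.2]]
weight = [[0.5, -0.4, 0.3, 0.1], [-0.2, 0.6, -0.7, 0.9]]   # shape [out][in]

salient = salient_channels(activations, fraction=0.25)
scales = [2.0 if c in salient else 1.0 for c in range(4)]   # assumed scale, not searched
scaled_weight, scaled_x = apply_channel_scales(weight, activations[0], scales)
```

In a real model the division of the inputs is typically folded into the preceding layer, so no extra work is done at inference time.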
Kernel backends
GEMM (default, batch inference)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}
GEMV (single-token generation)
quant_config = {
    "version": "GEMV",
}
Limitation: GEMV supports batch size 1 only and performs poorly with long contexts.
Marlin (Ampere+ GPUs)
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
    bits=4,
    version="marlin",
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config,
)
Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)
ExLlamaV2 (AMD compatible)
from transformers import AwqConfig
config = AwqConfig(
    bits=4,
    version="exllama",
)
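The hardware requirements stated in this section can be condensed into a small selection helper. This is only a sketch: `pick_awq_version` and its `short_context` flag are hypothetical, the thresholds are the ones quoted in this guide rather than values read from any library, and the ExLlamaV2 backend is left out.

```python
def pick_awq_version(compute_capability, batch_size=1, short_context=False):
    """Suggest a kernel version from the requirements listed in this guide.

    Hypothetical helper. `compute_capability` is a (major, minor) tuple.
    """
    if compute_capability >= (8, 0):
        return "marlin"      # Compute Capability 8.0+ (Ampere or newer)
    if compute_capability >= (7, 5):
        # GEMV is limited to batch size 1 and short contexts; GEMM is the default.
        if batch_size == 1 and short_context:
            return "GEMV"
        return "GEMM"
    raise ValueError("This guide lists Compute Capability 7.5+ as the minimum")
```

On a machine with PyTorch and a CUDA device, the tuple can be obtained from `torch.cuda.get_device_capability()`.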
HuggingFace Transformers integration
Direct loading
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")
Fused modules (recommended)
from transformers import AwqConfig, AutoModelForCausalLM
config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # must cover prompt length plus generated tokens
    do_fuse=True,          # enable fused modules
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config,
)
Note: Fused modules cannot be combined with FlashAttention-2.
vLLM integration
from vllm import LLM, SamplingParams
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",
)
sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)
print(outputs[0].outputs[0].text)  # text of the first completion
Performance benchmarks
Memory reduction
| Model | FP16 | AWQ 4-bit | Reduction |
|---|---|---|---|
| Mistral 7B | 14 GB | 5.5 GB | 2.5x |
| Llama 2-13B | 26 GB | 10 GB | 2.6x |
| Llama 2-70B | 140 GB | 35 GB | 4x |
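A back-of-the-envelope estimate helps put these figures in context. The sketch below computes weight storage only, assuming one 16-bit scale and one 4-bit zero point per group of 128 weights; those storage sizes and the function names are assumptions for illustration. The result for a 7B model comes out below the figure in the table, presumably because measured memory also includes layers left unquantized and runtime buffers.

```python
def bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Effective storage per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


def weight_memory_gb(num_params, bits):
    """Weight storage in GB (10^9 bytes) for a given number of bits per weight."""
    return num_params * bits / 8 / 1e9


fp16_gb = weight_memory_gb(7e9, 16)                 # 14.0 GB
awq_gb = weight_memory_gb(7e9, bits_per_weight())   # about 3.64 GB
```

The per-group metadata adds only about 0.16 bits per weight at group size 128, so the theoretical reduction for the quantized layers alone is close to 4x.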
Inference speed (RTX 4090)
| Model | Prefill (tok/s) | Decode (tok/s) | Memory |
|---|---|---|---|
| Mistral 7B GEMM | 3,897 | 114 | 5.55 GB |
| TinyLlama 1B GEMV | 5,179 | 431 | 2.10 GB |
| Llama 2-13B GEMM | 2,279 | 74 | 10.28 GB |
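Throughput numbers translate into request latency with simple arithmetic: prompt tokens are processed at the prefill rate and generated tokens at the decode rate. The sketch below applies that to two rows of the table, assuming a 512-token prompt and 200 generated tokens at batch size 1; the function name and the workload are illustrative choices, and scheduling or tokenization overheads are ignored.

```python
def estimate_latency_s(prompt_tokens, new_tokens, prefill_tps, decode_tps):
    """Rough single-request latency in seconds from prefill and decode throughput."""
    return prompt_tokens / prefill_tps + new_tokens / decode_tps


mistral_7b = estimate_latency_s(512, 200, prefill_tps=3897, decode_tps=114)  # about 1.9 s
llama_13b = estimate_latency_s(512, 200, prefill_tps=2279, decode_tps=74)    # about 2.9 s
```

For chat-style workloads the decode term dominates, so decode tok/s is usually the column to compare.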
Accuracy (perplexity)
| Model | FP16 | AWQ 4-bit | Degradation |
|---|---|---|---|
| Llama 3 8B | 8.20 | 8.48 | +3.4% |
| Mistral 7B | 5.25 | 5.42 | +3.2% |
| Qwen2 72B | 4.85 | 4.95 | +2.1% |
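For reference, perplexity is the exponential of the mean negative log-likelihood per token, and the degradation column is the relative increase over FP16. The short sketch below recomputes that column from the table values; the function names are illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def degradation_pct(fp16_ppl, quant_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (quant_ppl - fp16_ppl) / fp16_ppl * 100


rows = {
    "Llama 3 8B": (8.20, 8.48),
    "Mistral 7B": (5.25, 5.42),
    "Qwen2 72B": (4.85, 4.95),
}
degradation = {
    name: round(degradation_pct(fp16, awq), 1) for name, (fp16, awq) in rows.items()
}
```

Lower perplexity is better, so a positive percentage means the quantized model is slightly worse than FP16.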
Custom calibration data
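model.quantize(  # Option 1: named dataset with explicit calibration limits
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",
    max_calib_samples=256,   # number of calibration samples
    max_calib_seq_len=512,   # maximum tokens per sample
)
calib_samples = [  # Option 2: your own list of text samples
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)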
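When building a custom sample list from raw domain text, some light cleaning tends to help. The helper below is a hypothetical sketch, not part of AutoAWQ: it normalizes whitespace, drops very short and duplicate entries, and caps the list length. The `min_chars` threshold and the whitespace normalization are assumptions that may not suit every domain, for example source code where line breaks carry meaning.

```python
def prepare_calib_samples(texts, max_samples=256, min_chars=32):
    """Clean a list of raw texts for use as calibration samples.

    Hypothetical helper: collapses whitespace, skips short or duplicate
    entries, and returns at most `max_samples` strings.
    """
    seen = set()
    samples = []
    for text in texts:
        if len(samples) >= max_samples:
            break
        cleaned = " ".join(text.split())   # collapse runs of whitespace
        if len(cleaned) < min_chars or cleaned in seen:
            continue
        seen.add(cleaned)
        samples.append(cleaned)
    return samples
```

The returned list can be passed as `calib_data` in the same way as `calib_samples` in the snippet above.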
Multi-GPU deployment
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",
    max_memory={0: "40GB", 1: "40GB"},
)
Supported models
35+ architectures including:
- Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
- Qwen: Qwen, Qwen2, Qwen2.5-VL
- Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
- Multimodal: LLaVA, LLaVA-Next, Qwen2-VL
Common issues
CUDA OOM during quantization:
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=6