GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with a range of quantization options.
When to use GGUF
Use GGUF when:
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Needing CPU inference without a GPU
- Wanting flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality
Use alternatives instead:
- AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
- HQQ: Fast calibration-free quantization for HuggingFace
- bitsandbytes: Simple integration with transformers library
- TensorRT-LLM: Production NVIDIA deployment with maximum speed
Quick start
Installation
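# Clone the repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build: pick ONE variant (recent llama.cpp versions have moved to CMake; check the repository's build docs)
make                 # CPU only
make GGML_CUDA=1     # NVIDIA GPU (CUDA)
make GGML_METAL=1    # Apple Silicon (Metal)
# Optional: install the Python bindings
pip install llama-cpp-python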
Convert model to GGUF
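# Install the conversion script's dependencies (run from the llama.cpp directory)
pip install -r requirements.txt
# Convert a HuggingFace model to GGUF
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf
# Or set the output type explicitly (here FP16)
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16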
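A quick sanity check on the converted file can catch a failed or truncated conversion before quantizing. The sketch below parses the fixed-size start of a GGUF file, assuming GGUF version 2 or later in little-endian layout (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts). The function name `parse_gguf_header` is made up for this illustration and is not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def parse_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumes v2+, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic bytes")
    # "<" = little-endian without padding, I = uint32, Q = uint64
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

To use it on a real file, read the first 24 bytes, for example `parse_gguf_header(open("model-f16.gguf", "rb").read(24))`.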
Quantize model
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Or quantize with an importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
Run inference
# Single prompt
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"
# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive
# Offload 35 layers to the GPU
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
Quantization types
K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
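The sizes in the table can be roughly cross-checked with simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below is a back-of-the-envelope estimate only. The helper name `estimate_gguf_size_gb` and the round figure of 7 billion parameters are assumptions for illustration.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bits-per-weight values copied from the table above.
NOMINAL_BITS = {"Q2_K": 2.5, "Q4_K_M": 4.5, "Q8_0": 8.0}

for name, bits in NOMINAL_BITS.items():
    # Assumes exactly 7e9 parameters; real "7B" models differ slightly.
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")
```

The estimates come out somewhat below the table's figures (about 3.9 GB versus ~4.1 GB for Q4_K_M). A likely reason is that K-quant mixes keep some tensors at higher precision and the file also carries metadata, so treat the formula as a lower bound.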
Legacy methods
| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, per-block scale only |
| Q4_1 | 4-bit, per-block scale and minimum |
| Q5_0 | 5-bit, per-block scale only |
| Q5_1 | 5-bit, per-block scale and minimum |
Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
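One way to apply this recommendation is to pick the largest type that fits the available memory. The sketch below uses the approximate 7B sizes from the K-quant table. The function `pick_quant` and the 1 GB default headroom (reserved for context and runtime overhead) are assumptions for illustration, not measured values.

```python
# (type, approximate size in GB for a 7B model), smallest to largest,
# copied from the K-quant table.
K_QUANTS_7B = [
    ("Q2_K", 2.8), ("Q3_K_S", 3.0), ("Q3_K_M", 3.3),
    ("Q4_K_S", 3.8), ("Q4_K_M", 4.1), ("Q5_K_S", 4.6),
    ("Q5_K_M", 4.8), ("Q6_K", 5.5), ("Q8_0", 7.2),
]


def pick_quant(budget_gb, headroom_gb=1.0):
    """Return the largest type that fits in budget minus headroom, or None."""
    usable = budget_gb - headroom_gb  # headroom_gb is an assumed safety margin
    fitting = [name for name, size in K_QUANTS_7B if size <= usable]
    return fitting[-1] if fitting else None
```

For example, `pick_quant(8)` selects `Q6_K` and `pick_quant(6)` selects `Q5_K_M`, while a 3 GB budget returns `None` because nothing fits after the headroom is subtracted.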
Conversion workflows
Workflow 1: HuggingFace to GGUF
# 1. Download the model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16
# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
# 4. Test with a short generation (50 tokens)
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
Workflow 2: With importance matrix (better quality)
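# 1. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Create calibration text. This sample is only a placeholder:
#    real calibration data should be much larger and more diverse.
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
EOF
# 3. Generate the importance matrix.
#    -c sets the chunk (context) size in tokens; --chunk would instead skip that many initial chunks.
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    -c 512 \
    -o model.imatrix \
    -ngl 35
# 4. Quantize using the importance matrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M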
Workflow 3: Multiple quantizations
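#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
# Generate the importance matrix once and reuse it for every quantization
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    # Lowercase the type name portably (${QUANT,,} needs bash 4+, which stock macOS lacks)
    SUFFIX=$(echo "$QUANT" | tr '[:upper:]' '[:lower:]')
    OUTPUT="llama-3.1-8b-${SUFFIX}.gguf"
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done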
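The same batch can be driven from Python, which may be convenient when the quantization step is part of a larger pipeline. The sketch below only builds the argument lists, using the `llama-quantize` invocation shown above. The helper name `quantize_commands` and its `prefix` parameter are assumptions for illustration.

```python
def quantize_commands(model, imatrix, prefix, quant_types):
    """Build one llama-quantize argument list per quantization type."""
    commands = []
    for quant in quant_types:
        # Output name mirrors the shell script: <prefix>-<lowercased type>.gguf
        output = f"{prefix}-{quant.lower()}.gguf"
        commands.append(
            ["./llama-quantize", "--imatrix", imatrix, model, output, quant]
        )
    return commands


commands = quantize_commands(
    "llama-3.1-8b-f16.gguf",
    "llama-3.1-8b.imatrix",
    "llama-3.1-8b",
    ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
)
for cmd in commands:
    print(" ".join(cmd))  # dry run: print instead of executing
```

To actually run the commands, pass each list to `subprocess.run(cmd, check=True)` from the llama.cpp build directory.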
Python usage
llama-cpp-python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # context window in tokens
    n_gpu_layers=35,   # layers to offload to the GPU
    n_threads=8,       # CPU threads
)

# Generate a completion
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
Chat completion
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # must match the model's chat template
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
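For a multi-turn conversation, the assistant's reply has to be appended to the message list before the next call. The sketch below wraps that bookkeeping around `create_chat_completion`, relying on the response shape shown above. The helper name `chat_turn` is hypothetical, and it works with any object exposing a compatible `create_chat_completion` method.

```python
def chat_turn(llm, history, user_text, **gen_kwargs):
    """Send one user message, record the reply in history, and return it."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history, **gen_kwargs)
    reply = response["choices"][0]["message"]["content"]
    # Keep the reply so the next turn sees the full conversation.
    history.append({"role": "assistant", "content": reply})
    return reply
```

With the `llm` object from the example above, a call might look like `chat_turn(llm, messages, "Give me an example.", max_tokens=256)`. Long conversations will eventually exceed `n_ctx`, so older turns may need trimming.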
Streaming
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
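Streaming prints text as it arrives, but the complete response is often needed afterwards as well. The sketch below accumulates the pieces while echoing them, assuming each chunk has the `chunk["choices"][0]["text"]` shape used in the loop above. The helper name `collect_stream` is made up for this illustration.

```python
def collect_stream(chunks, echo=True):
    """Consume streamed completion chunks and return the concatenated text."""
    parts = []
    for chunk in chunks:
        text = chunk["choices"][0]["text"]
        if echo:
            print(text, end="", flush=True)  # show tokens as they arrive
        parts.append(text)
    return "".join(parts)
```

It accepts any iterable of chunks, so it can wrap the streaming call directly: `full_text = collect_stream(llm("Explain quantum computing:", max_tokens=256, stream=True))`.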
Server mode
Start OpenAI-compatible server
# --host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 for local-only access
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096
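Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8080` (matching the port above) and serves the OpenAI-style `/v1/chat/completions` route. The helper names `build_chat_request` and `chat` are made up for this illustration.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8080",
                       max_tokens=256, temperature=0.7):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

A call such as `chat([{"role": "user", "content": "What is Python?"}])` returns the reply as a string. Because the route follows the OpenAI request shape, existing OpenAI client libraries can usually be pointed at the same base URL instead.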