What Is llama.cpp? Install, Run GGUF Models, and Serve OpenAI-Compatible APIs
llama.cpp is the C/C++ inference engine behind most local LLM stacks โ GGUF quants, Metal/CUDA/Vulkan/CPU, llama-server on :8080/v1. Install, run models, key flags, vs Ollama, and wire to OpenCode.
July 2, 2026: Every week another model ships as GGUF on Hugging Face, another HN thread argues Mac vs Nvidia for local LLMs, and another developer asks "do I need Ollama or something else?" The answer underneath most stacks is the same engine: llama.cpp.
Georgi Gerganov open-sourced it in March 2023 to run Meta's Llama weights on a MacBook CPU. Today the ggml-org/llama.cpp repo carries 118k+ GitHub stars, 450+ contributors, and a release cadence that shipped build b9829 on June 28, 2026. Ollama, LM Studio, and dozens of embedders sit on top of it โ but when you want MTP speculative decoding, exact -ngl offload, router mode, or embedding endpoints, you run llama.cpp directly.
This post is explainx.ai's foundation guide: what llama.cpp is, how GGUF fits, install paths, llama-cli vs llama-server, copy-paste run commands, and how to hand the API to OpenCode or Codex OSS.
TL;DR โ what people search after "what is llama.cpp"
Question
Answer
Do I need a GPU?
No โ CPU works (slow). Metal on Apple Silicon, CUDA on Nvidia, Vulkan/ROCm elsewhere for speed.
llama.cpp is not a model and not a chat app. It is an inference runtime โ load weights, manage KV cache, sample tokens, optionally expose HTTP.
GGUF is the file format llama.cpp natively consumes. One file holds architecture metadata plus Q4/Q5/Q8 (or F16) tensors. That is why a 27B model that would need hundreds of GB in FP32 can run in ~18โ48GB RAM depending on quant โ the math in our quantization guide applies directly here.
Historical note: The project predates the current explosion of Chinese open weights (Qwen, GLM, DeepSeek). llama.cpp's role stayed constant: make the file on disk talk.
Pure Mac optimization; some models faster than llama.cpp on M-series
vLLM
Different codebase (PagedAttention)
Team server, many concurrent users on datacenter GPU
explainx.ai read: Start with building a personal local AI system for the full layer cake. Drop to raw llama.cpp when Ollama's defaults leave performance on the table โ especially on Apple Silicon with MTP (Qwen 3.6 benchmarks).
Homebrew ships recent builds with Metal enabled on Apple Silicon.
Linux โ prebuilt release
Download the matching ggml-org/llama.cpp release asset for your OS/CUDA version from GitHub Releases, extract, add to PATH.
Build from source (CUDA / ROCm / custom)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # or -DGGML_METAL=ON on Mac
cmake --build build --config Release -j
# binaries in build/bin/
Use source builds when you need a specific CUDA arch, ROCm on AMD, or bleeding-edge server features before packagers catch up.
Models โ where GGUF files live
Hugging Face โ search GGUF, publishers like unsloth, bartowski, MaziyarPanahi
2026 server features (see server README): parallel decoding (-np), OpenAI + Anthropic-compatible chat routes, function calling, speculative decoding, multimodal input (experimental), embeddings and reranking endpoints, router mode for multiple models with LRU eviction, built-in MCP hooks in the web UI.
Essential flags (the ones that matter)
Flag
Meaning
Typical value
-m path.gguf
Model file
Local path
-hf org/repo:quant
Hugging Face pull
unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0
-ngl N
GPU layers to offload
999 = all layers on GPU/Metal
-c N
Context size (tokens)
8192โ65536; higher = more RAM
-fa on
Flash attention
On when supported โ faster long context
--port N
HTTP port
8080 (convention)
-np N
Parallel slots / users
4 on shared LAN box
--spec-type draft-mtp
Multi-token prediction
Qwen 3.6 MTP GGUF builds
-md draft.gguf
Speculative draft model
Smaller companion file
--embedding
Embedding mode
Dedicated embed models only
Sampling (quality vs creativity): tie to temperature, top-p, top-k guide โ llama-server accepts the same params in API JSON as OpenAI.
Omit -m and point at a models directory โ llama-server loads models on demand, evicts LRU when memory is full. Useful for a home lab with 3B for fast autocomplete and 27B for hard prompts. See router mode docs in server README.
Most users should download pre-made GGUF from trusted quantizers instead.
Speculative decoding
Load a small draft model (-md draft.gguf) or MTP weights (--spec-type draft-mtp) so the big model verifies multiple tokens per step โ the ~32 tok/s vs ~18 tok/s gap in the Qwen 3.6 post.
What people complain about (honest limits)
Issue
Reality
"Too many flags"
True vs Ollama. Keep a shell alias or Makefile for your daily server line.
Ollama drama
Some HN posts argue ethics of Ollama's packaging; llama.cpp is the neutral upstream. explainx.ai supports both โ pick by friction vs control.
Binary names, server routes, and star counts reflect the ggml-org/llama.cpp repo as of July 2, 2026 โ verify release notes before production deployments. Last updated: July 2, 2026.