What is llama.cpp in simple terms?

llama.cpp is an open-source C/C++ project for running large language models on consumer hardware. Created by Georgi Gerganov (now maintained under ggml-org on GitHub), it loads GGUF-quantized model files and runs inference on CPU, Apple Metal, NVIDIA CUDA, AMD ROCm, or Vulkan. The llama-server binary exposes an OpenAI-compatible HTTP API so tools like OpenCode, Continue, and Codex OSS can talk to local weights without cloud APIs.

Is llama.cpp the same as Ollama?

No. Ollama is a higher-level packaging layer — pull, serve, model library — that uses llama.cpp (and MLX on Mac) under the hood. llama.cpp is the raw engine: more flags, more control, no curated model registry. Power users pick llama.cpp for MTP speculative decoding, exact layer offload, router mode, and embedding/rerank endpoints. Beginners often start with Ollama and graduate to llama.cpp when they need throughput tuning.

How do I run a model with llama.cpp?

Install the binaries (Homebrew, a release zip, or a CMake build), download a GGUF file from Hugging Face, then either chat in the terminal with `llama-cli -m model.gguf` or serve an API with `llama-server -m model.gguf --port 8080`. To pull a model straight from Hugging Face, use `-hf org/model:quant-tag` instead of `-m`. Point agents at `http://127.0.0.1:8080/v1`.
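The two launch routes (`-m` for a local file, `-hf` for a Hugging Face repo) can be sketched as one small shell helper. This is only an illustration: the function name `serve_model` and the `DRY_RUN` variable are hypothetical conveniences, not part of llama.cpp. Only `llama-server`, `-m`, `-hf`, and `--port` come from the text.

```bash
# serve_model SPEC [PORT]
# SPEC is either a path to a local .gguf file or an org/model:quant tag.
# Set DRY_RUN=1 to print the command instead of launching the server.
serve_model() {
  spec=$1
  port=${2:-8080}
  if [ -f "$spec" ]; then
    flag=-m      # an existing file on disk: load it directly
  else
    flag=-hf     # anything else is treated as a Hugging Face repo spec
  fi
  ${DRY_RUN:+echo} llama-server "$flag" "$spec" --port "$port"
}
```

For example, `serve_model ./model.gguf` serves a local file on port 8080, while `DRY_RUN=1 serve_model org/model:quant-tag 9090` only prints the command it would run.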

What is GGUF and why does llama.cpp use it?

GGUF (usually expanded as GPT-Generated Unified Format) is a single-file container for quantized model weights and metadata. It replaced the older GGML/GGJT formats and is the standard format on Hugging Face for local inference. llama.cpp reads GGUF directly, with no PyTorch install required at runtime. See explainx.ai's quantization guide for Q4 vs Q8 trade-offs.

Which llama.cpp binary should I use?

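Use `llama-cli` for interactive terminal chat and quick tests. Use `llama-server` for everything else: web UI at localhost:8080, OpenAI-compatible `/v1/chat/completions`, parallel users, tool calling, multimodal (experimental), embeddings, and router mode for multiple models. `llama-quantize` shrinks a full-precision GGUF (for example F16) to a smaller quant when you self-quantize; converting safetensors to GGUF in the first place is a separate step, done with the `convert_hf_to_gguf.py` script in the llama.cpp repository.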
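For the self-quantize path, here is a minimal two-step sketch: convert a Hugging Face checkpoint directory to a full-precision GGUF, then quantize it. It assumes you run it from a llama.cpp source checkout (where `convert_hf_to_gguf.py` lives) with that script's Python requirements installed. The function name `hf_to_quant`, the `DRY_RUN` variable, and the output naming scheme are hypothetical.

```bash
# hf_to_quant MODEL_DIR OUT_PREFIX [QUANT]
# Step 1: safetensors checkpoint -> F16 GGUF.
# Step 2: F16 GGUF -> quantized GGUF (default type Q4_K_M).
# Set DRY_RUN=1 to print both commands instead of running them.
hf_to_quant() {
  model_dir=$1
  out_prefix=$2
  quant=${3:-Q4_K_M}
  ${DRY_RUN:+echo} python3 convert_hf_to_gguf.py "$model_dir" \
    --outfile "$out_prefix-f16.gguf" --outtype f16 &&
  ${DRY_RUN:+echo} llama-quantize "$out_prefix-f16.gguf" \
    "$out_prefix-$quant.gguf" "$quant"
}
```

Most people skip this and download a ready-made quant; the sketch matters mainly when no GGUF exists yet for a model or you want a quant type nobody has published.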

Can I use llama.cpp with coding agents like OpenCode?

Yes. Start llama-server, then set baseURL to http://127.0.0.1:8080/v1 in opencode.jsonc or any OpenAI-compatible harness. explainx.ai's OpenCode local stack guide and Qwen 3.6 27B post include full config blocks and smoke tests.

What Is llama.cpp? Run GGUF Models Locally | explainx.ai Blog


```text
┌──────────────────────────────────────────┐
│  llama-server / llama-cli  (user tools)  │
├──────────────────────────────────────────┤
│  libllama  (model load, decode, sampling)│
├──────────────────────────────────────────┤
│  ggml  (tensor ops — CPU/GPU backends)   │
└──────────────────────────────────────────┘
         ▲
         │ reads
    model.gguf  (quantized weights on disk)
```

| Tool | Relationship to llama.cpp | When to pick it |
| --- | --- | --- |
| llama.cpp | Core engine | Tuning, MTP, router, embeddings, exotic hardware |
| Ollama | Uses llama.cpp (and MLX on Mac) | Fastest first run, `ollama pull`, no flag soup |
| LM Studio | Embeds llama.cpp server | GUI model browser + local API toggle |
| MLX | Separate Apple stack | Pure Mac optimization; some models faster than llama.cpp on M-series |
| vLLM | Different codebase (PagedAttention) | Team server, many concurrent users on datacenter GPU |

| Endpoint | URL |
| --- | --- |
| Web chat UI | `http://127.0.0.1:8080` |
| OpenAI models list | `http://127.0.0.1:8080/v1/models` |
| Chat completions | `http://127.0.0.1:8080/v1/chat/completions` |
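Model loading can take a while, so scripts usually wait for the server before calling these endpoints. A minimal sketch, assuming the `/health` route that llama-server exposes next to the web UI (it returns a success status once the model is loaded and an error status while it is still loading). The names `wait_for_server` and `CURL_BIN` are hypothetical.

```bash
# wait_for_server [BASE_URL] [TRIES]
# Polls BASE_URL/health once per second; returns 0 as soon as the server
# answers with a success status, or 1 after TRIES failed attempts.
# CURL_BIN is an assumed override hook (useful for testing without a server).
wait_for_server() {
  base_url=${1:-http://127.0.0.1:8080}
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    if "${CURL_BIN:-curl}" -sf "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep 1
    fi
  done
  return 1
}
```

A typical use is `wait_for_server http://127.0.0.1:8080 60 && echo ready` right after starting `llama-server` in the background.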

| Flag | Meaning | Typical value |
| --- | --- | --- |
| `-m path.gguf` | Model file | Local path |
| `-hf org/repo:quant` | Hugging Face pull | `unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0` |
| `-ngl N` | GPU layers to offload | `999` = all layers on GPU/Metal |
| `-c N` | Context size (tokens) | `8192`–`65536`; higher = more RAM |
| `-fa on` | Flash attention | On when supported; faster long context |
| `--port N` | HTTP port | `8080` (convention) |
| `-np N` | Parallel slots / users | `4` on shared LAN box |
| `--spec-type draft-mtp` | Multi-token prediction | Qwen 3.6 MTP GGUF builds |
| `-md draft.gguf` | Speculative draft model | Smaller companion file |
| `--embedding` | Embedding mode | Dedicated embed models only |
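Several of these flags usually travel together, so it helps to wrap the daily server line in a shell function. The sketch below is one way to do that; the function name `llama_serve`, the `LLAMA_*` environment variables, `DRY_RUN`, and the default values are all hypothetical choices, while the flags themselves are the ones from the table.

```bash
# llama_serve MODEL.gguf
# Defaults are illustrative; override any of them from the environment,
# e.g. LLAMA_CTX=32768 LLAMA_SLOTS=4 llama_serve model.gguf
# Set DRY_RUN=1 to print the command instead of running it.
llama_serve() {
  model=${1:?usage: llama_serve model.gguf}
  ${DRY_RUN:+echo} llama-server -m "$model" \
    -ngl "${LLAMA_NGL:-999}" \
    -c "${LLAMA_CTX:-16384}" \
    -fa on \
    -np "${LLAMA_SLOTS:-1}" \
    --port "${LLAMA_PORT:-8080}"
}
```

Keeping the flags in one function means a change such as a larger context is a single environment variable rather than an edited command line. If your build rejects `-fa on`, check `llama-server --help` for the form it expects.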

| Machine | Starting point |
| --- | --- |
| 32GB Mac | 7B–13B Q4, `-ngl 999`, `-c 16384` |
| 48–64GB Mac | 27B Q4–Q8, MTP if available |
| 24GB Nvidia | 7B–14B Q4/Q5, watch KV cache vs `-c` |
| 48GB+ Nvidia | 27B–32B Q6/Q8, raise `-np` for roommates |
| CPU only | `-ngl 0`, tiny models (1B–3B), patience |
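To make "watch KV cache vs `-c`" concrete, here is a back-of-envelope estimate of KV cache size. It assumes a standard transformer with grouped-query attention and an unquantized 16-bit cache, so the size is roughly 2 (keys and values) × layers × KV heads × head dimension × context tokens × 2 bytes. Models with sliding-window or hybrid attention will differ. The function name and the example dimensions are hypothetical; read the real values from your model card or GGUF metadata.

```bash
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX [BYTES_PER_VALUE]
# Prints the approximate KV cache size in MiB.
# BYTES_PER_VALUE defaults to 2 (a 16-bit cache).
kv_cache_mib() {
  bytes_per=${5:-2}
  # Leading 2 = one tensor for keys plus one for values, per layer.
  echo $(( 2 * $1 * $2 * $3 * $4 * $bytes_per / 1024 / 1024 ))
}
```

For a hypothetical model with 32 layers and 8 KV heads of dimension 128, `kv_cache_mib 32 8 128 8192` prints `1024` (about 1 GiB), while `kv_cache_mib 32 8 128 65536` prints `8192` (about 8 GiB). That memory comes on top of the weights, which is why a large `-c` can push an otherwise comfortable quant out of VRAM.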

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

```jsonc
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama": {
      "name": "llama.cpp (local)",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "local"
      },
      "models": {
        "local": { "name": "local-gguf" }
      }
    }
  },
  "model": "llama/local"
}
```
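Before launching an agent against this config, it is worth checking that the `baseURL` actually answers. The sketch below is one possible smoke test, not the one from the linked guides: it hits the models list and then sends a tiny chat request. The names `smoke_test` and `CURL_BIN`, the `"local"` model id in the payload, and the `max_tokens` value are assumptions.

```bash
# smoke_test [BASE_URL]
# 1) the models list must be reachable, 2) a tiny chat request must succeed.
# CURL_BIN is an assumed override hook (useful for testing without a server).
smoke_test() {
  base=${1:-http://127.0.0.1:8080/v1}
  curl_bin=${CURL_BIN:-curl}
  "$curl_bin" -sf "$base/models" >/dev/null ||
    { echo "models endpoint unreachable"; return 1; }
  "$curl_bin" -sf "$base/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"local","messages":[{"role":"user","content":"ping"}],"max_tokens":8}' \
    >/dev/null ||
    { echo "chat completion failed"; return 1; }
  echo "ok"
}
```

Run `smoke_test` with the same URL you put in `baseURL`. If it prints `ok`, remaining agent failures are more likely about the model (for example weak tool calling) than about the connection.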

| Issue | Reality |
| --- | --- |
| "Too many flags" | True vs Ollama. Keep a shell alias or Makefile for your daily server line. |
| Ollama drama | Some HN posts argue ethics of Ollama's packaging; llama.cpp is the neutral upstream. explainx.ai supports both; pick by friction vs control. |
| MoE sloppiness | Architecture matters more than runtime (see: dense vs MoE local coding). |
| Tool calling quality | llama-server supports tools; model must be trained for reliable function JSON. Weak local models → failed agent loops. |
| Windows friction | Builds exist; CUDA path is smoother on Linux. WSL2 + Nvidia is the usual Windows power-user route. |
| No training | Inference only. Fine-tune elsewhere (Unsloth GLM guide), infer here. |

| Question | Short answer |
| --- | --- |
| Do I need a GPU? | No, CPU works (slow). Metal on Apple Silicon, CUDA on Nvidia, Vulkan/ROCm elsewhere for speed. |
| What file format? | GGUF quants; see quantization guide. |
| Simplest run command? | `llama-server -m model.gguf --port 8080` → browser UI + `/v1` API. |
| vs Ollama? | Ollama = easy wrapper; llama.cpp = control plane. Same weights possible, different UX. |
| vs vLLM? | vLLM = multi-user production on big GPUs; llama.cpp = laptop to workstation breadth. |
| Best for coding agents? | `llama-server` + OpenAI-compatible `/v1`; see the full local + OpenCode path. |
| Example model walkthrough? | Qwen 3.6 27B + MTP flags. |

What Is llama.cpp? Install, Run GGUF Models, and Serve OpenAI-Compatible APIs

TL;DR — what people search after "what is llama.cpp"

Related posts

Tencent Hy3 GGUF — 1-Bit and 4-Bit Quants for Single-GPU llama.cpp

How to Run Open Source Models Locally and Wire Them Into OpenCode (2026)

Qwen 3.6 27B Local Dev Guide: llama.cpp, OpenCode, and Why Dense Beats MoE

What llama.cpp actually is

llama.cpp vs the wrappers

Install llama.cpp

macOS (fastest)

Linux — prebuilt release

Build from source (CUDA / ROCm / custom)

Models — where GGUF files live

The two binaries you will actually use

`llama-cli` — terminal chat

`llama-server` — API + web UI

Essential flags (the ones that matter)

Step-by-step: first model in 10 minutes

1. Pick a small instruct model

2. Start the server

3. Chat in browser

4. Hit the API like OpenAI

5. Wire a coding agent

Advanced paths (when basics work)

Router mode — multiple models, one port

Embeddings and RAG

Self-quantize

Speculative decoding

What people complain about (honest limits)

llama.cpp in the explainx.ai local stack