serving-llms-vllm
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Works with
0
total installs
0
this week
24.2K
GitHub stars
0
upvotes
Install Skill
Run in your terminal
0
installs
0
this week
24.2K
stars
Installation Guide
How to use serving-llms-vllm on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- ›Cursor installed and configured on your machine
- ›Node.js 16+ with npm — verify with
node --version - ›Active project directory where you want to add
serving-llms-vllm
Run the install command
Execute the skills CLI command in your project's root directory to begin installation:
Fetches serving-llms-vllm from davila7/claude-code-templates and configures it for Cursor.
Select Cursor when prompted
The CLI shows a list of agents. Use arrow keys and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Restart Cursor to activate serving-llms-vllm. Access via /serving-llms-vllm in your agent's command palette.
Security Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your environment. Always review source, verify the publisher, and test in isolation before production.
Documentation
vLLM - High-Performance LLM Serving
Quick start
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
Installation:
pip install vllm
Basic offline inference:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve meta-llama/Llama-3-8B-Instruct
# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
model='meta-llama/Llama-3-8B-Instruct',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
Common workflows
Workflow 1: Production API deployment
Copy this checklist and track progress:
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
Step 1: Configure server settings
Choose configuration based on your model size:
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# For 30B-70B models with tensor parallelism
vllm serve meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--port 8000
# For production with caching and metrics
vllm serve meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-metrics \
--metrics-port 9090 \
--port 8000 \
--host 0.0.0.0
Step 2: Test with limited traffic
Run load test before production:
# Install load testing tool
pip install locust
# Create test_load.py with sample requests
# Run: locust -f test_load.py --host http://localhost:8000
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
Step 3: Enable monitoring
vLLM exposes Prometheus metrics on port 9090:
curl http://localhost:9090/metrics | grep vllm
Key metrics to monitor:
vllm:time_to_first_token_seconds- Latencyvllm:num_requests_running- Active requestsvllm:gpu_cache_usage_perc- KV cache utilization
Step 4: Deploy to production
Use Docker for consistent deployment:
# Run vLLM in Docker
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching
Step 5: Verify performance metrics
Check that deployment meets targets:
- TTFT < 500ms (for short prompts)
- Throughput > target req/sec
- GPU utilization > 80%
- No OOM errors in logs
Workflow 2: Offline batch inference
For processing large datasets without server overhead.
Copy this checklist:
Batch Processing:
- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results
Step 1: Prepare input data
# Load prompts from file
prompts = []
with open("prompts.txt") as f:
prompts = [line.strip() for line in f]
print(f"Loaded {len(prompts)} prompts")
Step 2: Configure LLM engine
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3-8B-Instruct",
tensor_parallel_size=2, # Use 2 GPUs
gpu_memory_utilization=0.9,
max_model_len=4096
)
sampling = SamplingParams(
temperature=0.7,
top_p=0.95,
max_tokens=512,
stop=["</s>", "\n\n"]
)
Step 3: Run batch inference
vLLM automatically batches requests for efficiency:
# Process all prompts in one call
outputs = llm.generate(prompts, sampling)
# vLLM handles batching internally
# No need to manually chunk prompts
Step 4: Process results
# Extract generated text
results = []
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
results.append({
"prompt": prompt,
"generated": generated,
"tokens": len(output.outputs[0].token_ids)
})
# Save to file
import json
with open("results.jsonl", "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Processed {len(results)} prompts")
Workflow 3: Quantized model serving
Fit large models in limited GPU memory.
Quantization Setup:
- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy
Step 1: Choose quantization method
- AWQ: Best for 70B models, minimal accuracy loss
- GPTQ: Wide model support, good compression
- FP8: Fastest on H100 GPUs
Step 2: Find or create quantized model
Use pre-quantized models from HuggingFace:
# Search for AWQ models
# Example: TheBloke/Llama-2-70B-AWQ
Step 3: Launch with quantization flag
# Using pre-quantized model
vllm serve TheBloke/Llama-2-70B-AWQ \
--quantization awq \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95
# Results: 70B model in ~40GB VRAM
Step 4: Verify accuracy
Test outputs match expected quality:
# Compare quantized vs non-quantized responses
# Verify task-specific performance unchanged
When to use vs alternatives
Use vLLM when:
- Deploying production LLM APIs (100+ req/sec)
- Serving OpenAI-compatible endpoints
- Limited GPU memory but need large models
- Multi-user applications (chatbots, assistants)
- Need low latency with high throughput
Use alternatives instead:
- llama.cpp: CPU/edge inference, single-user
- HuggingFace transformers: Research, prototyping, one-off generation
- TensorRT-LLM: NVIDIA-only, need absolute maximum performance
- Text-Generation-Inference: Already in HuggingFace ecosystem
Common issues
Issue: Out of memory during model loading
Reduce memory usage:
vllm serve MODEL \
--gpu-memory-utilization 0.7 \
--max-model-len 4096
Or use quantization:
vllm serve MODEL --quantization awq
Issue: Slow first token (TTFT > 1 second)
Enable prefix caching for repeated prompts:
vllm serve MODEL --enable-prefix-caching
For long prompts, enable chunked prefill:
vllm serve MODEL --enable-chunked-prefill
Issue: Model not found error
Use --trust-remote-code for custom models:
vllm serve MODEL --trust-remote-code
Issue: Low throughput (<50 req/sec)
Increase concurrent sequences:
vllm serve MODEL --max-num-seqs 512
Check GPU utilization with nvidia-smi - should be >80%.
Issue: Inference slower than expected
Verify tensor parallelism uses power of 2 GPUs:
vllm serve MODEL --tensor-parallel-size 4 # Not 3
Enable speculative decoding for faster generation:
vllm serve MODEL --speculative-model DRAFT_MODEL
Advanced topics
Server deployment patterns: See Submit your Claude Code skill and start earningList & Monetize Your Skill
Use Cases
Task Automation & Efficiency
Automate repetitive workflows and reduce manual effort
Example
Generate reports, summarize documents, draft communications
Save 3-5 hours per week on routine tasks
Knowledge Enhancement
Learn new skills, understand complex topics, get expert guidance
Example
Explain concepts, provide examples, suggest learning resources
Accelerate learning and skill development by 2x
Quality Improvement
Enhance output quality through reviews, suggestions, and refinements
Example
Review drafts, suggest improvements, catch errors
Improve work quality by 30-40% with less effort
Implementation Guide
Prerequisites
- ›Claude Desktop or compatible AI client with skill support
- ›Clear understanding of task or problem to solve
- ›Willingness to iterate and refine outputs
Time Estimate
15-45 minutes depending on use case complexity
Steps
- 1Install skill using provided installation command
- 2Test with simple use case relevant to your work
- 3Evaluate output quality and relevance
- 4Iterate on prompts to improve results
- 5Integrate into regular workflow if valuable
Common Pitfalls
- ⚠Expecting perfect results without iteration
- ⚠Not providing enough context in prompts
- ⚠Using skill for tasks outside its intended scope
- ⚠Accepting outputs without review and validation
Best Practices
✓ Do
- +Start with clear, specific prompts
- +Provide relevant context and constraints
- +Review and refine all outputs before using
- +Iterate to improve output quality
- +Document successful prompt patterns
✗ Don't
- −Don't use without understanding skill limitations
- −Don't skip validation of outputs
- −Don't share sensitive information in prompts
- −Don't expect skill to replace human judgment
💡 Pro Tips
- ★Be specific about desired format and style
- ★Ask for multiple options to choose from
- ★Request explanations to understand reasoning
- ★Combine AI efficiency with human expertise
When to Use This
✓ Use when
Use when skill capabilities match your task, clear ROI on time saved, and you can validate outputs. Best for repetitive tasks, learning, and quality improvement.
✗ Avoid when
Avoid when task requires deep expertise you can't validate, involves sensitive decisions, or when learning process is more valuable than speed of completion.
Learning Path
- 1Familiarize yourself with skill capabilities and limitations
- 2Start with low-risk, non-critical tasks
- 3Progress to more complex and valuable use cases
- 4Build expertise through regular use and experimentation
Related Skills
ml-paper-writing
66davila7/claude-code-templates
docker-expert
12davila7/claude-code-templates
remotion-best-practices
10davila7/claude-code-templates
senior-data-engineer
8davila7/claude-code-templates
telegram-mini-app
8davila7/claude-code-templates
senior-backend
7davila7/claude-code-templates
Reviews
- MMaya Menon★★★★★Dec 24, 2024
I recommend serving-llms-vllm for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.
- YYash Thakker★★★★★Dec 12, 2024
serving-llms-vllm fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
- AAma Kapoor★★★★★Dec 12, 2024
serving-llms-vllm fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
- RRahul Santra★★★★★Dec 8, 2024
serving-llms-vllm reduced setup friction for our internal harness; good balance of opinion and flexibility.
- MMaya Verma★★★★★Nov 15, 2024
Solid pick for teams standardizing on skills: serving-llms-vllm is focused, and the summary matches what you get after install.
- PPratham Ware★★★★★Nov 3, 2024
serving-llms-vllm is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.
- OOshnikdeep★★★★★Oct 22, 2024
Keeps context tight: serving-llms-vllm is the kind of skill you can hand to a new teammate without a long onboarding doc.
- MMaya Thomas★★★★★Oct 6, 2024
serving-llms-vllm has been reliable in day-to-day use. Documentation quality is above average for community skills.
- MMia Patel★★★★★Sep 25, 2024
serving-llms-vllm fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
- DDiya Malhotra★★★★★Sep 21, 2024
serving-llms-vllm reduced setup friction for our internal harness; good balance of opinion and flexibility.
showing 1-10 of 28
Discussion
Comments — not star reviews- No comments yet — start the thread.