benchmark-models
garrytan/gstack · updated Apr 22, 2026
Cross-model benchmark skill for comparing Claude, GPT/Codex, and Gemini on the same prompt, with an optional LLM-judge quality pass. Imported from benchmark-models/SKILL.md in garrytan/gstack.
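The one-liner describes the shape of the workflow: fan the same prompt out to several models, collect the completions, and optionally have a judge model score them. As a rough sketch of that pattern (not the skill's actual code), the snippet below assumes an OpenAI-compatible `/v1/chat/completions` endpoint; the `callModel` helper, the `BASE_URL`/`API_KEY` environment variables, the model IDs, and the judge prompt are all placeholders.

```ts
// Illustrative sketch of the fan-out-plus-judge pattern; not the skill's code.
// Assumes an OpenAI-compatible /v1/chat/completions endpoint (hypothetical
// BASE_URL/API_KEY env vars) and Node 18+ for the global fetch.
type BenchResult = { model: string; output: string };

async function callModel(model: string, prompt: string): Promise<string> {
  const res = await fetch(`${process.env.BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.API_KEY}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function benchmark(prompt: string, models: string[], judge?: string) {
  // Fan the same prompt out to every model in parallel.
  const results: BenchResult[] = await Promise.all(
    models.map(async (model) => ({ model, output: await callModel(model, prompt) }))
  );

  if (judge) {
    // Optional LLM-judge pass: one model ranks the candidate answers.
    const transcript = results
      .map((r, i) => `Candidate ${i + 1} (${r.model}):\n${r.output}`)
      .join("\n\n");
    const verdict = await callModel(
      judge,
      `Rank these answers to the prompt "${prompt}" from best to worst, ` +
        `with a one-line reason for each:\n\n${transcript}`
    );
    console.log(verdict);
  }
  return results;
}
```

A call would look like `benchmark(prompt, ["claude-model", "gpt-model", "gemini-model"], "judge-model")`, where the model IDs are placeholders that depend on the providers you route through.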
Discussion
Product Hunt–style comments (not star reviews). No comments yet.
Ratings
4.5 ★★★★★ · 65 reviews

- ★★★★★ Chaitanya Patil · Dec 20, 2024
Solid pick for teams standardizing on skills: benchmark-models is focused, and the summary matches what you get after install.
- ★★★★★ Advait Singh · Dec 20, 2024
Useful defaults in benchmark-models — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.
- ★★★★★ Charlotte Gupta · Dec 16, 2024
benchmark-models has been reliable in day-to-day use. Documentation quality is above average for community skills.
- ★★★★★ Dev Jain · Dec 8, 2024
benchmark-models is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.
- ★★★★★ Arya Rahman · Dec 8, 2024
Solid pick for teams standardizing on skills: benchmark-models is focused, and the summary matches what you get after install.
- ★★★★★ Layla Choi · Nov 27, 2024
benchmark-models fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
- ★★★★★ James Abbas · Nov 27, 2024
I recommend benchmark-models for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.
- ★★★★★ Dev Reddy · Nov 19, 2024
benchmark-models reduced setup friction for our internal harness; good balance of opinion and flexibility.
- ★★★★★ Rahul Santra · Nov 11, 2024
I recommend benchmark-models for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.
- ★★★★★ Layla Kim · Nov 11, 2024
benchmark-models has been reliable in day-to-day use. Documentation quality is above average for community skills.
Showing 1-10 of 65 reviews