tao-generate-referring-expressions
Four-step image referring-expression pipeline: turns images plus KITTI bounding-box labels into region
Installation Guide
How to use tao-generate-referring-expressions on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your machine
- Node.js 16+ with npm (verify with node --version)
- An active project directory where you want to add tao-generate-referring-expressions
Run the install command
Execute the skills CLI command in your project's root directory to begin installation:
Fetches tao-generate-referring-expressions from nvidia/skills and configures it for Cursor.
Select Cursor when prompted
The CLI shows a list of agents. Use arrow keys and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Restart Cursor to activate tao-generate-referring-expressions. Access via /tao-generate-referring-expressions in your agent's command palette.
Security Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your environment. Always review source, verify the publisher, and test in isolation before production.
Documentation
| name | tao-generate-referring-expressions |
| description | "Four-step image referring-expression pipeline: turns images plus KITTI bounding-box labels into region descriptions, scene captions, grounded referring expressions, and (optionally) verified expressions via VLM distillation. Use when the user wants to generate referring-expression annotations from images with KITTI labels, build region descriptions, produce grouped grounding phrases tied to bboxes, run a double-check verification pass on grounding expressions, auto-label traffic / scene images for referring datasets, or run the image_referring_expression pipeline. Triggers include 'referring expression', 'region description', 'KITTI labels', 'spatial relationship annotation', 'auto-label image referring expression', 'image_referring_expression'." |
| license | Apache-2.0 |
| compatibility | Requires docker + nvidia-container-toolkit + at least one VLM endpoint (Gemini API key or OpenAI-compatible). |
| metadata | author: NVIDIA Corporation version: "0.1.0" |
| tags | - image - referring-expression - kitti - bounding-boxes - auto-label - vlm |
| allowed-tools | Read Bash Write |
Image Referring Expression Pipeline
Generate referring-expression and grounding annotations from images with KITTI-format bounding box labels. A single VLM (Gemini or any OpenAI-compatible endpoint) runs four steps: per-object region descriptions, holistic image captions, grouped grounding expressions tied to bboxes, and an optional double-check verification pass.
Purpose
Transform (image, KITTI labels) pairs into a unified annotations.jsonl containing rich, grounded referring expressions. The VLM acts as a "teacher" annotator: Steps 0-1 see the image; Step 2 groups Step 0 outputs into grouping phrases with bbox lists; Step 3 (optional) re-examines those bboxes against the image and corrects mismatches.
Pipeline Architecture
Step 0: Region expression ──┐
                            ├──▶ Step 2: Grounding expression ──▶ [Step 3: Double check]
Step 1: Image caption ──────┘                                          (optional)
- Step 0 (region_expr): VLM emits one short discriminative phrase per KITTI bbox (bbox_2d, type, color, description).
- Step 1 (image_caption): VLM emits a holistic, location-agnostic scene caption.
- Step 2 (grounding_expr): VLM groups Step 0 objects into grouping phrases and returns one bbox list per group, optionally using Step 1's caption as extra context.
- Step 3 (double_check): VLM re-checks each Step 2 bbox against the image; bad matches are removed, slightly-off boxes get tightened.
Steps 0 and 1 run in parallel within a single thread pool (they only depend on the seed records). Each step writes its own step_<N>_*/annotations.jsonl and skips already-processed images on re-run unless workflow.force_reprocess: true.
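The skip-on-re-run behavior can be pictured with a small sketch. It assumes each line of a step's annotations.jsonl carries an `image_id` key (the key that appears in the pre-seeded schema under Inputs). The helper names `load_done_ids` and `pending_records` are hypothetical and are not part of the TAO codebase.

```python
import json
from pathlib import Path


def load_done_ids(step_file):
    """Collect the image_ids already written to a step's annotations.jsonl."""
    step_file = Path(step_file)
    if not step_file.exists():
        return set()
    done = set()
    with step_file.open() as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                done.add(json.loads(line)["image_id"])
    return done


def pending_records(records, step_file, force_reprocess=False):
    """Return the records a step still has to process.

    force_reprocess mirrors workflow.force_reprocess: when true, cached
    per-step output is ignored and every record is processed again.
    """
    if force_reprocess:
        return list(records)
    done = load_done_ids(step_file)
    return [r for r in records if r["image_id"] not in done]
```

Keying the cache on `image_id` means an interrupted run can be restarted without paying for VLM calls that already succeeded.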
Instructions
Initial setup
When a user wants to run this pipeline, walk through these steps:
- Images: Ask for data.image_dir, the directory containing .jpg, .jpeg, or .png images.
- KITTI labels: Ask for data.kitti_label_dir, the directory containing one .txt label file per image. Each label line must use KITTI format: <type> <truncated> <occluded> <alpha> <bbox_left> <bbox_top> <bbox_right> <bbox_bottom> .... Lines with fewer than 8 fields are silently skipped. Set this even for Step 1-only runs because Steps 0 and 2 require it.
- Resume from existing annotations: If the user already has a unified annotations.jsonl from a previous run, set data.input_annotations_jsonl to that file instead of seeding from data.image_dir and data.kitti_label_dir.
- API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
  - Gemini: set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
  - NIM (e.g. https://inference-api.nvidia.com/v1): set vlm.backend: "openai"; collect base_url, model_name, and api_key.
  - TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
    - Running: collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    - Not running: guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
    - Running: collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    - Not running: follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Custom (any other OpenAI-compatible endpoint): set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.

  If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
- Workflow steps: Choose one of:
  - Full pipeline: ["0", "1", "2", "3"]
  - No caption generation: ["0", "2", "3"], where Step 2 falls back to image-only context
  - No verification: ["0", "1", "2"]
  - Custom subset: any supported subset of steps
- Output format: Choose one of:
  - jsonl: unified schema only
  - legacy: byte-compatible .txt.stepN files only
  - both: writes both formats and is the default for downstream tooling
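Before launching a run, a step selection can be sanity-checked against the dependencies this document describes (Step 2 groups Step 0 output, Step 3 re-checks Step 2 output, Step 1 is optional context). The sketch below is an illustration only: `check_steps` and the `REQUIRES` table are hypothetical names, and the check deliberately ignores the resume case, where an earlier step's output may already exist on disk or in data.input_annotations_jsonl.

```python
VALID_STEPS = ("0", "1", "2", "3")

# Dependencies as described in this document (an assumption about what the
# real pipeline enforces): Step 2 consumes Step 0, Step 3 consumes Step 2.
REQUIRES = {"2": {"0"}, "3": {"2"}}


def check_steps(steps):
    """Return a list of problems with a workflow.steps selection (empty if none)."""
    problems = []
    chosen = set(steps)
    for step in steps:
        if step not in VALID_STEPS:
            problems.append(f"unknown step {step!r}")
    for step, needs in REQUIRES.items():
        missing = needs - chosen
        if step in chosen and missing:
            problems.append(f"step {step} needs step(s) {sorted(missing)}")
    return problems
```

For example, `check_steps(["0", "2", "3"])` returns an empty list, while `check_steps(["2", "3"])` reports that Step 2 is missing its Step 0 input.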
Running the pipeline
The pipeline runs inside the TAO Toolkit container via the auto_label CLI:
auto_label generate -e /path/to/spec.yaml \
results_dir=/results \
image_referring_expression.data.image_dir=/data/images \
image_referring_expression.data.kitti_label_dir=/data/labels \
image_referring_expression.vlm.gemini.api_key=$GOOGLE_API_KEY
Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_referring_expression". All fields support Hydra dot-notation overrides on the command line.
See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
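When the command is assembled from a script rather than typed by hand, the Hydra dot-notation overrides can be built programmatically. The following is a minimal sketch that only formats the argument list; `build_generate_command` is a hypothetical helper, and the override keys are the ones shown in the command above. It does not run the container or the CLI.

```python
import shlex

# Every pipeline-specific override in the documented command sits under this prefix.
PREFIX = "image_referring_expression"


def build_generate_command(spec_path, results_dir, overrides):
    """Assemble the auto_label generate argv with Hydra dot-notation overrides."""
    argv = ["auto_label", "generate", "-e", spec_path, f"results_dir={results_dir}"]
    for key, value in overrides.items():
        argv.append(f"{PREFIX}.{key}={value}")
    return argv


cmd = build_generate_command(
    "/path/to/spec.yaml",
    "/results",
    {
        "data.image_dir": "/data/images",
        "data.kitti_label_dir": "/data/labels",
    },
)
# shlex.join quotes any argument that needs it, so the line is safe to paste.
print(shlex.join(cmd))
```

Keeping the API key out of the override dictionary and passing it through the GOOGLE_API_KEY environment variable avoids leaking it into logs or shell history.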
Recommended pilot workflow
- Run on 5-10 images with all four steps.
- Inspect step_0_region_expr/annotations.jsonl: are object types, colors, and discriminating phrases accurate?
- Inspect step_2_grounding_expr/annotations.jsonl: are objects grouped sensibly, and do bbox coordinates match the described groups?
- Inspect step_3_double_check/annotations.jsonl: were mismatched bboxes removed or tightened? Are any new errors introduced (rare)?
- If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro or a larger Qwen3-VL endpoint), raise media_resolution / max_output_tokens, then re-run with workflow.force_reprocess=true.
- Scale to the full dataset once satisfied.
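Reviewing the Step 3 output by eye gets tedious beyond a few images, so a rough per-image summary can help during the pilot. The sketch below compares one Step 2 record with the matching Step 3 record using the `expressions[].instances[].bbox` shape described under Outputs. It assumes expression text is unchanged between the two steps and uses an IoU threshold of 0.5 to separate adjusted boxes from removed ones; both are assumptions, and `summarize_double_check` is a hypothetical helper.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def summarize_double_check(before, after, min_iou=0.5):
    """Count unchanged / adjusted / removed boxes between Step 2 and Step 3 records."""
    after_by_text = {
        e["text"]: [i["bbox"] for i in e["instances"]] for e in after["expressions"]
    }
    counts = {"unchanged": 0, "adjusted": 0, "removed": 0}
    for expr in before["expressions"]:
        candidates = after_by_text.get(expr["text"], [])
        for inst in expr["instances"]:
            box = inst["bbox"]
            best = max((iou(box, c) for c in candidates), default=0.0)
            if box in candidates:
                counts["unchanged"] += 1
            elif best >= min_iou:
                counts["adjusted"] += 1  # a close match survived: likely tightened
            else:
                counts["removed"] += 1  # no overlapping box left for this phrase
    return counts
```

A high removed count on the pilot set suggests Step 2 grouping is weak for the chosen model, which is a cue to try the stronger-model option before scaling up.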
Configuration
Key configuration fields (full reference in references/configuration.md):
| Field | Default | Description |
|---|---|---|
| workflow.steps | ["0","1","2","3"] | Which steps to execute (0=region_expr, 1=image_caption, 2=grounding_expr, 3=double_check) |
| workflow.max_workers | 4 | Parallel threads per step (watch API rate limits) |
| workflow.force_reprocess | false | Ignore cached per-step outputs and reprocess from scratch |
| workflow.output_format | "jsonl" (set to "both" in the default spec) | "jsonl", "legacy", or "both" |
| vlm.backend | "gemini" | "gemini" or "openai" (OpenAI-compatible endpoint) |
| data.image_dir | required | Directory of input images (.jpg / .jpeg / .png) |
| data.kitti_label_dir | required (unless resuming) | Directory of KITTI-format .txt label files |
| data.input_annotations_jsonl | "" | Optional pre-seeded annotations.jsonl (skips KITTI seeding) |
Inputs
Two ways to seed the pipeline:
- Image directory + KITTI labels (default). Set data.image_dir and data.kitti_label_dir. The orchestrator walks the image directory, reads the matching <stem>.txt KITTI file, parses bboxes (fields 0 + 4-7), reads each image's width / height via PIL, and writes a seed_annotations.jsonl to results_dir/.
- Pre-seeded annotations JSONL (resume / pre-computed regions). Set data.input_annotations_jsonl to a file with one {"image_id", "image_path", "width", "height", "kitti_bboxes": [...]} object per line.
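The KITTI seeding step can be illustrated with a short parser. This is a sketch under stated assumptions, not the orchestrator's actual code: it reads field 0 as the object type and fields 4-7 as the box, and skips lines with fewer than 8 fields as documented. The `type` and `bbox` key names inside each parsed entry are assumptions, since the inner layout of `kitti_bboxes` is not spelled out here.

```python
def parse_kitti_line(line):
    """Return {"type", "bbox"} for one KITTI label line, or None if it is too short."""
    fields = line.split()
    if len(fields) < 8:
        return None  # mirrors the documented behavior: short lines are skipped
    left, top, right, bottom = (float(v) for v in fields[4:8])
    return {"type": fields[0], "bbox": [left, top, right, bottom]}


def parse_kitti_text(text):
    """Parse every usable line from the contents of one KITTI label file."""
    parsed = (parse_kitti_line(line) for line in text.splitlines())
    return [p for p in parsed if p is not None]
```

Because short lines are dropped without an error, comparing the number of parsed boxes with the number of non-empty lines in a label file is a cheap way to spot malformed labels before spending VLM calls on them.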
Outputs
All outputs go to results_dir/:
- seed_annotations.jsonl: initial per-image records (unless input_annotations_jsonl was supplied).
- step_0_region_expr/annotations.jsonl: adds regions[] (each with bbox / bbox_2d, type, color, description).
- step_1_image_caption/annotations.jsonl: adds caption (string).
- step_2_grounding_expr/annotations.jsonl: adds expressions[] (each {text, instances: [{bbox: [x1,y1,x2,y2]}]}).
- step_3_double_check/annotations.jsonl: same shape as Step 2, with bboxes removed/updated.
- results_dir/annotations.jsonl: copy of the last completed step's output.
- When workflow.output_format is "legacy" or "both", each step also writes byte-compatible step_<N>_*/labels/<stem>.txt.stepN files for the original 2d-data-engine tooling.
Prerequisites
- Container: nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
- API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)
- PIL / Pillow: Required to read image dimensions during seeding (already present in the TAO container)