Initialize Warehouse Schema
Generate a comprehensive, user-editable schema reference file for the data warehouse.
Scripts: ../analyzing-data/scripts/ β All CLI commands below are relative to the analyzing-data skill's directory. Before running any scripts/cli.py command, cd to ../analyzing-data/ relative to this file.
What This Does
- Discovers all databases, schemas, tables, and columns from the warehouse
- Enriches with codebase context (dbt models, gusty SQL, schema docs)
- Records row counts and identifies large tables
- Generates
.astro/warehouse.md - a version-controllable, team-shareable reference
- Enables instant conceptβtable lookups without warehouse queries
Process
Step 1: Read Warehouse Configuration
cat ~/.astro/agents/warehouse.yml
Get the list of databases to discover (e.g., databases: [HQ, ANALYTICS, RAW]).
Step 2: Search Codebase for Context (Parallel)
Launch a subagent to find business context in code:
Task(
subagent_type="Explore",
prompt="""
Search for data model documentation in the codebase:
1. dbt models: **/models/**/*.yml, **/schema.yml
- Extract table descriptions, column descriptions
- Note primary keys and tests
2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
- Parse frontmatter for: description, primary_key, tests
- Note schema mappings
3. AGENTS.md or CLAUDE.md files with data layer documentation
Return a mapping of:
table_name -> {description, primary_key, important_columns, layer}
"""
)
Step 3: Parallel Warehouse Discovery
Launch one subagent per database using the Task tool:
For each database in configured_databases:
Task(
subagent_type="general-purpose",
prompt="""
Discover all metadata for database {DATABASE}.
Use the CLI to run SQL queries:
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py exec "df = run_sql('...')"
uv run scripts/cli.py exec "print(df)"
1. Query schemas:
SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA
2. Query tables with row counts:
SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
ORDER BY TABLE_SCHEMA, TABLE_NAME
3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'X'
Return a structured summary:
- Database name
- List of schemas with table counts
- For each table: name, row_count, key columns
- Flag any tables with >100M rows as "large"
"""
)
Run all subagents in parallel (single message with multiple Task calls).
Step 4: Discover Categorical Value Families
For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
uv run cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run cli.py exec "print(df)"
Group related values into families by common prefix/suffix (e.g., Export* for ExportCSV, ExportJSON, ExportParquet).
Step 5: Merge Results
Combine warehouse metadata + codebase context:
- Quick Reference table - concept β table mappings (pre-populated from code if found)
- Categorical Columns - value families for key filter columns
- Database sections - one per database
- Schema subsections - tables grouped by schema
- Table details - columns, row counts, descriptions from code, warnings
Step 6: Generate warehouse.md
Write the file to:
.astro/warehouse.md (default - project-specific, version-controllable)
~/.astro/agents/warehouse.md (if --global flag)
Output Format
# Warehouse Schema
> Generated by `/astronomer-data:warehouse-init` on {DATE}. Edit freely to add business context.
## Quick Reference
|---------|-------|------------|-------------|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
## Categorical Columns
When filtering on these columns, explore value families first (values often have variants):
|-------|--------|----------------|
| {TABLE} | {COLUMN} | `{PREFIX}*` ({VALUE1}, {VALUE2}, ...) |
## Data Layer Hierarchy
Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`
|-------|--------|---------|
| Reporting | `reporting.*` | Dashboard-optimized |
| Mart | `mart_*` | Combined analytics |
| Metric | `metric_*` | KPIs at various grains |
| Model | `model_*` | Cleansed sources of truth |
| Raw | `IN_*` | Source data - avoid |
## {DATABASE} Database
### {SCHEMA} Schema
#### {TABLE_NAME}
{DESCRIPTION from code if found}
|--------|------|-------------|
| COL1 | VARCHAR | {from code or inferred} |
- **Rows:** {ROW_COUNT}
- **Key column:** {PRIMARY_KEY from code or inferred}
{IF ROW_COUNT > 100M: - **β οΈ WARNING:** Large table - always add date filters}
## Relationships
{Inferred relationships based on column names like *_ID}
Command Options
| Option |
Effect |
/astronomer-data:warehouse-init |
Generate .astro/warehouse.md |
/astronomer-data:warehouse-init --refresh |
Regenerate, preserving user edits |
/astronomer-data:warehouse-init --database HQ |
Only discover specific database |
/astronomer-data:warehouse-init --global |
Write to ~/.astro/agents/ instead |
Step 7: Pre-populate Cache
After generating warehouse.md, populate the concept cache:
uv run cli.py concept import -p .astro/warehouse.md
uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
Step 8: Offer CLAUDE.md Integration (Ask User)
Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
- Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
- No, skip - Use warehouse.md and cache only
If user chooses Yes:
- Check if
.claude/CLAUDE.md or CLAUDE.md exists
- If exists, append the Quick Reference section (avoid duplicates)
- If not exists, create
.claude/CLAUDE.md with just the Quick Reference
Quick Reference section to add:
## Data Warehouse Quick Reference
When querying the warehouse, use these table mappings:
|---------|-------|------------|-------------|
{rows from warehouse.md Quick Reference}
**Large tables (always filter by date):** {list tables with >100M rows}
> Auto-generated by `/astronomer-data:warehouse-init`. Run `/astronomer-data:warehouse-init --refresh` to update.
If yes: Append the Quick Reference section to .claude/CLAUDE.md or CLAUDE.md.
After Generation
Tell the user:
Generated .astro/warehouse.md
Summary:
- {N} databases, {N} schemas, {N} tables
- {N} tables enriched with code descriptions
- {N} concepts cached for instant lookup
Next steps:
1. Edit .astro/warehouse.md to add business context
2. Commit to version cont