| name | tiledbvcf |
| description | Efficient storage and retrieval of genomic variant data using TileDB. Scalable VCF/BCF ingestion, incremental sample addition, compressed storage, parallel queries, and export capabilities for population genomics. |
| license | MIT license |
| metadata | version: "1.0" skill-author: Jeremy Leipzig |
TileDB-VCF
Overview
TileDB-VCF is a high-performance C++ library with Python and CLI interfaces for efficient storage and retrieval of genomic variant-call data. Built on TileDB's sparse array technology, it enables scalable ingestion of VCF/BCF files, incremental sample addition without expensive merging operations, and efficient parallel queries of variant data stored locally or in the cloud.
When to Use This Skill
This skill should be used when:
- Learning TileDB-VCF concepts and workflows
- Prototyping genomics analyses and pipelines
- Working with small-to-medium datasets (< 1000 samples)
- Need incremental addition of new samples to existing datasets
- Require efficient querying of specific genomic regions across many samples
- Working with cloud-stored variant data (S3, Azure, GCS)
- Need to export subsets of large VCF datasets
- Building variant databases for cohort studies
- Educational projects and method development
- Performance is critical for variant data operations
Quick Start
Installation
Preferred Method: Conda/Mamba
CONDA_SUBDIR=osx-64
conda config --env --set subdir osx-64
conda create -n tiledb-vcf "python<3.10"
conda activate tiledb-vcf
conda install -c conda-forge mamba
mamba install -y -c conda-forge -c bioconda -c tiledb tiledb-py tiledbvcf-py pandas pyarrow numpy
Alternative: Docker Images
docker pull tiledb/tiledbvcf-py
docker pull tiledb/tiledbvcf-cli
Basic Examples
Create and populate a dataset:
import tiledbvcf
ds = tiledbvcf.Dataset(uri="my_dataset", mode="w",
cfg=tiledbvcf.ReadConfig(memory_budget=1024))
ds.ingest_samples(["sample1.vcf.gz", "sample2.vcf.gz"])
Query variant data:
ds = tiledbvcf.Dataset(uri="my_dataset", mode="r")
df = ds.read(
attrs=["sample_name", "pos_start", "pos_end", "alleles", "fmt_GT"],
regions=["chr1:1000000-2000000", "chr2:500000-1500000"],
samples=["sample1", "sample2", "sample3"]
)
print(df.head())
Export to VCF:
import os
ds.export(
regions=["chr21:8220186-8405573"],
samples=["HG00101", "HG00097"],
output_format="v",
output_dir=os.path.expanduser("~"),
)
Core Capabilities
1. Dataset Creation and Ingestion
Create TileDB-VCF datasets and incrementally ingest variant data from multiple VCF/BCF files. This is appropriate for building population genomics databases and cohort studies.
Requirements:
- Single-sample VCFs only: Multi-sample VCFs are not supported
- Index files required: VCF/BCF files must have indexes (.csi or .tbi)
Common operations:
- Create new datasets with optimized array schemas
- Ingest single or multiple VCF/BCF files in parallel
- Add new samples incrementally without re-processing existing data
- Configure memory usage and compression settings
- Handle various VCF formats and INFO/FORMAT fields
- Resume interrupted ingestion processes
- Validate data integrity during ingestion
2. Efficient Querying and Filtering
Query variant data with high performance across genomic regions, samples, and variant attributes. This is appropriate for association studies, variant discovery, and population analysis.
Common operations:
- Query specific genomic regions (single or multiple)
- Filter by sample names or sample groups
- Extract specific variant attributes (position, alleles, genotypes, quality)
- Access INFO and FORMAT fields efficiently
- Combine spatial and attribute-based filtering
- Stream large query results
- Perform aggregations across samples or regions
3. Data Export and Interoperability
Export data in various formats for downstream analysis or integration with other genomics tools. This is appropriate for sharing datasets, creating analysis subsets, or feeding other pipelines.
Common operations:
- Export to standard VCF/BCF formats
- Generate TSV files with selected fields
- Create sample/region-specific subsets
- Maintain data provenance and metadata
- Lossless data export preserving all annotations
- Compressed output formats
- Streaming exports for large datasets
4. Population Genomics Workflows
TileDB-VCF excels at large-scale population genomics analyses requiring efficient access to variant data across many samples and genomic regions.
Common workflows:
- Genome-wide association studies (GWAS) data preparation
- Rare variant burden testing
- Population stratification analysis
- Allele frequency calculations across populations
- Quality control across large cohorts
- Variant annotation and filtering
- Cross-population comparative analysis
Key Concepts
Array Schema and Data Model
TileDB-VCF Data Model:
- Variants stored as sparse arrays with genomic coordinates as dimensions
- Samples stored as attributes allowing efficient sample-specific queries
- INFO and FORMAT fields preserved with original data types
- Automatic compression and chunking for optimal storage
Schema Configuration:
config = tiledbvcf.ReadConfig(
memory_budget=2048,
region_partition=(0, 3095677412),
sample_partition=(0, 10000)
)
Coordinate Systems and Regions
Critical: TileDB-VCF uses 1-based genomic coordinates following VCF standard:
- Positions are 1-based (first base is position 1)
- Ranges are inclusive on both ends
- Region "chr1:1000-2000" includes positions 1000-2000 (1001 bases total)
Region specification formats:
regions = ["chr1:1000000-2000000"]
regions = ["chr1:1000000-2000000", "chr2:500000-1500000"]
regions = ["chr1"]
regions = ["chr1:999999-2000000"]
Memory Management
Performance considerations:
- Set appropriate memory budget based on available system memory
- Use streaming queries for very large result sets
- Partition large ingestions to avoid memory exhaustion
- Configure tile cache for repeated region access
- Use parallel ingestion for multiple files
- Optimize region queries by combining nearby regions
Cloud Storage Integration
TileDB-VCF seamlessly works with cloud storage:
ds = tiledbvcf.Dataset(uri="s3://bucket/dataset", mode="r")
ds = tiledbvcf.Dataset(uri="azure://container/dataset", mode="r")
ds = tiledbvcf.Dataset(uri="gcs://bucket/dataset", mode="r")
Common Pitfalls
- Memory exhaustion during ingestion: Use appropriate memory budget and batch processing for large VCF files
- Inefficient region queries: Combine nearby regions instead of many separate queries
- Missing sample names: Ensure sample names in VCF headers match query sample specifications
- Coordinate system confusion: Remember TileDB-VCF uses 1-based coordinates like VCF standard
- Large result sets: Use streaming or pagination for queries returning millions of variants
- Cloud permissions: Ensure proper authentication for cloud storage access
- Concurrent access: Multiple writers to the same dataset can cause corruptionโuse appropriate locking
CLI Usage
TileDB-VCF provides a command-line interface with the following subcommands:
Available Subcommands:
create - Creates an empty TileDB-VCF dataset
store - Ingests samples into a TileDB-VCF dataset
export - Exports data from a TileDB-VCF dataset
list - Lists all sample names present in a TileDB-VCF dataset
stat - Prints high-level statistics about a TileDB-VCF dataset
utils - Utils for working with a TileDB-VCF dataset
version - Print the version information and exit
tiledbvcf create --uri my_dataset
tiledbvcf store --uri my_dataset --samples sample1.vcf.gz,sample2.vcf.gz
tiledbvcf export --uri my_dataset \
--regions "chr1:1000000-2000000" \
--sample-names "sample1,sample2"
tiledbvcf list --uri my_dataset
tiledbvcf stat --uri my_dataset
Advanced Features
Allele Frequency Analysis
af_df = tiledbvcf.read_allele_frequency(
uri="my_dataset",
regions=["chr1:1000000-2000000"],
samples=["sample1", "sample2", "sample3"]
)
Sample Quality Control
qc_results = tiledbvcf.sample_qc(
uri="my_dataset",
samples=["sample1", "sample2"]
)
Custom Configurations
config = tiledbvcf.ReadConfig(
memory_budget=4096,
tiledb_config={
"sm.tile_cache_size": "1000000000",
"vfs.s3.region": "us-east-1"
}
)
Resources
Getting Help
Open Source TileDB-VCF Resources
Open Source Documentation:
TileDB-Cloud Resources
For Large-Scale/Production Genomics:
Getting Started:
Scaling to TileDB-Cloud
When your genomics workloads outgrow single-node processing, TileDB-Cloud provides enterprise-scale capabilities for production genomics pipelines.
Note: This section covers TileDB-Cloud capabilities based on available documentation. For complete API details and current functionality, consult the official TileDB-Cloud documentation and API reference.
Setting Up TileDB-Cloud
1. Create Account and Get API Token
2. Install TileDB-Cloud Python Client
pip install tiledb-cloud
pip install tiledb-cloud[life-sciences]
3. Configure Authentication
export TILEDB_REST_TOKEN="your_api_token"
import tiledb.cloud
Migrating from Open Source to TileDB-Cloud
Large-Scale Ingestion
import tiledb.cloud.vcf
tiledb.cloud.vcf.ingestion.ingest_vcf_dataset(
source="s3://my-bucket/vcf-files/",
output="tiledb://my-namespace/large-dataset",
namespace="my-namespace",
acn="my-s3-credentials",
ingest_resources={"cpu": "16", "memory": "64Gi"}
)
Distributed Query Processing
import tiledb.cloud.vcf
import tiledbvcf
dataset_uri = "tiledb://TileDB-Inc/gvcf-1kg-dragen-v376"
ds = tiledbvcf.Dataset(dataset_uri, tiledb_config=cfg)
samples = ds.samples()
attrs = ["sample_name", "fmt_GT", "fmt_AD", "fmt_DP"]
regions = ["chr13:32396898-32397044", "chr13:32398162-32400268"]
df = tiledb.cloud.vcf.read(
dataset_uri=dataset_uri,
regions=regions,
samples=samples,
attrs=attrs,
namespace="my-namespace",
)
df.to_pandas()
Enterprise Features
Data Sharing and Collaboration
dataset_uri = "tiledb://shared-namespace/population-study"
Cost Optimization
- Serverless Compute: Pay only for actual compute time
- Auto-scaling: Automatically scale up/down based on workload
- Spot Instances: Use cost-optimized compute for batch jobs
- Data Tiering: Automatic hot/cold storage management
Security and Compliance
- End-to-end Encryption: Data encrypted in transit and at rest
- Access Controls: Fine-grained permissions and audit logs
- HIPAA/SOC2 Compliance: Enterprise security standards
- VPC Support: Deploy in private cloud environments
When to Migrate Checklist
โ
Migrate to TileDB-Cloud if you have:
Getting Started with TileDB-Cloud
- Start Free: TileDB-Cloud offers free tier for evaluation
- Migration Support: TileDB team provides migration assistance
- Training: Access to genomics-specific tutorials and examples
- Professional Services: Custom deployment and optimization
Next Steps: