tooluniverse-multi-omics-integration

mims-harvard/tooluniverse · updated Apr 8, 2026

$ npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-multi-omics-integration
summary

Coordinate and integrate multiple omics datasets for comprehensive systems biology analysis. Orchestrates specialized ToolUniverse skills to perform cross-omics correlation, multi-omics clustering, pathway-level integration, and unified interpretation.

skill.md

Multi-Omics Integration


Domain Reasoning

Multi-omics integration asks whether different molecular layers tell a concordant story. If a gene is upregulated in RNA-seq AND its protein is elevated in proteomics, that is concordant evidence of true biological change. Discordance — high mRNA but low protein, or elevated protein without matching mRNA — may indicate post-transcriptional regulation (miRNA silencing, protein degradation, translational control) and is itself a meaningful finding worth reporting. Not every discordance is noise; some are the most interesting biology.
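The concordance check described above can be sketched roughly as follows. This is a minimal illustration, assuming per-gene log2 fold changes have already been computed for both layers; the function name `classify_concordance`, the label strings, and the `threshold` of 1.0 are hypothetical choices, not part of the skill.

```python
import pandas as pd


def classify_concordance(rna_lfc, protein_lfc, threshold=1.0):
    """Label each shared gene by how its RNA and protein changes agree.

    rna_lfc, protein_lfc: pandas Series of log2 fold changes indexed by gene.
    threshold: hypothetical |log2FC| cutoff for calling a change.
    """
    genes = rna_lfc.index.intersection(protein_lfc.index)
    labels = {}
    for gene in genes:
        r, p = rna_lfc[gene], protein_lfc[gene]
        r_changed, p_changed = abs(r) >= threshold, abs(p) >= threshold
        if r_changed and p_changed:
            # Both layers changed: same sign is concordant, opposite is not
            labels[gene] = "concordant" if r * p > 0 else "discordant_opposite"
        elif r_changed:
            labels[gene] = "rna_only"
        elif p_changed:
            labels[gene] = "protein_only"
        else:
            labels[gene] = "unchanged"
    return pd.Series(labels, name="concordance")


# Toy values for illustration only, not real measurements
rna_lfc = pd.Series({"GENE_A": 2.1, "GENE_B": 1.8, "GENE_C": 0.1, "GENE_D": -1.5})
protein_lfc = pd.Series({"GENE_A": 1.4, "GENE_B": 0.2, "GENE_C": 1.6, "GENE_D": 1.2})
concordance = classify_concordance(rna_lfc, protein_lfc)
```

Genes labeled `rna_only`, `protein_only`, or `discordant_opposite` are the candidates for post-transcriptional regulation and are worth reporting rather than filtering out.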

LOOK UP, DON'T GUESS

  • Expected RNA-protein correlation ranges: compute Spearman r from the actual data; the typical range (0.4-0.6) is a guide, not a guarantee.
  • Pathway enrichment results: run ReactomeAnalysis_pathway_enrichment or gseapy on the actual gene lists; never list enriched pathways from memory.
  • eQTL associations: query GTEx or eQTL databases for the specific variant and tissue; do not assume regulatory relationships.
  • Methylation-expression directionality at specific loci: retrieve experimental data; promoter repression is the canonical model but exceptions exist.

When to Use This Skill

  • User has multiple omics datasets (RNA-seq + proteomics, methylation + expression, etc.)
  • Cross-omics correlation queries (e.g., "How does methylation affect expression?")
  • Multi-omics biomarker discovery or patient subtyping
  • Systems biology questions requiring multiple molecular layers
  • Precision medicine applications with multi-omics patient data

Workflow Overview

Phase 1: Data Loading & QC
  Load each omics type, format-specific QC, normalize
  Supported: RNA-seq, proteomics, methylation, CNV/SNV, metabolomics

Phase 2: Sample Matching
  Harmonize sample IDs, find common samples, handle missing omics

Phase 3: Feature Mapping
  Map features to common gene-level identifiers
  CpG->gene (promoter), CNV->gene, metabolite->enzyme

Phase 4: Cross-Omics Correlation
  RNA vs Protein (translation efficiency)
  Methylation vs Expression (epigenetic regulation)
  CNV vs Expression (dosage effect)
  eQTL variants vs Expression (genetic regulation)

Phase 5: Multi-Omics Clustering
  MOFA+, NMF, SNF for patient subtyping

Phase 6: Pathway-Level Integration
  Aggregate omics evidence at pathway level
  Score pathway dysregulation with combined evidence

Phase 7: Biomarker Discovery
  Feature selection across omics, multi-omics classification

Phase 8: Integrated Report
  Summary, correlations, clusters, pathways, biomarkers

See: phase_details.md for complete code and implementation details.
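As an illustration of Phase 3 (feature mapping), the sketch below averages CpG beta values per gene. It is a simplified example under stated assumptions: `aggregate_cpgs_to_genes` and the `cpg_to_gene` annotation are hypothetical, and a real promoter mapping would come from the array manifest or a genome annotation.

```python
import pandas as pd


def aggregate_cpgs_to_genes(beta, cpg_to_gene):
    """Average promoter CpG beta values per gene.

    beta: DataFrame of beta values, CpG probes (rows) x samples (columns).
    cpg_to_gene: Series mapping probe ID -> gene symbol (hypothetical
        annotation; in practice this comes from the array manifest).
    """
    shared = beta.index.intersection(cpg_to_gene.index)
    genes = cpg_to_gene.loc[shared]
    # Group probe rows by their gene and take the mean per sample
    return beta.loc[shared].groupby(genes).mean()


# Toy beta matrix and probe annotation, for illustration only
beta = pd.DataFrame(
    {"S1": [0.2, 0.4, 0.9], "S2": [0.1, 0.3, 0.8]},
    index=["cg01", "cg02", "cg03"],
)
cpg_to_gene = pd.Series({"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"})
gene_meth = aggregate_cpgs_to_genes(beta, cpg_to_gene)
```

The result is a genes x samples matrix that can be correlated against expression on a shared gene index.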


Supported Data Types

| Omics | Formats | QC Focus |
|---|---|---|
| Transcriptomics | CSV/TSV, HDF5, h5ad | Low-count filter, normalize (TPM/DESeq2), log-transform |
| Proteomics | MaxQuant, Spectronaut, DIA-NN | Missing value imputation, median/quantile normalization |
| Methylation | IDAT, beta matrices | Failed probes, batch correction, cross-reactive filter |
| Genomics | VCF, SEG (CNV) | Variant QC, CNV segmentation |
| Metabolomics | Peak tables | Missing values, normalization |
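The transcriptomics QC row can be sketched as below. This uses counts per million (CPM) as a simple stand-in for the TPM or DESeq2 normalization named in the table; the function name and the `min_count` and `min_samples` thresholds are hypothetical and should be tuned to the dataset.

```python
import numpy as np
import pandas as pd


def filter_and_log_cpm(counts, min_count=10, min_samples=2):
    """Drop low-count genes, then return log2(CPM + 1).

    counts: DataFrame of raw counts, genes (rows) x samples (columns).
    min_count, min_samples: hypothetical thresholds; a gene is kept if at
        least `min_samples` samples have `min_count` or more reads.
    """
    keep = (counts >= min_count).sum(axis=1) >= min_samples
    filtered = counts.loc[keep]
    # Library size per sample, computed from the unfiltered counts
    lib_sizes = counts.sum(axis=0)
    cpm = filtered / lib_sizes * 1e6
    return np.log2(cpm + 1)


# Toy count matrix, for illustration only
counts = pd.DataFrame(
    {"S1": [100, 0, 50], "S2": [200, 1, 60], "S3": [150, 0, 5]},
    index=["G1", "G2", "G3"],
)
log_expr = filter_and_log_cpm(counts)
```

Here `G2` is dropped because no sample reaches the count threshold, while `G3` is kept because two samples do.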

Core Operations

Sample Matching

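def match_samples_across_omics(omics_data_dict):
    """Match samples across omics datasets (DataFrames with samples as columns)."""
    sample_ids = {k: set(df.columns) for k, df in omics_data_dict.items()}
    common_samples = sorted(set.intersection(*sample_ids.values()))
    if not common_samples:
        raise ValueError("No samples are shared across all omics datasets")
    # Subset every dataset to the shared samples, in the same column order
    matched_data = {k: df[common_samples] for k, df in omics_data_dict.items()}
    return common_samples, matched_data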
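A usage sketch for the matching step, including the ID harmonization mentioned in Phase 2. The `harmonize_ids` helper and its convention (trim whitespace, upper-case) are assumptions for illustration; real datasets often need project-specific rules such as truncating barcodes. The matching function is repeated here only so the block runs on its own.

```python
import pandas as pd


def harmonize_ids(df):
    """Strip whitespace and upper-case sample IDs (hypothetical convention)."""
    out = df.copy()
    out.columns = [str(c).strip().upper() for c in out.columns]
    return out


def match_samples_across_omics(omics_data_dict):
    """Same logic as the function above, repeated so this block runs alone."""
    sample_ids = {k: set(df.columns) for k, df in omics_data_dict.items()}
    common_samples = sorted(set.intersection(*sample_ids.values()))
    matched = {k: df[common_samples] for k, df in omics_data_dict.items()}
    return common_samples, matched


# Toy matrices with inconsistent sample IDs, for illustration only
rna = pd.DataFrame([[5.0, 6.0, 7.0]], index=["GENE_A"], columns=["p01", "P02 ", "P03"])
protein = pd.DataFrame([[1.0, 2.0]], index=["GENE_A"], columns=["P02", "P01"])

omics = {"rna": harmonize_ids(rna), "protein": harmonize_ids(protein)}
common, matched = match_samples_across_omics(omics)
```

After matching, every dataset has the same samples in the same column order, which is what the per-gene correlations rely on.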

Cross-Omics Correlation

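from scipy.stats import spearmanr

# Expected directions (verify on the actual data):
#   RNA vs Protein: positive r, typically ~0.4-0.6
#   Methylation vs Expression: negative r (promoter repression)
#   CNV vs Expression: positive r (dosage effect)

# rna and protein are genes x samples DataFrames with matched sample columns
common_genes = rna.index.intersection(protein.index)
results = {}
for gene in common_genes:
    r, p = spearmanr(rna.loc[gene], protein.loc[gene])
    results[gene] = (r, p)  # keep every gene's result, not just the last one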
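Because the loop above tests one gene at a time, the p-values usually need multiple-testing correction. The sketch below wraps the correlation in a function and adds a Benjamini-Hochberg adjustment written out in NumPy; `correlate_layers` and `benjamini_hochberg` are hypothetical names, and the inputs are assumed to be genes x samples DataFrames without missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (FDR)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted


def correlate_layers(layer_a, layer_b):
    """Per-gene Spearman correlation between two genes x samples DataFrames."""
    genes = layer_a.index.intersection(layer_b.index)
    samples = layer_a.columns.intersection(layer_b.columns)
    rows = []
    for gene in genes:
        r, p = spearmanr(layer_a.loc[gene, samples], layer_b.loc[gene, samples])
        rows.append({"gene": gene, "r": r, "p": p})
    result = pd.DataFrame(rows).set_index("gene")
    result["q"] = benjamini_hochberg(result["p"].to_numpy())
    return result


# Toy matrices, for illustration only
samples = ["S1", "S2", "S3", "S4", "S5"]
rna = pd.DataFrame(
    [[1, 2, 3, 4, 5], [5, 3, 4, 1, 2]], index=["GENE_A", "GENE_B"], columns=samples
)
protein = pd.DataFrame(
    [[2, 4, 6, 8, 10], [1, 2, 3, 4, 5]], index=["GENE_A", "GENE_B"], columns=samples
)
rna_protein = correlate_layers(rna, protein)
```

The same function can be applied to methylation vs expression or CNV vs expression once those layers are mapped to gene level.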

Pathway Integration

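import numpy as np

# Score pathway dysregulation using combined evidence from all omics.
# Inputs are per-gene arrays for the genes in one pathway: sum the absolute
# evidence per gene, then average over the pathway's genes.
pathway_score = np.mean(np.abs(rna_fc) + np.abs(protein_fc) + np.abs(meth_diff) + np.abs(cnv))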
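The layers in this score sit on different scales (log fold changes, beta differences, copy number), so a raw sum can be dominated by one of them. One possible variant, sketched here as an assumption rather than the skill's own method, rescales each layer before combining and tolerates genes that are missing from a layer. The function name `score_pathways` and the `pathways` dictionary are hypothetical.

```python
import pandas as pd


def score_pathways(evidence, pathways):
    """Combine per-gene evidence across omics, then average per pathway.

    evidence: DataFrame, genes (rows) x omics layers (columns), holding signed
        effect sizes such as log2 fold change or beta difference.
    pathways: dict mapping pathway name -> list of gene symbols (hypothetical
        input; in practice taken from an enrichment resource).
    """
    magnitude = evidence.abs()
    # Rescale each layer to [0, 1] by its maximum so no layer dominates by units
    scaled = magnitude / magnitude.max(axis=0).replace(0, 1)
    # Mean over the layers available for each gene (NaN = layer not measured)
    gene_score = scaled.mean(axis=1, skipna=True)
    scores = {}
    for name, genes in pathways.items():
        members = gene_score.reindex(genes).dropna()
        if len(members) > 0:
            scores[name] = members.mean()
    return pd.Series(scores, name="pathway_score").sort_values(ascending=False)


# Toy effect sizes and pathway membership, for illustration only
evidence = pd.DataFrame(
    {
        "rna_fc": [2.0, -1.0, 0.0],
        "protein_fc": [1.0, 0.5, 0.0],
        "meth_diff": [-0.4, 0.1, 0.0],
    },
    index=["G1", "G2", "G3"],
)
pathways = {"PW_X": ["G1", "G2"], "PW_Y": ["G3", "G9"]}
scores = score_pathways(evidence, pathways)
```

Max-scaling is only one option; ranks or z-scores per layer are alternatives when outliers are a concern.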

See: phase_details.md for full implementations of each operation.


Multi-Omics Clustering Methods

| Method | Description | Best For |
|---|---|---|
| MOFA+ | Latent factors explaining cross-omics variation | Identifying shared/omics-specific drivers |
| Joint NMF | Shared decomposition across omics | Patient subtype discovery |
| SNF | Similarity network fusion | Integrating heterogeneous data types |
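To make the NMF row concrete, here is a deliberately simplified sketch: it stacks the layers into one non-negative matrix and factorizes it with Lee-Seung multiplicative updates in plain NumPy. This is a stand-in for illustration, not MOFA+, SNF, or a dedicated joint-NMF package; the function name `nmf_cluster`, the per-layer scaling, and the iteration count are assumptions.

```python
import numpy as np


def nmf_cluster(layers, n_clusters=2, n_iter=200, seed=0):
    """Cluster samples with NMF on stacked, non-negative omics layers.

    layers: list of arrays, each features x samples with the same sample order.
    Returns (labels, W, H) where labels[i] is the dominant factor of sample i.
    """
    rng = np.random.default_rng(seed)
    scaled = []
    for x in layers:
        x = np.asarray(x, dtype=float)
        x = x - x.min()                                  # shift to non-negative
        scaled.append(x / (np.linalg.norm(x) + 1e-12))   # equal weight per layer
    v = np.vstack(scaled)                                # (all features) x samples
    w = rng.random((v.shape[0], n_clusters)) + 1e-3
    h = rng.random((n_clusters, v.shape[1])) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the Frobenius objective
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return h.argmax(axis=0), w, h


# Toy data: six samples in two groups, two layers, for illustration only
toy_rng = np.random.default_rng(1)
group = np.array([0, 0, 0, 1, 1, 1])
rna_like = np.vstack([5 * (group == 0), 5 * (group == 0), 5 * (group == 1)]) + toy_rng.random((3, 6))
protein_like = np.vstack([3 * (group == 1), 3 * (group == 0)]) + toy_rng.random((2, 6))
labels, w, h = nmf_cluster([rna_like, protein_like], n_clusters=2)
```

Cluster assignments from a single random start can be unstable, so in practice several seeds and a consensus step are advisable before interpreting subtypes.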

ToolUniverse Skills Coordination

| Skill | Used For | Phase |
|---|---|---|
| tooluniverse-rnaseq-deseq2 | RNA-seq analysis | 1, 4 |
| tooluniverse-epigenomics | Methylation, ChIP-seq | 1, 4 |
| tooluniverse-variant-analysis | CNV/SNV processing | 1, 3, 4 |
| tooluniverse-protein-interactions | Protein network context | 6 |
| tooluniverse-gene-enrichment | Pathway enrichment | 6 |
| tooluniverse-expression-data-retrieval | Public data retrieval | 1 |
| tooluniverse-target-research | Gene/protein annotation | 3, 8 |

Use Cases

Cancer Multi-Omics

Integrate TCGA RNA-seq + proteomics + methylation + CNV to identify patient subtypes, cross-omics driver genes, and multi-omics biomarkers.

eQTL + Expression + Methylation

Identify SNP -> methylation -> expression regulatory chains (mediation analysis).
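A minimal sketch of the mediation idea, using the classic product-of-coefficients approach with ordinary least squares. The function names are hypothetical, the data are simulated, and a real analysis would add covariates and a significance test for the indirect effect (for example, a bootstrap), which this sketch omits.

```python
import numpy as np


def ols_coefs(y, predictors):
    """Least-squares coefficients for y ~ intercept + predictors."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs  # coefs[0] is the intercept


def mediation_effects(snp, methylation, expression):
    """Product-of-coefficients estimate for SNP -> methylation -> expression."""
    a = ols_coefs(methylation, [snp])[1]                      # SNP -> methylation
    _, direct, b = ols_coefs(expression, [snp, methylation])  # adjusted effects
    total = ols_coefs(expression, [snp])[1]                   # SNP -> expression
    return {"a": a, "b": b, "indirect": a * b, "direct": direct, "total": total}


# Simulated data, for illustration only: the SNP lowers methylation,
# and methylation represses expression
rng = np.random.default_rng(0)
snp = rng.integers(0, 3, size=300).astype(float)          # genotype dosage 0/1/2
methylation = 0.5 - 0.1 * snp + rng.normal(0, 0.02, 300)
expression = 10 - 4 * methylation + rng.normal(0, 0.05, 300)
effects = mediation_effects(snp, methylation, expression)
```

In this simulation the SNP acts on expression only through methylation, so the indirect effect should account for most of the total effect.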

Drug Response Multi-Omics

Predict drug response using baseline multi-omics profiles; identify resistance/sensitivity pathways.

See: phase_details.md "Use Cases" for detailed step-by-step workflows.


Quantified Minimums

| Component | Requirement |
|---|---|
| Omics types | At least 2 datasets |
| Common samples | At least 10 across omics |
| Cross-correlation | Pearson/Spearman computed |
| Clustering | At least one method (MOFA+, NMF, or SNF) |
| Pathway integration | Enrichment with multi-omics evidence scores |
| Report | Summary, correlations, clusters, pathways, biomarkers |
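These minimums can be checked up front before running the pipeline. The sketch below is one hypothetical way to do that; `check_minimums` is not part of the skill, and the recommended sample size of 20 is taken from the Limitations section.

```python
def check_minimums(n_omics, n_common_samples, clustering_methods,
                   min_omics=2, min_samples=10, recommended_samples=20):
    """Return (errors, warnings) against the minimums listed above."""
    allowed = {"MOFA+", "NMF", "SNF"}
    errors, warnings = [], []
    if n_omics < min_omics:
        errors.append(f"need at least {min_omics} omics types, got {n_omics}")
    if n_common_samples < min_samples:
        errors.append(
            f"need at least {min_samples} common samples, got {n_common_samples}"
        )
    elif n_common_samples < recommended_samples:
        # Allowed, but below the recommended size for integration
        warnings.append(
            f"{n_common_samples} common samples is below the recommended "
            f"{recommended_samples}"
        )
    if not allowed.intersection(clustering_methods):
        errors.append("no clustering method from MOFA+, NMF, SNF was run")
    return errors, warnings


errors, warnings = check_minimums(
    n_omics=2, n_common_samples=15, clustering_methods=["NMF"]
)
```

With 15 common samples the run passes the hard minimum but still produces a warning about statistical power.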

Limitations

  • Sample size: n >= 20 recommended for integration
  • Missing data: Fall back to pairwise integration when not every sample has every omics layer
  • Batch effects: Different platforms require careful normalization
  • Computational: Large datasets may require significant memory
  • Interpretation: Results require domain expertise for validation
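The pairwise fallback for missing data can be sketched as follows: instead of requiring samples present in every layer, find the shared samples for each pair of layers. The function name `pairwise_common_samples` is hypothetical, and the default of 10 mirrors the minimum stated in this document.

```python
from itertools import combinations

import pandas as pd


def pairwise_common_samples(omics_data_dict, min_samples=10):
    """List shared samples for every pair of omics layers.

    omics_data_dict: name -> DataFrame with samples as columns.
    min_samples: pairs with fewer shared samples are skipped.
    """
    pairs = {}
    for (name_a, df_a), (name_b, df_b) in combinations(omics_data_dict.items(), 2):
        shared = sorted(set(df_a.columns) & set(df_b.columns))
        if len(shared) >= min_samples:
            pairs[(name_a, name_b)] = shared
    return pairs


# Toy layers with partially overlapping samples, for illustration only
omics = {
    "rna": pd.DataFrame(0.0, index=["GENE_A"], columns=["S1", "S2", "S3", "S4"]),
    "protein": pd.DataFrame(0.0, index=["GENE_A"], columns=["S1", "S2", "S3"]),
    "methylation": pd.DataFrame(0.0, index=["GENE_A"], columns=["S3", "S4", "S5"]),
}
pairs = pairwise_common_samples(omics, min_samples=2)
```

Each pair can then be correlated on its own sample set, at the cost of results that are not directly comparable across pairs with very different sample sizes.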

Detailed Reference

  • phase_details.md - Complete code for all phases, correlation functions, clustering, pathway integration, biomarker discovery, report template, and detailed use cases