tooluniverse-expression-data-retrieval

mims-harvard/tooluniverse · updated Apr 8, 2026

$ npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-expression-data-retrieval
Gene Expression & Omics Data Retrieval

Retrieve gene expression experiments and multi-omics datasets with disambiguation and quality assessment.

IMPORTANT: Always use English terms in tool calls. Respond in the user's language.

LOOK UP DON'T GUESS: Never assume which datasets exist or their accessions. Always search to confirm.

Domain Reasoning

Before retrieving, determine: organism, tissue, experimental design (case-control/time-series/dose-response). These affect which database to search and how to interpret results. RNA-seq provides wider dynamic range; microarray has extensive legacy data. Prioritize experiments with >=3 biological replicates, complete annotations, and both raw+processed data.

Workflow

Phase 0: Clarify (if ambiguous) → Phase 1: Disambiguate → Phase 2: Search & Retrieve → Phase 3: Report

Phase 0: Clarification (When Needed)

Ask ONLY if: the gene name is ambiguous, the tissue/condition is unclear, or the organism is not specified. Skip for: specific accessions (E-MTAB-*, E-GEOD-*, S-BSST*), a clear disease/tissue plus organism, or explicit platform requests.
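The skip rule for accessions can be sketched as a small check. This is a minimal illustration, not part of the skill: the regular expression only covers the three prefixes named above and assumes they are followed by digits, and `find_accessions` and `needs_clarification` are hypothetical helper names.

```python
import re
from typing import List, Optional

# Assumption: accessions are one of the three prefixes named above followed by digits.
ACCESSION_RE = re.compile(r"\b(E-MTAB-\d+|E-GEOD-\d+|S-BSST\d+)\b", re.IGNORECASE)


def find_accessions(query: str) -> List[str]:
    """Return accession-like tokens found in the query, upper-cased."""
    return [m.group(1).upper() for m in ACCESSION_RE.finditer(query)]


def needs_clarification(query: str, organism: Optional[str]) -> bool:
    """Skip clarification when an accession is present; otherwise ask if the organism is missing."""
    if find_accessions(query):
        return False
    return organism is None
```

A real check would also consider ambiguous gene names and unclear tissue or condition, which need lookups rather than pattern matching.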


Phase 1: Query Disambiguation

Resolve official gene symbol (HGNC for human, MGI for mouse). Note common aliases for search expansion.

| User Query Type | Search Strategy |
|---|---|
| Specific accession | Direct retrieval |
| Gene + condition | "[gene] [condition]" + species filter |
| Disease only | "[disease]" + species filter |
| Technology-specific | Add platform keywords |
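The search strategies above amount to joining whichever terms are known into one keyword string and passing the species separately. A minimal sketch, assuming a hypothetical helper named `build_keywords`:

```python
from typing import Optional


def build_keywords(
    gene: Optional[str] = None,
    condition: Optional[str] = None,
    disease: Optional[str] = None,
    platform: Optional[str] = None,
) -> str:
    """Join the known search terms into one free-text keyword string."""
    parts = [gene, condition, disease, platform]
    return " ".join(p.strip() for p in parts if p and p.strip())
```

For a gene + condition query, `build_keywords(gene="TP53", condition="hypoxia")` gives `"TP53 hypoxia"`. The species stays out of the string so it can be supplied as the separate species filter.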

Phase 2: Data Retrieval (Internal)

Search silently. Do NOT narrate the process.

```python
# Assumes `tu` is an initialized ToolUniverse instance with tools loaded.
# Bracketed strings are placeholders; accessions come from the search results.

# ArrayExpress search
result = tu.tools.arrayexpress_search_experiments(keywords="[gene/disease]", species="[species]", limit=20)

# Get experiment details, samples, files
details = tu.tools.arrayexpress_get_experiment(accession=accession)
samples = tu.tools.arrayexpress_get_experiment_samples(accession=accession)
files = tu.tools.arrayexpress_get_experiment_files(accession=accession)

# BioStudies for multi-omics
biostudies = tu.tools.biostudies_search(query="[keywords]", limit=10)
study = tu.tools.biostudies_get_study(accession=study_accession)
study_files = tu.tools.biostudies_get_study_files(accession=study_accession)
```

Fallback Chains

| Primary | Fallback |
|---|---|
| ArrayExpress search | BioStudies search |
| arrayexpress_get_experiment | biostudies_get_study |
| arrayexpress_get_experiment_files | Note "Files unavailable" |
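One way to express a fallback chain is to try each source in order and keep the first non-empty result. The sketch below is an assumption about how this could be wired, not the skill's implementation: `first_success` is a hypothetical helper, and treating an exception or an empty result as "fall through" is an illustrative choice.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def first_success(
    steps: Sequence[Tuple[str, Callable[[], Any]]],
) -> Tuple[Optional[str], Any]:
    """Run steps in order; return (name, result) for the first non-empty result."""
    for name, call in steps:
        try:
            result = call()
        except Exception:
            continue  # assumption: any failure means "try the next source"
        if result:
            return name, result
    return None, None  # every source failed or returned nothing
```

Each step would wrap one tool call in a zero-argument callable, for example a `lambda` around `tu.tools.arrayexpress_search_experiments(...)` followed by one around `tu.tools.biostudies_search(...)`. When the chain ends with `(None, None)` for a file lookup, the report notes "Files unavailable".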

Phase 3: Report Dataset Profile

Present as a Dataset Search Report. Hide search process. Include:

  1. Search Summary: query, databases searched, result count
  2. Top Experiments (per experiment):
    • Accession, organism, type (RNA-seq/microarray), platform, sample count, date
    • Description, experimental design (conditions, replicates, tissue)
    • Sample groups table, data files table
    • Quality assessment (●●●/●●○/●○○/○○○)
  3. Multi-Omics Studies (from BioStudies): accession, type, data types included
  4. Summary Table: all experiments ranked
  5. Recommendations: best dataset for user's purpose, integration notes
  6. Data Access: download links, database URLs
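The ranked summary table in item 4 could be rendered along these lines. The ranking key (quality tier first, then sample count) and the dictionary fields `accession`, `type`, `samples`, and `tier` are assumptions for illustration; the skill does not prescribe a ranking rule.

```python
from typing import Dict, List

# Assumption: better tiers sort first; unknown symbols sort last.
TIER_ORDER = {"●●●": 0, "●●○": 1, "●○○": 2, "○○○": 3}


def rank_experiments(experiments: List[Dict]) -> List[Dict]:
    """Sort by quality tier, then by descending sample count."""
    return sorted(
        experiments,
        key=lambda e: (TIER_ORDER.get(e["tier"], len(TIER_ORDER)), -e["samples"]),
    )


def summary_table(experiments: List[Dict]) -> str:
    """Render the ranked experiments as a pipe-delimited table."""
    lines = [
        "| Rank | Accession | Type | Samples | Quality |",
        "|---|---|---|---|---|",
    ]
    for rank, e in enumerate(rank_experiments(experiments), start=1):
        lines.append(
            f"| {rank} | {e['accession']} | {e['type']} | {e['samples']} | {e['tier']} |"
        )
    return "\n".join(lines)
```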

Data Quality Tiers

| Tier | Symbol | Criteria |
|---|---|---|
| High | ●●● | >=3 bio replicates, complete metadata, processed data available |
| Medium | ●●○ | 2-3 replicates OR some metadata gaps |
| Low | ●○○ | No replicates, sparse metadata, or access issues |
| Caution | ○○○ | Single sample, no replication, outdated platform |
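The tiers above can be read as an ordered set of checks. The sketch below is one possible interpretation, not a definitive rule: the function name `quality_tier`, its parameters, and the order in which overlapping criteria are resolved are all assumptions.

```python
def quality_tier(
    replicates: int,
    metadata_complete: bool,
    processed_available: bool,
    single_sample: bool = False,
) -> str:
    """Map basic dataset properties to a tier symbol (illustrative reading of the tiers)."""
    if single_sample:
        return "○○○"  # Caution: a single sample cannot support replication
    if replicates >= 3 and metadata_complete and processed_available:
        return "●●●"  # High
    if replicates >= 2:
        return "●●○"  # Medium: replicated, but something is missing
    return "●○○"  # Low: no replication
```

Under this reading, a dataset with three replicates but incomplete metadata lands in Medium rather than High.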

Reasoning Framework

Dataset quality: Prioritize >=3 biological replicates, complete annotations, both raw+processed data. Single-replicate experiments can inform but not be sole evidence.

Platform comparison: RNA-seq = wider dynamic range, novel transcripts. Microarray = probe-limited but extensive legacy data. Cross-platform combining requires batch correction.

Metadata scoring: Rate 0-5 on: (1) sample annotations, (2) design documented, (3) pipeline described, (4) raw data deposited, (5) publication linked. Score <=2 warrants caution.
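The 0-5 metadata score is a count of satisfied criteria. A minimal sketch, assuming each criterion is recorded as a boolean flag; the key names and the helpers `metadata_score` and `warrants_caution` are hypothetical:

```python
from typing import Dict

# The five criteria from the scoring rule, as hypothetical flag names.
CRITERIA = (
    "sample_annotations",
    "design_documented",
    "pipeline_described",
    "raw_data_deposited",
    "publication_linked",
)


def metadata_score(flags: Dict[str, bool]) -> int:
    """Count how many of the five criteria are met (missing flags count as unmet)."""
    return sum(1 for criterion in CRITERIA if flags.get(criterion))


def warrants_caution(score: int) -> bool:
    """A score of 2 or lower warrants caution."""
    return score <= 2
```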

GEO vs ArrayExpress: GEO has broader coverage (older studies); ArrayExpress enforces stricter metadata. BioStudies captures multi-omics. Search both.

Synthesis Questions

  1. Does the dataset have sufficient replication and metadata for the intended analysis?
  2. Are there batch effects or confounding variables?
  3. Do multiple datasets show concordant patterns, and can they be integrated?

Error Handling

| Error | Response |
|---|---|
| "No experiments found" | Broaden keywords, remove species filter, try synonyms |
| "Accession not found" | Verify format, check if withdrawn |
| "Files not available" | Note: "Data files restricted by submitter" |
| "API timeout" | Retry once, note "(metadata retrieval incomplete)" |
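The "retry once" rule for timeouts can be sketched as a small wrapper. How the tools actually surface a timeout is not stated here, so the default check for Python's built-in `TimeoutError` is an assumption, and `call_with_one_retry` is a hypothetical helper.

```python
from typing import Any, Callable, Tuple

INCOMPLETE_NOTE = "(metadata retrieval incomplete)"


def call_with_one_retry(
    call: Callable[[], Any],
    is_timeout: Callable[[Exception], bool] = lambda exc: isinstance(exc, TimeoutError),
) -> Tuple[Any, str]:
    """Try the call, retry once on timeout; return (result, note for the report)."""
    for _attempt in range(2):  # the first try plus exactly one retry
        try:
            return call(), ""
        except Exception as exc:
            if not is_timeout(exc):
                raise  # other errors are handled by their own table rows
    return None, INCOMPLETE_NOTE
```

The returned note is empty on success, so it can be appended to the report unconditionally.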

Tool Reference

ArrayExpress: arrayexpress_search_experiments (search), arrayexpress_get_experiment (metadata), arrayexpress_get_experiment_files (downloads), arrayexpress_get_experiment_samples (annotations)

BioStudies: biostudies_search (search), biostudies_get_study (metadata+sections), biostudies_get_study_files (files)

Additional Sources:

  • GEO_search_rnaseq_datasets / geo_search_datasets -- GEO (largest RNA-seq repo)
  • OmicsDI_search_datasets -- cross-repository aggregation (GEO+ArrayExpress+PRIDE+MassIVE)
  • GTEx_get_expression_summary -- baseline tissue expression (54 normal tissues, param: gene_symbol)
  • ENAPortal_search_studies -- sequencing studies (param: query with description="...")
  • CxGDisc_search_datasets -- single-cell datasets (needs exact disease ontology terms)
  • PubMed_search_articles -- dataset discovery via publications

Search Parameters

ArrayExpress: keywords (free text), species (scientific name), array (platform filter), limit

BioStudies: query (free text), limit
