tooluniverse-polygenic-risk-score▌
mims-harvard/tooluniverse · updated Apr 8, 2026
Build and interpret polygenic risk scores for complex diseases using genome-wide association study (GWAS) data.
Polygenic Risk Score (PRS) Builder
Build and interpret polygenic risk scores for complex diseases using genome-wide association study (GWAS) data.
Reasoning Strategy
A polygenic risk score predicts genetic risk, not disease. A high PRS means elevated risk relative to the population — it does not mean the person will develop the condition, and a low PRS does not confer immunity. PRS performance varies dramatically across ancestries: a European-derived PRS applied to a West African population can lose 50–70% of its predictive power because the underlying GWAS was trained on European allele frequencies and LD patterns. Effect sizes from discovery GWAS are subject to winner's curse (overestimation in single studies); always prefer weights from large meta-analyses or validated PGS Catalog models. PRS should always be interpreted in the context of non-genetic risk factors — for most complex diseases, environmental factors contribute as much or more than genetics.
LOOK UP DON'T GUESS: Do not assume effect sizes, allele frequencies, or which SNPs are genome-wide significant for a trait — always query GWAS Catalog (gwas_get_associations_for_trait) for actual data. Do not assume a validated PRS model exists for a trait; check PGS Catalog via PubMed search.
Overview
Use Cases:
- "Calculate my genetic risk for type 2 diabetes"
- "Build a polygenic risk score for coronary artery disease"
- "What's my genetic predisposition to Alzheimer's disease?"
- "Interpret my PRS percentile for breast cancer risk"
What This Skill Does:
- Extracts genome-wide significant variants (p < 5e-8) from GWAS Catalog
- Builds weighted PRS models using effect sizes (beta coefficients)
- Calculates individual risk scores from genotype data
- Interprets PRS as population percentiles and risk categories
What This Skill Does NOT Do:
- Diagnose disease (PRS is probabilistic, not deterministic)
- Replace clinical assessment or genetic counseling
- Account for non-genetic factors (lifestyle, environment)
- Provide treatment recommendations
Methodology
PRS Calculation Formula
A polygenic risk score is calculated as a weighted sum across genetic variants:
PRS = Σ (dosage_i × effect_size_i)
Where:
- dosage_i: Number of effect alleles at SNP i (0, 1, or 2)
- effect_size_i: Beta coefficient or log(odds ratio) from GWAS
Standardization
Raw PRS is standardized to z-scores for interpretation:
z-score = (PRS - population_mean) / population_std
This allows comparison to population distribution and percentile calculation.
Significance Thresholds
- Genome-wide significance: p < 5×10⁻⁸ (default threshold)
- This corrects for ~1 million independent tests across the genome
- Relaxed thresholds (e.g., p < 1×10⁻⁵) can include more SNPs but may add noise
Effect Size Handling
- Continuous traits (e.g., height, BMI): Beta coefficient (units of trait per allele)
- Binary traits (e.g., disease): Odds ratio converted to log-odds (beta = ln(OR))
- Missing effect sizes or non-significant SNPs are excluded
Data Sources
This skill uses ToolUniverse GWAS tools to query:
-
GWAS Catalog (EMBL-EBI)
- Curated GWAS associations, 5000+ studies
- Tools:
gwas_search_associations(param:disease_trait,size; alsogwas_get_associations_for_trait),gwas_get_snps_for_gene(param:gene_symbol),dbsnp_get_variant_by_rsid - Note:
disease_traitsearch returns associations where the trait is one of potentially several linked EFO traits. For precise filtering, use EFO IDs viaefo_traitparam.
-
Open Targets Genetics
- Integrated genetics platform with fine-mapped credible sets
- Tools:
OpenTargets_search_gwas_studies_by_disease,EnsemblVEP_annotate_hgvs(for variant consequence/frequency)
-
Variant Annotation
gnomad_search_variants+gnomad_get_variant— population allele frequencies (ancestry-specific via VEP colocated_variants)MyVariant_query_variants— CADD, SIFT, PolyPhen, ClinVar, gnomAD in one callgnomad_get_gene_constraints— gene constraint metrics (pLI, oe_lof) for target prioritization
Key Concepts
Polygenic Risk Scores (PRS)
Polygenic risk scores aggregate the effects of many genetic variants to estimate an individual's genetic predisposition to a trait or disease. Unlike Mendelian diseases caused by single mutations, complex diseases involve hundreds to thousands of variants, each with small effects.
Key Properties:
- Continuous distribution: PRS forms a bell curve in populations
- Relative risk: Compares individual to population average
- Probabilistic: High PRS doesn't guarantee disease, low PRS doesn't guarantee protection
- Ancestry-specific: PRS accuracy depends on matching GWAS and target ancestry
GWAS (Genome-Wide Association Studies)
GWAS compare allele frequencies between cases and controls (or correlate with trait values) across millions of SNPs to identify disease-associated variants.
Study Design:
- Discovery cohort: Initial identification of associations
- Replication cohort: Validation in independent samples
- Sample size: Larger studies detect smaller effects (power ∝ √N)
- Multiple testing correction: Bonferroni-type correction for ~1M tests
Effect Sizes and Odds Ratios
- Beta (β): Change in trait per copy of effect allele
- Example: β = 0.5 kg/m² means each allele increases BMI by 0.5 units
- Odds Ratio (OR): Multiplicative change in disease odds
- OR = 1.5 means 50% increased odds per allele
- Convert to beta: β = ln(OR)
Linkage Disequilibrium (LD) and Clumping
Nearby variants are often inherited together (LD). To avoid double-counting:
- LD clumping: Select independent variants (r² < 0.1 within 1 Mb windows)
- Fine-mapping: Statistical methods to identify causal variants
- This skill uses raw associations; production PRS should include LD pruning
Population Stratification
GWAS and PRS are most accurate when ancestries match:
- Population structure: Different ancestries have different allele frequencies
- Transferability: European-trained PRS perform worse in non-European populations
- Solution: Train PRS on diverse cohorts or use ancestry-matched references
Applications
Clinical Risk Assessment
PRS can stratify individuals for:
- Screening programs: Target high-risk individuals (e.g., mammography, colonoscopy)
- Prevention strategies: Lifestyle interventions for high genetic risk
- Drug response: Pharmacogenomics based on metabolism genes
Example: Khera et al. (2018) showed PRS identifies 3× more individuals at >3-fold coronary artery disease risk than monogenic mutations.
Research Applications
- Gene discovery: PRS-based phenome-wide association studies (PheWAS)
- Genetic correlation: Compare PRS across traits
- Causal inference: Mendelian randomization using PRS as instruments
- Simulation studies: Model polygenic architecture
Personal Genomics
Consumer genetic testing (23andMe, Ancestry DNA) provides raw genotypes. Users can:
- Calculate PRS for traits not reported
- Compare to published PRS models
- Understand genetic contribution vs. lifestyle factors
Caution: Personal PRS should not replace medical advice. Results may cause anxiety if not properly contextualized.
Limitations and Considerations
- Heritability gap: PRS explains only a fraction of genetic heritability (T2D: ~50% heritable, PRS explains ~10–20%). Rare variants, epistasis, and gene-environment interactions are not captured.
- Ancestry bias: European-derived PRS performance drops substantially in non-European populations. Use multi-ancestry GWAS weights when available.
- Winner's curse: Discovery effect sizes are overestimated; use meta-analysis weights or PGS Catalog validated models.
- Not diagnostic: High PRS does not guarantee disease; low PRS does not guarantee protection. Environmental factors contribute equally or more for most complex diseases.
- Actionability varies: Alzheimer's PRS has limited actionable interventions; cardiovascular PRS can guide statin or lifestyle decisions. Always consider what the person can do with the information.
- Ethical: Genetic data is permanent and familial. GINA protects employment/health insurance in the US, but not life insurance. Provide genetic counseling context.
Workflow
1. Trait Selection
Identify the disease or trait of interest:
- Use standard terminology (e.g., "type 2 diabetes" not "T2D")
- Check GWAS Catalog for availability
- Verify sufficient GWAS studies exist (n > 10,000 samples ideal)
2. Association Collection
Query GWAS databases for genome-wide significant associations:
prs = build_polygenic_risk_score(
trait="coronary artery disease",
p_threshold=5e-8, # Genome-wide significance
max_snps=1000
)
Considerations:
- P-value threshold: 5e-8 is conservative, 1e-5 includes more variants
- LD clumping: Production systems should prune correlated SNPs
- Study quality: Prefer large meta-analyses over small studies
3. Effect Size Extraction
Extract beta coefficients or odds ratios:
- Beta for continuous traits (direct use)
- OR for binary traits (convert to log-odds)
- Handle missing values (exclude or impute from meta-analysis)
4. SNP Filtering
Quality control filters:
- MAF filter: Exclude rare variants (MAF < 0.01) for robustness
- Genotype QC: Remove SNPs with high missingness (> 10%)
- Hardy-Weinberg: Exclude SNPs violating HWE (p < 1e-6)
- Ambiguous SNPs: Remove A/T and G/C SNPs (strand ambiguity)
5. Score Calculation
Calculate weighted sum of genotype dosages:
result = calculate_personal_prs(
prs_weights=prs,
genotypes=my_genotypes,
population_mean=0.0,
population_std=1.0
)
Genotype Sources:
- 23andMe raw data export
- Ancestry DNA raw data
- Whole genome sequencing (VCF files)
- SNP array data (Illumina, Affymetrix)
6. Risk Interpretation
Convert to percentiles and risk categories:
result = interpret_prs_percentile(result)
print(f"Percentile: {result.percentile:.1f}%")
print(f"Risk: {result.risk_category}")
Risk Categories:
- Low risk: < 20th percentile (genetic protection)
- Average risk: 20-80th percentile (typical genetic predisposition)
- Elevated risk: 80-95th percentile (moderately increased risk)
- High risk: > 95th percentile (substantially increased risk)
Clinical Interpretation:
- Percentiles assume normal distribution
- Relative risk vs. average (not absolute risk)
- Combine with family history, clinical risk factors
- PRS is NOT diagnostic - many high-risk individuals never develop disease
Best Practices
- Use validated PRS from PGS Catalog when available (externally validated, includes LD clumping and ancestry-specific weights)
- Match ancestries between GWAS and target population; use multi-ancestry GWAS when available
- For highly polygenic traits (height, education), relaxed p-value thresholds capture more signal; for oligogenic traits (IBD, T1D), strict thresholds are better
- Combine PRS with clinical risk scores (Framingham, QRISK) for integrated prediction
- In research: document SNP selection criteria, LD clumping parameters, and ancestry of GWAS; validate in held-out cohorts; report R² or AUC stratified by ancestry
Disclaimer
This skill is for educational and research purposes only.
- Not for clinical diagnosis or treatment decisions
- Not validated for clinical use - use PGS Catalog models for clinical-grade PRS
- Requires genetic counseling - interpretation requires expertise
- Does not account for family history, environment, or lifestyle factors
- Ancestry-specific - accuracy depends on matching GWAS ancestry
For clinical genetic testing, consult:
- Genetic counselors (certified by ABGC/ABMGG)
- Medical geneticists
- Healthcare providers with genomics training
PRS is a rapidly evolving field. Guidelines and best practices will continue to change as research progresses.
Regulatory Status:
- FDA does not currently regulate PRS (as of 2024)
- Some countries restrict direct-to-consumer genetic risk reporting
- Check local regulations before clinical implementation