Productivity

tooluniverse-phylogenetics

mims-harvard/tooluniverse · updated Apr 8, 2026

$npx skills add https://github.com/mims-harvard/tooluniverse --skill tooluniverse-phylogenetics
summary

PhyKIT, Biopython, and DendroPy for alignment/tree analysis, evolutionary metrics, and comparative genomics.

skill.md

Phylogenetics and Sequence Analysis

PhyKIT, Biopython, and DendroPy for alignment/tree analysis, evolutionary metrics, and comparative genomics.

LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first.


When to Use

FASTA/PHYLIP/Nexus/Newick files; treeness, RCV, DVMC, evolutionary rate, parsimony sites, tree length, bootstrap; group comparisons (Mann-Whitney U); tree construction (NJ/UPGMA/parsimony); Robinson-Foulds distance.

BixBench: 33 questions across bix-4, bix-11, bix-12, bix-25, bix-35, bix-38, bix-45, bix-60.

NOT for: MSA generation (MUSCLE/MAFFT), ML trees (IQ-TREE/RAxML), Bayesian (MrBayes/BEAST).


Required Packages

import numpy as np, pandas as pd
from scipy import stats
from Bio import AlignIO, Phylo, SeqIO
from phykit.services.tree.treeness import Treeness
from phykit.services.tree.total_tree_length import TotalTreeLength
from phykit.services.tree.evolutionary_rate import EvolutionaryRate
from phykit.services.tree.dvmc import DVMC
from phykit.services.tree.treeness_over_rcv import TreenessOverRCV
from phykit.services.alignment.parsimony_informative_sites import ParsimonyInformative
from phykit.services.alignment.rcv import RelativeCompositionVariability
import dendropy

Workflow Decision Tree

ALIGNMENT ANALYSIS (FASTA/PHYLIP):
  Parsimony sites → phykit_parsimony_informative()
  RCV → phykit_rcv()
  Gap % → alignment_gap_percentage()

TREE ANALYSIS (Newick):
  Treeness → phykit_treeness()
  Tree length → phykit_tree_length()
  Evolutionary rate → phykit_evolutionary_rate()
  DVMC → phykit_dvmc()
  Bootstrap → extract_bootstrap_support()

COMBINED: Treeness/RCV → phykit_treeness_over_rcv(tree, aln)

TREE CONSTRUCTION: NJ → build_nj_tree(); UPGMA → build_upgma_tree(); Parsimony → build_parsimony_tree()

GROUP COMPARISON: batch metrics → Mann-Whitney U → summary stats

TREE COMPARISON: Robinson-Foulds → robinson_foulds_distance()

Quick Reference

Metric Input Description
Treeness Newick Internal / total branch length
RCV FASTA/PHYLIP Relative Composition Variability
Treeness/RCV Both Signal quality ratio
Tree Length Newick Sum of all branch lengths
Evolutionary Rate Newick Total length / num terminals
DVMC Newick Degree of Violation of Molecular Clock
Parsimony Sites FASTA/PHYLIP Sites with >=2 chars appearing >=2 times

Common Patterns

Single Metric Across Groups

fungi_dvmc = batch_dvmc(discover_gene_files("data/fungi"))
animal_dvmc = batch_dvmc(discover_gene_files("data/animals"))
print(f"Fungi median: {np.median(list(fungi_dvmc.values())):.4f}")

Statistical Comparison

u_stat, p_value = stats.mannwhitneyu(list(g1.values()), list(g2.values()), alternative='two-sided')

Filtering + Metric

Filter by gap percentage < 5%, then compute treeness/RCV on filtered set.

Batch Processing

gene_files = discover_gene_files("data/")  # → [{gene_id, aln_file, tree_file}]
treeness_results = batch_treeness(gene_files)  # → {gene_id: value}

Answer Extraction

Pattern Method
"median X" np.median(values)
"maximum X" np.max(values)
"difference in median" abs(np.median(a) - np.median(b))
"Mann-Whitney U" stats.mannwhitneyu(a, b)[0]
"fold-change" np.median(a) / np.median(b)

Rounding: PhyKIT default 4 decimals. U stats = integer. Question wording overrides.


Interpretation

Metric Good Acceptable Poor
Treeness >0.8 0.5-0.8 <0.5
RCV <0.2 0.2-0.5 >0.5
Treeness/RCV >2.0 1.0-2.0 <1.0
Bootstrap >95% 70-95% <70%
Parsimony sites >30% 10-30% <10%

Completeness Checklist

All files identified; group structure detected; correct PhyKIT function; ALL genes processed (not sample); correct test; 4-decimal rounding; specific statistic (median/max/U/p); Mann-Whitney alternative='two-sided'.


References

references/sequence_alignment.md, references/tree_building.md, references/parsimony_analysis.md, scripts/tree_statistics.py