scikit-bio
Overview
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
When to Use This Skill
This skill should be used when the user:
- Works with biological sequences (DNA, RNA, protein)
- Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
- Performs sequence alignments or searches for motifs
- Constructs or analyzes phylogenetic trees
- Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
- Performs ordination analysis (PCoA, CCA, RDA)
- Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
- Analyzes microbiome or community ecology data
- Works with protein embeddings from language models
- Needs to manipulate biological data tables
Core Capabilities
1. Sequence Manipulation
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
Key operations:
- Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
- Sequence slicing, concatenation, and searching
- Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
- Find motifs and patterns using regex
- Calculate distances (Hamming, k-mer based)
- Handle sequence quality scores and metadata
Common patterns:
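import skbio
# Read the first sequence in a FASTA file as DNA
seq = skbio.DNA.read('input.fasta')
rc = seq.reverse_complement()
# Transcribe DNA to RNA, then translate RNA to protein
rna = seq.transcribe()
protein = rna.translate()
# find_with_regex yields slice objects, so collect them into a list
motif_positions = list(seq.find_with_regex('ATG[ACGT]{3}'))
# Check for degenerate (ambiguous) characters and strip gap characters
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()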
Important notes:
- Use the DNA, RNA, and Protein classes for grammared sequences with validation
- Use the Sequence class for generic sequences without alphabet restrictions
- Quality scores are loaded automatically from FASTQ files into positional metadata
- Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
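To make the operations above concrete, here is a minimal plain-Python sketch of what reverse complement, transcription, regex motif search, and Hamming distance compute. It is not scikit-bio's implementation: the function names are hypothetical, it handles only unambiguous uppercase DNA, and reporting Hamming distance as a proportion of mismatched positions is an assumption to check against your installed scikit-bio version.

```python
import re

# Illustrative helpers only; names are hypothetical, not scikit-bio API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Rewrite the coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


def motif_spans(dna: str, pattern: str) -> list:
    """Return (start, end) spans of non-overlapping regex matches."""
    return [m.span() for m in re.finditer(pattern, dna)]


def hamming(a: str, b: str) -> float:
    """Proportion of positions that differ between equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


print(reverse_complement("ATGCCG"))
print(transcribe("ATGCCG"))
print(motif_spans("ATGAAATTTATGCCC", "ATG[ACGT]{3}"))
print(hamming("ACGT", "ACGA"))
```

A sketch like this is mainly useful for sanity-checking results on tiny inputs; the library classes additionally validate the alphabet and carry metadata.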
2. Sequence Alignment
Perform pairwise and multiple sequence alignments using dynamic programming algorithms.
Key capabilities:
- Global alignment (Needleman-Wunsch with semi-global variant)
- Local alignment (Smith-Waterman)
- Configurable scoring schemes (match/mismatch, gap penalties, substitution matrices)
- CIGAR string conversion
- Multiple sequence alignment storage and manipulation with TabularMSA
Common patterns:
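import skbio
from skbio.alignment import local_pairwise_align_ssw, TabularMSA
# Returns a tuple: (TabularMSA of the aligned region, score, start/end positions)
msa, score, start_end_positions = local_pairwise_align_ssw(seq1, seq2)
# Load an existing alignment from a file
msa = TabularMSA.read('alignment.fasta', constructor=skbio.DNA)
consensus = msa.consensus()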
Important notes:
- Use local_pairwise_align_ssw for local alignments (faster, SSW-based)
- Use StripedSmithWaterman for protein alignments
- Affine gap penalties recommended for biological sequences
- Can convert between scikit-bio, BioPython, and Biotite alignment formats
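As a rough illustration of the dynamic programming behind local alignment, the sketch below computes a Smith-Waterman score in plain Python. It is a simplified teaching version, not what scikit-bio runs: it uses a linear gap penalty rather than the affine penalties recommended above, returns only the score (no traceback), and the function name and scoring values are assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -3, gap: int = -5) -> int:
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

Clamping each cell at zero is what makes the alignment local: a poorly matching prefix is dropped rather than carried forward as a negative score.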
3. Phylogenetic Trees
Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
Key capabilities:
- Tree construction from distance matrices (UPGMA, WPGMA, Neighbor Joining, GME, BME)
- Tree manipulation (pruning, rerooting, traversal)
- Distance calculations (patristic, cophenetic, Robinson-Foulds)
- ASCII visualization
- Newick format I/O
Common patterns:
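from skbio import TreeNode
from skbio.tree import nj
# Read a tree from a Newick file
tree = TreeNode.read('tree.nwk')
# Or build one from a DistanceMatrix with neighbor joining
tree = nj(distance_matrix)
# Keep only the listed tips
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = list(tree.tips())
lca = tree.lowest_common_ancestor(['taxon1', 'taxon2'])
# Patristic (path-length) distance between two tips
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
# Tip-to-tip (cophenetic) distance matrix
cophenetic_matrix = tree.tip_tip_distances()
# Robinson-Foulds distance to another tree
rf_distance = tree.compare_rfd(other_tree)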
Important notes:
- Use nj() for neighbor joining (classic phylogenetic method)
- Use upgma() for UPGMA (assumes molecular clock)
- GME and BME are highly scalable for large trees
- Trees can be rooted or unrooted; some metrics require specific rooting
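For intuition about what the lowest common ancestor and patristic distance mean, the following plain-Python sketch computes both on a tiny hand-built tree. The dictionary representation, node names, and branch lengths are all made up for the example and are unrelated to how TreeNode stores trees.

```python
# Each node maps to (parent, branch length to parent); the root's parent is None.
# Topology and lengths are invented for illustration.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 3.0),
    "C": ("root", 4.0),
}


def path_to_root(tree: dict, node: str) -> dict:
    """Map the node and each of its ancestors to their distance from the node."""
    dists = {}
    total = 0.0
    while node is not None:
        dists[node] = total
        parent, length = tree[node]
        total += length
        node = parent
    return dists


def patristic(tree: dict, a: str, b: str) -> tuple:
    """Return (lowest common ancestor, summed branch length between a and b)."""
    up_a = path_to_root(tree, a)
    up_b = path_to_root(tree, b)
    shared = [n for n in up_a if n in up_b]
    lca = min(shared, key=up_a.get)  # shared ancestor closest to a
    return lca, up_a[lca] + up_b[lca]


print(patristic(TREE, "A", "B"))
print(patristic(TREE, "A", "C"))
```

The patristic distance is the sum of branch lengths on the path through the lowest common ancestor, which is why A and B (siblings under n1) are closer than A and C (joined only at the root).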
4. Diversity Analysis
Calculate alpha and beta diversity metrics for microbial ecology and community analysis.
Key capabilities:
- Alpha diversity: richness, Shannon entropy, Simpson index, Faith's PD, Pielou's evenness
- Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
- Phylogenetic diversity metrics (require tree input)
- Rarefaction and subsampling
- Integration with ordination and statistical tests
Common patterns:
from skbio.diversity import alpha_diversity, beta_diversity, get_alpha_diversity_metrics
# Non-phylogenetic alpha diversity (one value per sample)
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
# Phylogenetic alpha diversity needs a tree and the feature IDs matching its tips
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
                           tree=tree, otu_ids=feature_ids)
# Beta diversity returns a DistanceMatrix
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
                            ids=sample_ids, tree=tree, otu_ids=feature_ids)
# List the available alpha diversity metrics
print(get_alpha_diversity_metrics())
Important notes:
- Counts must be integers representing abundances, not relative frequencies
- Phylogenetic metrics (Faith's PD, UniFrac) require tree and OTU ID mapping
- Use partial_beta_diversity() for computing specific sample pairs only
- Alpha diversity returns Series, beta diversity returns DistanceMatrix
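To show what two of the metrics above measure, here is a small plain-Python sketch of Shannon entropy and Bray-Curtis dissimilarity on raw count vectors. These are the textbook formulas, not scikit-bio's code, and the function names are hypothetical. The logarithm base used for Shannon is a parameter here because library defaults can differ between versions, so treat base 2 as an assumption.

```python
import math


def shannon(counts, base: float = 2.0) -> float:
    """Shannon entropy H = -sum(p_i * log(p_i)) over non-zero proportions."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)


def bray_curtis(u, v) -> float:
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den


# Four equally abundant taxa: entropy equals log2(4)
print(shannon([10, 10, 10, 10]))
# Two samples sharing some taxa
print(bray_curtis([10, 0, 5], [5, 5, 5]))
```

Bray-Curtis ranges from 0 for identical samples to 1 for samples that share no taxa, which makes it a convenient check when a computed distance matrix looks suspicious.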
5. Ordination Methods
Reduce high-dimensional biological data to visualizable lower-dimensional spaces.
Key capabilities:
- PCoA (Principal Coordinate Analysis) from distance matrices
- CA (Correspondence Analysis) for contingency tables
- CCA (Canonical Correspondence Analysis) with environmental constraints
- RDA (Redundancy Analysis) for linear relationships
- Biplot projection for feature interpretation
Common patterns:
import skbio
from skbio.stats.ordination import pcoa, cca
# PCoA from a DistanceMatrix
pcoa_results = pcoa(distance_matrix)
pc1 = pcoa_results.samples['PC1']
pc2 = pcoa_results.samples['PC2']
# CCA of a species table constrained by environmental variables
cca_results = cca(species_matrix, environmental_matrix)
# Save and reload ordination results
pcoa_results.write('ordination.txt')
results = skbio.OrdinationResults.read('ordination.txt')
Important notes:
- PCoA works with any distance/dissimilarity matrix
- CCA reveals environmental drivers of community composition
- Ordination results include eigenvalues, proportion explained, and sample/feature coordinates
- Results integrate with plotting libraries (matplotlib, seaborn, plotly)
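PCoA starts by converting the distance matrix into a centered inner-product matrix, whose eigenvectors then give the sample coordinates. The sketch below shows only that first step (Gower double-centering) in plain Python; the eigendecomposition is omitted, the function name is hypothetical, and the three example points are invented. For Euclidean distances the centered matrix equals the inner products of the mean-centered coordinates, which the example checks.

```python
def gower_center(dist):
    """Double-center squared distances: B = -1/2 * J * D^2 * J."""
    n = len(dist)
    a = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [
        [a[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
        for i in range(n)
    ]


# Three invented samples placed on a line, with Euclidean distances between them
points = [0.0, 3.0, 4.0]
dist = [[abs(p - q) for q in points] for p in points]
centered = gower_center(dist)

# For Euclidean input, B[i][j] equals (x_i - mean) * (x_j - mean)
mean = sum(points) / len(points)
expected = [[(p - mean) * (q - mean) for q in points] for p in points]
max_error = max(
    abs(centered[i][j] - expected[i][j]) for i in range(3) for j in range(3)
)
print(max_error < 1e-9)
```

With non-Euclidean dissimilarities the centered matrix can have negative eigenvalues, which is one reason to inspect the eigenvalues and proportion explained in the ordination results.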
6. Statistical Testing
Perform hypothesis tests specific to ecological and biological data.
Key capabilities:
- PERMANOVA: test group differences using distance matrices
- ANOSIM: alternative test for group differences
- PERMDISP: test homogeneity of group dispersions
- Mantel test: correlation between distance matrices
- Bioenv: find environmental variables correlated with distances
Common patterns:
from skbio.stats.distance import permanova, anosim, mantel
permanova_results = permanova(distance_matrix, grouping, permutations=999)
print(f"p-value: {permanova_results['p-value']}")
anosim_results = anosim(distance_matrix, grouping, permutations=999)
mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
Important notes:
- Permutation tests provide non-parametric significance testing
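- Use at least 999 permutations; the smallest attainable p-value is 1/(permutations + 1), so 999 permutations resolve p-values down to 0.001
- PERMANOVA is sensitive to differences in group dispersion; pair it with PERMDISP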
- Mantel tests assess matrix correlation (e.g., geographic vs genetic distance)
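The permutation logic these tests share can be sketched in plain Python: compute a statistic on the real grouping, shuffle the group labels many times, and count how often the shuffled statistic is at least as large. The statistic below (mean between-group minus mean within-group distance) is a deliberately simple stand-in and not the pseudo-F that PERMANOVA uses; the function names, data, and seed are assumptions for illustration, and the (hits + 1) / (permutations + 1) p-value is a common convention.

```python
import random
from itertools import combinations


def separation(dist, labels) -> float:
    """Mean between-group distance minus mean within-group distance."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_p_value(dist, labels, permutations: int = 999,
                        seed: int = 0) -> float:
    """One-sided p-value from shuffling group labels."""
    rng = random.Random(seed)
    observed = separation(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if separation(dist, shuffled) >= observed:
            hits += 1
    # Counting the observed labeling itself keeps the p-value above zero
    return (hits + 1) / (permutations + 1)


# Two invented, well-separated groups of four samples on a line
positions = [0, 1, 2, 3, 10, 11, 12, 13]
dist = [[abs(p - q) for q in positions] for p in positions]
labels = ["a"] * 4 + ["b"] * 4
p_value = permutation_p_value(dist, labels)
print(p_value)
```

Because the observed labeling is counted in both numerator and denominator, the p-value can never fall below 1/(permutations + 1), which is the reason for preferring 999 or more permutations.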
7. File I/O and Format Conversion
Read and write 19+ biological file formats with automatic format detection.
Supported formats:
- Sequences: FASTA, FASTQ, GenBank, EMBL, QSeq
- Alignments: Clustal, PHYLIP, Stockholm
- Trees: Newick
- Tables: BIOM (HDF5 and JSON)
- Distances: delimited square matrices
- Analysis: BLAST+6/7, GFF3, Ordination results
- Metadata: TSV/CSV with validation
Common patterns:
import skbio
seq = skbio