scikit-bio
K-Dense-AI/scientific-agent-skills · updated Jun 4, 2026
### Scikit Bio
| Field | Value |
| --- | --- |
| name | scikit-bio |
| description | Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis. |
| license | BSD-3-Clause |
| allowed-tools | Read Write Edit Bash |
| compatibility | Requires Python 3.10+ and scikit-bio 0.7+ (uv pip install scikit-bio). NumPy 2.0+ is required. Optional matplotlib/seaborn/plotly for plotting; biom-format for BIOM tables; polars/anndata for table interoperability. |
| metadata | version: "1.1"; skill-author: K-Dense Inc. |
scikit-bio
Overview
scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.
When to Use This Skill
This skill should be used when the user:
- Works with biological sequences (DNA, RNA, protein)
- Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.)
- Performs sequence alignments or searches for motifs
- Constructs or analyzes phylogenetic trees
- Calculates diversity metrics (alpha/beta diversity, UniFrac distances)
- Performs ordination analysis (PCoA, CCA, RDA)
- Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel)
- Analyzes microbiome or community ecology data
- Works with protein embeddings from language models
- Needs to manipulate biological data tables
Core Capabilities
1. Sequence Manipulation
Work with biological sequences using specialized classes for DNA, RNA, and protein data.
Key operations:
- Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats
- Sequence slicing, concatenation, and searching
- Reverse complement, transcription (DNA→RNA), and translation (RNA→protein)
- Find motifs and patterns using regex
- Calculate distances (Hamming, k-mer based)
- Handle sequence quality scores and metadata
Common patterns:
import skbio
# Read sequences from file
seq = skbio.DNA.read('input.fasta')
# Sequence operations
rc = seq.reverse_complement()
rna = seq.transcribe()
protein = rna.translate()
# Find motifs (find_with_regex yields one slice per capture group, so parenthesize the pattern)
motif_positions = list(seq.find_with_regex('(ATG[ACGT]{3})'))
# Check for properties
has_degens = seq.has_degenerates()
seq_no_gaps = seq.degap()
Important notes:
- Use the `DNA`, `RNA`, and `Protein` classes for grammared sequences with validation
- Use the `Sequence` class for generic sequences without alphabet restrictions
- Quality scores are automatically loaded from FASTQ files into positional metadata
- Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features)
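To make the operations listed above concrete, here is a minimal plain-Python sketch of what reverse complement, transcription, and a Hamming-style distance compute. It is an illustration only, not scikit-bio's implementation: the helper names are hypothetical, and the `DNA` class additionally handles validation, degenerate bases, and metadata.

```python
# Plain-Python illustration; helper names are hypothetical, not scikit-bio API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def transcribe(seq: str) -> str:
    """DNA -> RNA: replace thymine (T) with uracil (U)."""
    return seq.replace("T", "U")


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


dna = "ATGGCC"
rc = reverse_complement(dna)             # "GGCCAT"
rna = transcribe(dna)                    # "AUGGCC"
dist = hamming_fraction("ACGT", "ACGA")  # 0.25 (1 of 4 positions differs)
```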
2. Sequence Alignment
Perform pairwise and multiple sequence alignments using the pair_align engine (introduced in scikit-bio 0.7.0), a versatile and efficient dynamic-programming aligner.
Key capabilities:
- Global, local, and semi-global alignment (free ends configurable) in one function
- Convenience wrappers `pair_align_nucl` (BLASTN-like) and `pair_align_prot` (BLASTP-like)
- Configurable scoring: match/mismatch tuple or named substitution matrix; linear or affine gap penalties
- `PairAlignPath` results carry CIGAR strings and convert to aligned sequences
- Multiple sequence alignment storage and manipulation with `TabularMSA`
Common patterns:
from skbio import DNA, Protein
from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, TabularMSA
# Nucleotide alignment with BLASTN-like defaults
seq1, seq2 = DNA('ACTACCAGATTACTTACGGATCAGG'), DNA('CGAAACTACTAGATTACGGATCTTA')
aln = pair_align_nucl(seq1, seq2)
aln.score # alignment score (float)
path = aln.paths[0] # PairAlignPath (repr shows CIGAR)
aligned_seqs = path.to_aligned((seq1, seq2)) # list of gapped strings
# Build a TabularMSA from the alignment path + original sequences
msa = TabularMSA.from_path_seqs(path, (seq1, seq2))
# Customize the algorithm via pair_align (default mode='global')
aln = pair_align(seq1, seq2, mode='local') # Smith-Waterman
aln = pair_align(seq1, seq2, sub_score=(2, -3), gap_cost=(5, 2)) # affine gaps
aln = pair_align(seq1, seq2, sub_score='NUC.4.4', gap_cost=3) # substitution matrix, linear gap
# Protein alignment (BLASTP-like, BLOSUM62)
aln = pair_align_prot(Protein('HEAGAWGHEE'), Protein('PAWHEAE'))
# Read a multiple alignment from file and summarize
msa = TabularMSA.read('alignment.fasta', constructor=DNA)
consensus = msa.consensus()
Important notes:
- `pair_align` replaces the removed SSW wrapper (`local_pairwise_align_ssw`, `StripedSmithWaterman`) and the deprecated pure-Python aligners (`global_pairwise_align`, `local_pairwise_align_nucleotide`, etc.)
- The result is a `PairAlignResult` that also unpacks as `score, paths, matrices` (use `keep_matrices=True` to retain the DP matrix)
- `sub_score` accepts a `(match, mismatch)` tuple or a matrix name (e.g., `'NUC.4.4'`, `'BLOSUM62'`); `gap_cost` accepts a single number (linear) or an `(open, extend)` tuple (affine)
- Parse external CIGAR strings with `PairAlignPath.from_cigar('1I8M2D5M2I')`; score an existing alignment with `align_score(...)` and build a distance matrix from an MSA with `align_dists(...)`
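For readers unfamiliar with CIGAR strings, the sketch below decodes the example string from the notes above in plain Python. It follows the SAM convention for which operations consume query and reference positions; the function names are hypothetical, and it does not reproduce how `PairAlignPath` stores a path internally.

```python
import re


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (run_length, operation) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject input containing anything other than <count><op> runs.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def consumed_lengths(cigar: str) -> tuple[int, int]:
    """Return (query_length, reference_length) covered by the alignment."""
    query = ref = 0
    for n, op in parse_cigar(cigar):
        if op in "MIS=X":  # operations that consume query positions
            query += n
        if op in "MDN=X":  # operations that consume reference positions
            ref += n
    return query, ref


ops = parse_cigar("1I8M2D5M2I")
query_len, ref_len = consumed_lengths("1I8M2D5M2I")  # (16, 15)
```

Insertions (`I`) add to the query length only and deletions (`D`) to the reference length only, which is why the two totals differ here.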
3. Phylogenetic Trees
Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships.
Key capabilities:
- Tree construction from distance matrices (UPGMA/WPGMA, Neighbor Joining, GME, BME)
- Tree rearrangement with nearest neighbor interchange (`nni`)
- Tree manipulation (pruning, rerooting, traversal)
- Distance calculations (patristic via `cophenet`, Robinson-Foulds via `compare_rfd`)
- ASCII visualization
- Newick format I/O
Common patterns:
from skbio import TreeNode
from skbio.tree import nj, upgma, gme, bme, rf_dists
# Read tree from file
tree = TreeNode.read('tree.nwk')
# Construct tree from distance matrix
tree = nj(distance_matrix)
# Tree operations
subtree = tree.shear(['taxon1', 'taxon2', 'taxon3'])
tips = [node for node in tree.tips()]
lca = tree.lca(['taxon1', 'taxon2'])
# Calculate distances
patristic_dist = tree.find('taxon1').distance(tree.find('taxon2'))
cophenetic_dm = tree.cophenet() # patristic distance matrix among tips
# Compare two trees (Robinson-Foulds)
rf_distance = tree.compare_rfd(other_tree)
# Pairwise RF distances among many trees -> DistanceMatrix
rf_dm = rf_dists([tree, other_tree, third_tree])
Important notes:
- Use `nj()` for neighbor joining (classic phylogenetic method)
- Use `upgma()` for UPGMA/WPGMA (assumes molecular clock)
- GME and BME are highly scalable for large trees; refine topology with `nni()`
- `cophenet()` (formerly `tip_tip_distances`) returns the patristic distance matrix; `compare_rfd()` is the Robinson-Foulds method (`compare_wrfd`/`compare_cophenet` for weighted/cophenetic variants)
- `lca()` is the lowest common ancestor; `lowest_common_ancestor` remains as an alias
- Trees can be rooted or unrooted; some metrics require specific rooting
4. Diversity Analysis
Calculate alpha and beta diversity metrics for microbial ecology and community analysis.
Key capabilities:
- Alpha diversity: richness (`sobs`, `observed_features`, `chao1`, `ace`), Shannon, Simpson, Hill numbers (`hill`), Faith's PD (`faith_pd`), generalized PD (`phydiv`), Pielou's evenness
- Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances
- Phylogenetic diversity metrics (require tree input)
- Rarefaction and subsampling
- Integration with ordination and statistical tests
Common patterns:
from skbio.diversity import alpha_diversity, beta_diversity
# Alpha diversity (phylogenetic metrics take taxa= for tip-name mapping)
alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids)
faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids,
                           tree=tree, taxa=feature_ids)
# Beta diversity
bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids)
unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix,
                            ids=sample_ids, tree=tree, taxa=feature_ids)
# Get available metrics
from skbio.diversity import get_alpha_diversity_metrics
print(get_alpha_diversity_metrics())
Important notes:
- Counts must be integers representing abundances, not relative frequencies
- The phylogenetic-metric argument is `taxa=` (renamed from `otu_ids` in 0.6.0; the old name is a deprecated alias); `observed_otus` is now `observed_features` (or `sobs`)
- `counts_matrix` may be any table-like input (NumPy array, pandas/polars DataFrame, BIOM `Table`, or AnnData) via the dispatch system
- Phylogenetic metrics (Faith's PD, UniFrac) require tree and taxa-to-tip mapping
- Use `partial_beta_diversity()` for specific sample pairs, or `block_beta_diversity()` for large block-decomposed calculations
- Alpha diversity returns a `pandas.Series`, beta diversity returns a `DistanceMatrix`
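To show what two of these metrics measure, the following plain-Python sketch computes the Shannon index and the Bray-Curtis dissimilarity from raw counts. The function names and toy counts are illustrative assumptions; the logarithm base is left as an explicit parameter because the default used by a given scikit-bio version is not stated here.

```python
import math


def shannon(counts, base=math.e):
    """Shannon index H = -sum(p_i * log(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


sample_a = [10, 10, 0, 0]
sample_b = [0, 10, 10, 0]
h_a = shannon(sample_a, base=2)       # two equally abundant taxa -> 1.0 bit
bc = bray_curtis(sample_a, sample_b)  # 20 / 40 = 0.5
```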
5. Ordination Methods
Reduce high-dimensional biological data to visualizable lower-dimensional spaces.
Key capabilities:
- PCoA (Principal Coordinate Analysis) from distance matrices
- CA (Correspondence Analysis) for contingency tables
- CCA (Canonical Correspondence Analysis) with environmental constraints
- RDA (Redundancy Analysis) for linear relationships
- Biplot projection for feature interpretation
Common patterns:
from skbio.stats.ordination import pcoa, cca
import skbio
# PCoA from distance matrix (limit dimensions for large matrices)
pcoa_results = pcoa(distance_matrix, dimensions=3)
pc1 = pcoa_results.samples['PC1']
pc2 = pcoa_results.samples['PC2']
# Built-in scatter plot colored by a metadata column
fig = pcoa_results.plot(sample_metadata, column='bodysite')
# CCA with environmental variables
cca_results = cca(species_matrix, environmental_matrix)
# Save/load ordination results
pcoa_results.write('ordination.txt')
results = skbio.OrdinationResults.read('ordination.txt')
Important notes:
- PCoA works with any distance/dissimilarity matrix; pass `dimensions` as an int (count) or a float in (0, 1] (fraction of cumulative variance to retain)
- `OrdinationResults` exposes pandas-based attributes: `samples`, `features`, `eigvals`, `proportion_explained`, `biplot_scores`, `sample_constraints`
- CCA reveals environmental drivers of community composition
- `OrdinationResults.plot()` produces a matplotlib figure; results also integrate with seaborn/plotly
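The int-versus-float meaning of `dimensions` can be sketched as follows. This is a hedged illustration of the idea only: it assumes a float threshold keeps the smallest number of axes whose cumulative proportion explained reaches it, which may differ in detail from scikit-bio's own rule, and `n_axes_to_keep` is a hypothetical helper.

```python
from itertools import accumulate


def n_axes_to_keep(proportion_explained, dimensions):
    """Resolve a `dimensions` value to a number of ordination axes."""
    if isinstance(dimensions, int):
        # An int is a plain axis count, capped at the number available.
        return min(dimensions, len(proportion_explained))
    if not 0 < dimensions <= 1:
        raise ValueError("a float `dimensions` must lie in (0, 1]")
    # Assumption: keep the fewest axes whose cumulative share reaches the threshold.
    for n, cumulative in enumerate(accumulate(proportion_explained), start=1):
        if cumulative >= dimensions:
            return n
    return len(proportion_explained)


proportions = [0.5, 0.25, 0.125, 0.125]
by_count = n_axes_to_keep(proportions, 3)       # 3 axes
by_fraction = n_axes_to_keep(proportions, 0.7)  # 2 axes (0.5 + 0.25 = 0.75)
```

Note the distinction between `1` (one axis) and `1.0` (all of the variance) under this reading.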
6. Statistical Testing
Perform hypothesis tests specific to ecological and biological data.
Key capabilities:
- PERMANOVA: test group differences using distance matrices
- ANOSIM: alternative test for group differences
- PERMDISP: test homogeneity of group dispersions
- Mantel test: correlation between distance matrices
- Bioenv: find environmental variables correlated with distances
- Differential abundance: `ancom`, `dirmult_ttest`, and `dirmult_lme` (longitudinal mixed-effects) in `skbio.stats.composition`
Common patterns:
from skbio.stats.distance import permanova, anosim, mantel
# Test if groups differ significantly
permanova_results = permanova(distance_matrix, grouping, permutations=999)
print(f"p-value: {permanova_results['p-value']}")
# ANOSIM test
anosim_results = anosim(distance_matrix, grouping, permutations=999)
# Mantel test between two distance matrices
mantel_results = mantel(dm1, dm2, method='pearson', permutations=999)
print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}")
# Differential abundance on a feature table (raw counts recommended)
from skbio.stats.composition import dirmult_ttest
da = dirmult_ttest(counts_table, grouping, treatment='caseA', reference='control')
Important notes:
- Permutation tests provide non-parametric significance testing
- Use 999+ permutations for robust p-values
- PERMANOVA is sensitive to dispersion differences; pair it with PERMDISP
- Mantel tests assess matrix correlation (e.g., geographic vs genetic distance)
- Supply differential-abundance tests with raw counts, not pre-normalized proportions, to preserve magnitude information
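The permutation logic behind these tests can be sketched in plain Python. The statistic below (mean between-group distance minus mean within-group distance) is a deliberately simple stand-in, not the pseudo-F of PERMANOVA or the R of ANOSIM, and the function names and toy data are assumptions. What carries over is the p-value formula: with 999 permutations the smallest attainable p-value is 1/1000.

```python
import random
from itertools import combinations
from statistics import mean


def between_minus_within(dist, groups):
    """Mean between-group distance minus mean within-group distance."""
    between, within = [], []
    for i, j in combinations(range(len(groups)), 2):
        (within if groups[i] == groups[j] else between).append(dist[i][j])
    return mean(between) - mean(within)


def permutation_pvalue(dist, groups, permutations=999, seed=0):
    """One-sided p-value obtained by shuffling the group labels."""
    rng = random.Random(seed)
    observed = between_minus_within(dist, groups)
    labels = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(labels)
        if between_minus_within(dist, labels) >= observed:
            hits += 1
    # The observed labelling counts as one permutation, hence the +1 terms.
    return (hits + 1) / (permutations + 1)


# Toy data: samples 0-2 lie close together, as do samples 3-5.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in points] for a in points]
groups = ["x", "x", "x", "y", "y", "y"]
p = permutation_pvalue(dist, groups)
```

With only six samples there are few distinct relabellings, so even this clean separation cannot yield a very small p-value; larger sample sizes are needed before the permutation count becomes the limiting factor.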
7. File I/O and Format Conversion
Read and write 19+ biological file formats with automatic format detection.
Supported formats:
- Sequences: FASTA, FASTQ, GenBank, EMBL, QSeq
- Alignments: Clustal, PHYLIP, Stockholm
- Trees: Newick
- Tables: BIOM (HDF5 and JSON)
- Distances: delimited square matrices
- Analysis: BLAST+6/7, GFF3, Ordination results
- Metadata: TSV/CSV with validation
Common patterns:
import skbio
# Read a file (format= is optional; when omitted, the format is sniffed from the file contents)
seq = skbio.DNA.read('file.fasta', format='fasta')
tree = skbio.TreeNode.read('tree.nwk')
# Write to file
seq.write('output.fasta', format='fasta')
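# Generator for large files (memory efficient)
for seq in skbio.io.read('large.fasta', format='fasta', constructor=skbio.DNA):
    process(seq)  # process() is a placeholder for your own per-record function
# Convert formats: pass the generator straight to write() (a list has no registered writer).
# variant='illumina1.8' assumes Phred+33 quality encoding; adjust it to match your data.
seqs = skbio.io.read('input.fastq', format='fastq', constructor=skbio.DNA, variant='illumina1.8')
skbio.io.write(seqs, format='fasta', into='output.fasta')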
Important notes:
- Use generators for large files to avoid memory issues
- Format can be auto-detected when the `into` parameter is specified
- Some objects can be written to multiple formats
- Support for stdin/stdout piping with `verify=False`
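The memory benefit of generator-based reading can be illustrated with a minimal plain-Python FASTA reader that yields one record at a time. This is a sketch of the streaming pattern only; `read_fasta` is a hypothetical helper and far less thorough than scikit-bio's own FASTA parser.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the finished record
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


fasta_text = ">seq1 first record\nACGT\nACGT\n>seq2\nGGCC\n"
records = list(read_fasta(io.StringIO(fasta_text)))
```

Only the record currently being assembled is held in memory, so the same loop works on files far larger than available RAM as long as the caller does not collect the results into a list.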
8. Distance Matrices
Create and manipulate distance/dissimilarity matrices with statistical methods.
Key capabilities:
- Store symmetric (`DistanceMatrix`, hollow diagonal) or general pairwise (`PairwiseMatrix`) data
- ID-based indexing and slicing
- Integration with diversity, ordination, and statistical tests
- Read/write delimited text format
Common patterns:
from skbio import DistanceMatrix
from skbio.stats.distance import permanova
from skbio.stats.ordination import pcoa
import numpy as np
# Create from array
data = np.array([[0, 1, 2], [1, 0, 3], [2, 3, 0]])
dm = DistanceMatrix(data, ids=['A', 'B', 'C'])
# Access distances
dist_ab = dm['A', 'B']
row_a = dm['A']
# Read from file
dm = DistanceMatrix.read('distances.txt')
# Use in downstream analyses
pcoa_results = pcoa(dm)
permanova_results = permanova(dm, grouping)
Important notes:
- `DistanceMatrix` enforces symmetry and a zero (hollow) diagonal; it is a subclass of `SymmetricMatrix`
- `PairwiseMatrix` (renamed from `DissimilarityMatrix`, which is kept as a deprecated alias) allows general/asymmetric values
- IDs enable integration with metadata and biological knowledge
- Compatible with pandas, numpy, and scikit-learn
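The two constraints described above (symmetry and a hollow diagonal) amount to the following check, sketched here in plain Python on nested lists. The helper name and tolerance are assumptions for illustration; `DistanceMatrix` performs its own validation.

```python
import math


def is_symmetric_and_hollow(matrix, tol=1e-9):
    """True if the matrix is square, symmetric, and has a zero diagonal."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # not square
    hollow = all(abs(matrix[i][i]) <= tol for i in range(n))
    symmetric = all(
        math.isclose(matrix[i][j], matrix[j][i], abs_tol=tol)
        for i in range(n)
        for j in range(i + 1, n)
    )
    return hollow and symmetric


good = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
bad = [[0, 1, 2], [1, 0, 3], [2, 4, 0]]  # entry (2, 1) does not match (1, 2)
good_ok = is_symmetric_and_hollow(good)  # True
bad_ok = is_symmetric_and_hollow(bad)    # False
```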
9. Biological Tables
Work with feature tables (OTU/ASV tables) common in microbiome research.
Key capabilities:
- BIOM format I/O (HDF5 and JSON) via the native `Table` class
- Table dispatch system (0.7.0+): functions accept any `table_like` input (BIOM `Table`, pandas/polars DataFrame, NumPy array, or AnnData) without explicit conversion
- Data augmentation techniques (`phylomix`, `mixup`, `aitchison_mixup`, `compos_cutmix`)
- Sample/feature filtering and normalization
- Metadata integration
Common patterns:
from skbio import Table
from skbio.diversity import beta_diversity
# Read BIOM table
table = Table.read('table.biom')
# Access data
sample_ids = table.ids(axis='sample')
feature_ids = table.ids(axis='observation')
counts = table.matrix_data
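# Filter (Table.filter modifies the table in place by default; inplace=False returns a new table)
filtered = table.filter(sample_ids_to_keep, axis='sample', inplace=False)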
# Pass table-like objects directly to scikit-bio drivers (dispatch system)
import pandas as pd
df = pd.read_table('data.tsv', index_col=0) # samples x features
bdiv = beta_diversity('braycurtis', df) # no manual conversion needed
Important notes:
- BIOM tables are standard in QIIME 2 workflows
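- In DataFrame and array inputs, rows typically represent samples and columns represent features (OTUs/ASVs); a BIOM `Table` itself stores observations (features) as rows and samples as columns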
- Supports sparse and dense representations
- With the dispatch system, functions return the same format as their input, or a user-specified output format
10. Protein Embeddings
Work with protein language model embeddings for downstream analysis.
Key capabilities:
- Store embeddings from protein language models (ESM, ProtTrans, etc.)
- Convert embeddings to distance matrices
- Generate ordination objects for visualization
- Export to numpy/pandas for ML workflows
Common patterns:
from skbio.embedding import ProteinEmbedding, ProteinVector
# Create embedding from array
embedding = ProteinEmbedding(embedding_array, sequence_ids)
# Convert to distance matrix for analysis
dm = embedding.to_distances(metric='euclidean')
# PCoA visualization of embedding space
pcoa_results = embedding.to_ordination(metric='euclidean', method='pcoa')
# Export for machine learning
array = embedding.to_array()
df = embedding.to_dataframe()
Important notes:
- Embeddings bridge protein language models with traditional bioinformatics
- Compatible with scikit-bio's distance/ordination/statistics ecosystem
- SequenceEmbedding and ProteinEmbedding provide specialized functionality
- Useful for sequence clustering, classification, and visualization
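As a sketch of what converting embeddings to a distance matrix means, the plain-Python example below computes pairwise Euclidean distances among per-protein vectors. The IDs, the two-dimensional vectors, and `euclidean_matrix` are hypothetical; real language-model embeddings have hundreds of dimensions, and scikit-bio's embedding classes are not used here.

```python
import math

# Hypothetical per-protein embedding vectors (real ones are much longer).
vectors = {
    "protA": (0.0, 0.0),
    "protB": (3.0, 4.0),
    "protC": (6.0, 8.0),
}


def euclidean_matrix(vecs):
    """Return (ids, square matrix) of pairwise Euclidean distances."""
    ids = list(vecs)
    matrix = [[math.dist(vecs[a], vecs[b]) for b in ids] for a in ids]
    return ids, matrix


ids, dm = euclidean_matrix(vectors)
# dm[0][1] == 5.0, dm[0][2] == 10.0, dm[1][2] == 5.0
```

The resulting matrix is symmetric with a zero diagonal, which is the shape of input that ordination and permutation tests expect.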
Best Practices
Installation
uv pip install scikit-bio
Requires Python 3.10+ and NumPy 2.0+. Pre-compiled wheels are published for each release since 0.7.0, so most platforms install without a compiler. Conda users can instead run conda install -c conda-forge scikit-bio.
Performance Considerations
- Use generators for large sequence files to minimize memory usage
- For massive phylogenetic trees, prefer GME or BME over NJ
- Beta diversity calculations can be parallelized with `partial_beta_diversity()`
- BIOM format (HDF5) is more efficient than JSON for large tables
Integration with Ecosystem
- Sequences interoperate with Biopython via standard formats
- Tables integrate with pandas, polars, and AnnData
- Distance matrices compatible with scikit-learn
- Ordination results visualizable with matplotlib/seaborn/plotly
- Works seamlessly with QIIME 2 artifacts (BIOM, trees, distance matrices)
Common Workflows
- Microbiome diversity analysis: Read BIOM table → Calculate alpha/beta diversity → Ordination (PCoA) → Statistical testing (PERMANOVA)
- Phylogenetic analysis: Read sequences → Align → Build distance matrix → Construct tree → Calculate phylogenetic distances
- Sequence processing: Read FASTQ → Quality filter → Trim/clean → Find motifs → Translate → Write FASTA
- Comparative genomics: Read sequences → Pairwise alignment → Calculate distances → Build tree → Analyze clades
Reference Documentation
For detailed API information, parameter specifications, and advanced usage examples, refer to references/api_reference.md which contains comprehensive documentation on:
- Complete method signatures and parameters for all capabilities
- Extended code examples for complex workflows
- Troubleshooting common issues
- Performance optimization tips
- Integration patterns with other libraries
Additional Resources
- Official documentation: https://scikit.bio/docs/latest/
- GitHub repository: https://github.com/scikit-bio/scikit-bio
- Changelog: https://github.com/scikit-bio/scikit-bio/blob/main/CHANGELOG.md
- Reference paper: "scikit-bio: a fundamental Python library for biological omic data," Nature Methods (2025), https://www.nature.com/articles/s41592-025-02981-z
- Forum support: https://forum.qiime2.org (scikit-bio is part of the QIIME 2 ecosystem)