| name | pydeseq2 |
| description | Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis. |
| license | MIT license |
| metadata | version: "1.0" skill-author: K-Dense Inc. |
PyDESeq2
Overview
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
When to Use This Skill
This skill should be used when:
- Analyzing bulk RNA-seq count data for differential expression
- Comparing gene expression between experimental conditions (e.g., treated vs control)
- Performing multi-factor designs accounting for batch effects or covariates
- Converting R-based DESeq2 workflows to Python
- Integrating differential expression analysis into Python-based pipelines
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"
Quick Start Workflow
For users who want to perform a standard differential expression analysis:
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True
)
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]
print(f"Found {len(significant)} significant genes")
Core Workflow Steps
Step 1: Data Preparation
Input requirements:
- Count matrix: Samples Γ genes DataFrame with non-negative integer read counts
- Metadata: Samples Γ variables DataFrame with experimental factors
Common data loading patterns:
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)
counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names)
metadata = adata.obs
Data filtering:
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
samples_to_keep = ~metadata.condition.isna()
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]
Step 2: Design Specification
The design formula specifies how gene expression is modeled.
Single-factor designs:
design = "~condition"
Multi-factor designs:
design = "~batch + condition"
design = "~age + condition"
design = "~group + condition + group:condition"
Design formula guidelines:
- Use Wilkinson formula notation (R-style)
- Put adjustment variables (e.g., batch) before the main variable of interest
- Ensure variables exist as columns in the metadata DataFrame
- Use appropriate data types (categorical for discrete variables)
Step 3: DESeq2 Fitting
Initialize the DeseqDataSet and run the complete pipeline:
from pydeseq2.dds import DeseqDataSet
dds = DeseqDataSet(
counts=counts_df,
metadata=metadata,
design="~condition",
refit_cooks=True,
n_cpus=1
)
dds.deseq2()
What deseq2() does:
- Computes size factors (normalization)
- Fits genewise dispersions
- Fits dispersion trend curve
- Computes dispersion priors
- Fits MAP dispersions (shrinkage)
- Fits log fold changes
- Calculates Cook's distances (outlier detection)
- Refits if outliers detected (optional)
Step 4: Statistical Testing
Perform Wald tests to identify differentially expressed genes:
from pydeseq2.ds import DeseqStats
ds = DeseqStats(
dds,
contrast=["condition", "treated", "control"],
alpha=0.05,
cooks_filter=True,
independent_filter=True
)
ds.summary()
Contrast specification:
- Format:
[variable, test_level, reference_level]
- Example:
["condition", "treated", "control"] tests treated vs control
- If
None, uses the last coefficient in the design
Result DataFrame columns:
baseMean: Mean normalized count across samples
log2FoldChange: Log2 fold change between conditions
lfcSE: Standard error of LFC
stat: Wald test statistic
pvalue: Raw p-value
padj: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)
Step 5: Optional LFC Shrinkage
Apply shrinkage to reduce noise in fold change estimates:
ds.lfc_shrink()
When to use LFC shrinkage:
- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up experiments
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
Step 6: Result Export
Save results and intermediate objects:
import pickle
ds.results_df.to_csv("deseq2_results.csv")
significant = ds.results_df[ds.results_df.padj < 0.05]
significant.to_csv("significant_genes.csv")
with open("dds_result.pkl", "wb") as f:
pickle.dump(dds.to_picklable_anndata(), f)
Common Analysis Patterns
Two-Group Comparison
Standard case-control comparison:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
results = ds.results_df
significant = results[results.padj < 0.05]
Multiple Comparisons
Testing multiple treatment groups against control:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()
treatments = ["treatment_A", "treatment_B", "treatment_C"]
all_results = {}
for treatment in treatments:
ds = DeseqStats(dds, contrast=["condition", treatment, "control"])
ds.summary()
all_results[treatment] = ds.results_df
sig_count = len(ds.results_df[ds.results_df.padj < 0.05])
print(f"{treatment}: {sig_count} significant genes")
Accounting for Batch Effects
Control for technical variation:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Continuous Covariates
Include continuous variables like age or dosage:
metadata["age"] = pd.to_numeric(metadata["age"])
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition")
dds.deseq2()
ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
Using the Analysis Script
This skill includes a complete command-line script for standard analyses:
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~condition" \
--contrast condition treated control \
--output results/
python scripts/run_deseq2_analysis.py \
--counts counts.csv \
--metadata metadata.csv \
--design "~batch + condition" \
--contrast condition treated control \
--output results/ \
--min-counts 10 \
--alpha 0.05 \
--n-cpus 4 \
--plots
Script features:
- Automatic data loading and validation
- Gene and sample filtering
- Complete DESeq2 pipeline execution
- Statistical testing with customizable parameters
- Result export (CSV, pickle)
- Optional visualization (volcano and MA plots)
Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.
Result Interpretation
Identifying Significant Genes
significant = ds.results_df[ds.results_df.padj < 0.05]
sig_and_large = ds.results_df[
(ds.results_df.padj < 0.05) &
(abs(ds.results_df.log2FoldChange) > 1)
]
upregulated = significant[significant.log2FoldChange > 0]
downregulated = significant[significant.log2FoldChange < 0]
print(f"Upregulated: {len(upregulated)}")
print(f"Downregulated: {len(downregulated)}")
Ranking and Sorting
top_by_padj = ds.results_df.sort_values("padj").head(20)
ds.lfc_shrink()
ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange)
top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)
ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange)
top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
Quality Metrics
print("Size factors:", dds.obsm["size_factors"])
import matplotlib.pyplot as plt
plt.hist(dds.varm["dispersions"], bins=50)
plt.xlabel("Dispersion")
plt.ylabel("Frequency")
plt.title("Dispersion Distribution")
plt.show()
plt.hist(ds.results_df.pvalue.dropna(), bins=50)
plt.xlabel("P-value")
plt.ylabel("Frequency")
plt.title("P-value Distribution")
plt.show()
Visualization Guidelines
Volcano Plot
Visualize significance vs effect size:
import matplotlib.pyplot as plt
import numpy as np
results = ds.results_df.copy()
results["-log10(padj)"] = -np.log10(results.padj)
plt.figure(figsize=(10, 6))
significant = results.padj < 0.05
plt.scatter(
results.loc[~significant, "log2FoldChange"],
results.loc[~significant, "-log10(padj)"],
alpha=0.3, s=10, c='gray', label='Not significant'
)
plt.scatter(
results.loc[significant, "log2FoldChange"],
results.loc[significant, "-log10(padj)"],
alpha=0.6, s=10, c='red', label='padj < 0.05'
)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-Log10(Adjusted P-value)")
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
MA Plot
Show fold change vs mean expression:
plt.figure(figsize=(10, 6))
plt.scatter(
np.log10(results.loc[~significant, "baseMean"] + 1),
results.loc[~significant, "log2FoldChange"],
alpha=0.3, s=10, c='gray'
)
plt.scatter(
np.log10(results.loc[significant, "baseMean"] + 1),
results.loc[significant, "log2FoldChange"],
alpha=0.6, s=10, c='red'
)
plt.axhline(0, color='blue', linestyle='--', alpha=0.5)
plt.xlabel("Log10(Base Mean + 1)")
plt.ylabel("Log2 Fold Change")
plt.title("MA Plot")
plt.savefig("ma_plot.png", dpi=300)
Troubleshooting Common Issues
Data Format Problems
Issue: "Index mismatch between counts and metadata"
Solution: Ensure sample names match exactly
print("Counts samples:", counts_df.index.tolist())
print("Metadata samples:", metadata.index.tolist())
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]
Issue: "All genes have zero counts"
Solution: Check if data needs transposition
print(f"Counts shape: {counts_df.shape}")
if counts_df.shape[1] < counts_df.shape[0]:
counts_df = counts_df.T
Design Matrix Issues
Issue: "Design matrix is not full rank"
Cause: Confounded variables (e.g., all treated samples in one batch)
Solution: Remove confounded variable or add interaction term
print(pd.crosstab(metadata.condition, metadata.batch))
design = "~condition"
design = "~condition + batch + condition:batch"
No Significant Genes
Diagnostics:
plt.hist(dds.varm["dispersions"], bins=50)
plt.show()
print(dds.obsm["size_factors"])
print(ds.results_df.nsmallest(20, "pvalue"))
Possible causes:
- Small effect sizes
- High biological variability
- Insufficient sample size
- Technical issues (batch effects, outliers)
Reference Documentation
For comprehensive details beyond this workflow-oriented guide:
-
API Reference (references/api_reference.md): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.
-
Workflow Guide (references/workflow_guide.md): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.
Load these references into context when users need:
- Detailed API documentation:
Read references/api_reference.md
- Comprehensive workflow examples:
Read references/workflow_guide.md
- Troubleshooting guidance:
Read references/workflow_guide.md (see Troubleshooting section)
Key Reminders
-
Data orientation matters: Count matrices typically load as genes Γ samples but need to be samples Γ genes. Always transpose with .T if needed.
-
Sample filtering: Remove samples with missing metadata before analysis to avoid errors.
-
Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.
-
Design formula order: Put adjustment variables before the variable of interest (e.g., "~batch + condition" not "~condition + batch").
-
LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
-
Result interpretation: Use padj < 0.05 for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.
-
Contrast specification: The format is [variable, test_level, reference_level] where test_level is compared against reference_level.
-
Save intermediate objects: Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.
Installation and Requirements
uv pip install pydeseq2
System requirements:
- Python 3.10-3.11
- pandas 1.4.3+
- numpy 1.23.0+
- scipy 1.11.0+
- scikit-learn 1.1.1+
- anndata 0.8.0+
Optional for visualization:
Additional Resources