DiffDock: Molecular Docking with Diffusion Models
Overview
DiffDock is a diffusion-based deep learning tool for molecular docking that predicts the 3D binding poses of small-molecule ligands on protein targets. It is among the leading deep learning approaches to computational docking, a key step in structure-based drug discovery and chemical biology.
Core Capabilities:
- Predict ligand binding poses with high accuracy using deep learning
- Support protein structures (PDB files) or sequences (via ESMFold)
- Process single complexes or batch virtual screening campaigns
- Generate confidence scores to assess prediction reliability
- Handle diverse ligand inputs (SMILES, SDF, MOL2)
Key Distinction: DiffDock predicts binding poses (3D structure) and confidence (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.
When to Use This Skill
Use this skill when a request or task matches any of the following:
- "Dock this ligand to a protein" or "predict binding pose"
- "Run molecular docking" or "perform protein-ligand docking"
- "Virtual screening" or "screen compound library"
- "Where does this molecule bind?" or "predict binding site"
- Structure-based drug design or lead optimization tasks
- Tasks involving PDB files + SMILES strings or ligand structures
- Batch docking of multiple protein-ligand pairs
Installation and Environment Setup
Check Environment Status
Before proceeding with DiffDock tasks, verify the environment setup:
python scripts/setup_check.py
This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.
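As a rough illustration of what such a check involves, the sketch below tests whether the listed packages can be located. It is not the contents of scripts/setup_check.py; the import names (torch, torch_geometric, rdkit, esm) and the minimum Python version are assumptions.

```python
import importlib.util
import sys

# Import names are assumptions based on the dependencies listed above.
REQUIRED_MODULES = ["torch", "torch_geometric", "rdkit", "esm"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 8)):
    """Return a dict mapping each check name to True (found) or False (missing)."""
    # min_python is an illustrative threshold, not a documented requirement.
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec locates a top-level package without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:16s} {'OK' if ok else 'MISSING'}")
```

A package that is found may still fail to import (for example, a PyTorch build without CUDA), so this only complements the full setup check.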
Installation Options
Option 1: Conda (Recommended)
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
Option 2: Docker
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
Important Notes:
- GPU strongly recommended (10-100x speedup vs CPU)
- First run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
- Model checkpoints (~500MB) download automatically if not present
Core Workflows
Workflow 1: Single Protein-Ligand Docking
Use Case: Dock one ligand to one protein target
Input Requirements:
- Protein: PDB file OR amino acid sequence
- Ligand: SMILES string OR structure file (SDF/MOL2)
Command:
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)Oc1ccccc1C(=O)O" \
--out_dir results/single_docking/
Alternative (protein sequence):
python -m inference \
--config default_inference_args.yaml \
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
--ligand ligand.sdf \
--out_dir results/sequence_docking/
Output Structure:
results/single_docking/
├── rank_1.sdf # Top-ranked pose
├── rank_2.sdf # Second-ranked pose
├── ...
├── rank_10.sdf # 10th pose (default: 10 samples)
└── confidence_scores.txt
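When post-processing this directory, note that a plain alphabetical sort places rank_10.sdf before rank_2.sdf. A minimal sketch that orders poses numerically, assuming the rank_N.sdf naming shown above (the helper name ranked_poses is hypothetical):

```python
import re
from pathlib import Path

# Matches the file naming shown in the output structure above.
RANK_PATTERN = re.compile(r"rank_(\d+)\.sdf")


def ranked_poses(out_dir):
    """Return (rank, path) pairs for pose files in out_dir, best rank first."""
    poses = []
    for path in Path(out_dir).glob("*.sdf"):
        match = RANK_PATTERN.fullmatch(path.name)
        if match:
            poses.append((int(match.group(1)), path))
    # Sort on the integer rank so that 10 comes after 2.
    return sorted(poses, key=lambda pair: pair[0])
```

For example, `ranked_poses("results/single_docking/")[0][1]` would be the path of the top-ranked pose.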
Workflow 2: Batch Processing Multiple Complexes
Use Case: Dock multiple ligands to proteins, virtual screening campaigns
Step 1: Prepare Batch CSV
Use the provided script to create or validate batch input:
python scripts/prepare_batch_csv.py --create --output batch_input.csv
python scripts/prepare_batch_csv.py my_input.csv --validate
CSV Format:
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
Required Columns:
complex_name: Unique identifier
protein_path: PDB file path (leave empty if using sequence)
ligand_description: SMILES string or ligand file path
protein_sequence: Amino acid sequence (leave empty if using PDB)
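The column rules above can be checked before a long run. The following is a minimal sketch, not the logic of scripts/prepare_batch_csv.py; it assumes each row should supply exactly one of protein_path or protein_sequence, which is how the column descriptions read.

```python
import csv
import io

REQUIRED_COLUMNS = ["complex_name", "protein_path",
                    "ligand_description", "protein_sequence"]


def validate_batch_csv(handle):
    """Return a list of problems found; an empty list means the file looks valid."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["complex_name"] or "").strip()
        if not name:
            problems.append(f"line {line_no}: empty complex_name")
        elif name in seen:
            problems.append(f"line {line_no}: duplicate complex_name '{name}'")
        seen.add(name)
        has_path = bool((row["protein_path"] or "").strip())
        has_seq = bool((row["protein_sequence"] or "").strip())
        if has_path == has_seq:  # both given or both empty
            problems.append(
                f"line {line_no}: provide exactly one of protein_path or protein_sequence")
        if not (row["ligand_description"] or "").strip():
            problems.append(f"line {line_no}: empty ligand_description")
    return problems


if __name__ == "__main__":
    sample = ("complex_name,protein_path,ligand_description,protein_sequence\n"
              "complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,\n")
    print(validate_batch_csv(io.StringIO(sample)))
```

The function takes any open text handle, so it works on a file opened with `open(path, newline="")` as well as on in-memory strings.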
Step 2: Run Batch Docking
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv batch_input.csv \
--out_dir results/batch/ \
--batch_size 10
For Large Virtual Screening (>100 compounds):
Pre-compute protein embeddings for faster processing:
python datasets/esm_embedding_preparation.py \
--protein_ligand_csv screening_input.csv \
--out_file protein_embeddings.pt
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv screening_input.csv \
--esm_embeddings_path protein_embeddings.pt \
--out_dir results/screening/
Workflow 3: Analyzing Results
After docking completes, analyze confidence scores and rank predictions:
python scripts/analyze_results.py results/batch/
python scripts/analyze_results.py results/batch/ --top 5
python scripts/analyze_results.py results/batch/ --threshold 0.0
python scripts/analyze_results.py results/batch/ --export summary.csv
python scripts/analyze_results.py results/batch/ --best 20
The analysis script:
- Parses confidence scores from all predictions
- Classifies each prediction as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
- Ranks predictions within and across complexes
- Generates statistical summaries
- Exports results to CSV for downstream analysis
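The classification step can be sketched as below. This is an illustration of the thresholds listed above, not the code of scripts/analyze_results.py; how the boundary values 0 and -1.5 are assigned is an assumption (both are treated as Moderate here).

```python
def classify_confidence(score):
    """Map a confidence score to the High / Moderate / Low bands."""
    if score > 0:
        return "High"
    if score >= -1.5:  # assumption: the boundaries 0 and -1.5 count as Moderate
        return "Moderate"
    return "Low"


def summarize(scores):
    """Count how many scores fall into each confidence band."""
    counts = {"High": 0, "Moderate": 0, "Low": 0}
    for score in scores:
        counts[classify_confidence(score)] += 1
    return counts


if __name__ == "__main__":
    print(summarize([0.8, -0.3, -1.2, -2.4]))
```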
Confidence Score Interpretation
Understanding Scores:
| Score Range | Confidence Level | Interpretation |
|---|---|---|
| > 0 | High | Strong prediction, likely accurate |
| -1.5 to 0 | Moderate | Reasonable prediction, validate carefully |
| < -1.5 | Low | Uncertain prediction, requires validation |
Critical Notes:
- Confidence ≠ Affinity: High confidence means model certainty about structure, NOT strong binding
- Context Matters: Adjust expectations for:
- Large ligands (>500 Da): Lower confidence expected
- Multiple protein chains: May decrease confidence
- Novel protein families: May underperform
- Multiple Samples: Review top 3-5 predictions, look for consensus
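One possible way to quantify "consensus" among the top poses is to count how many lie close to the top-ranked pose. The sketch below makes several assumptions: coordinates are already extracted as lists of (x, y, z) tuples with identical atom ordering, no symmetry correction is applied, and the 2.0 Å cutoff is an illustrative choice. No alignment is performed, since all poses share the protein's coordinate frame.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must have the same, non-zero number of atoms")
    total = sum((a - b) ** 2
                for atom_a, atom_b in zip(coords_a, coords_b)
                for a, b in zip(atom_a, atom_b))
    return math.sqrt(total / len(coords_a))


def consensus_count(poses, cutoff=2.0):
    """Count how many other poses lie within `cutoff` of the first (top-ranked) pose."""
    top = poses[0]
    return sum(1 for other in poses[1:] if rmsd(top, other) <= cutoff)
```

A high count among the top 3-5 poses suggests the samples converge on one binding mode; a count of zero suggests the top pose is an outlier worth inspecting.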
For detailed guidance: Read references/confidence_and_limitations.md using the Read tool
Parameter Customization
Using Custom Configuration
Create custom configuration for specific use cases:
cp assets/custom_inference_config.yaml my_config.yaml
python -m inference \
--config my_config.yaml \
--protein_ligand_csv input.csv \
--out_dir results/
Key Parameters to Adjust
Sampling Density:
samples_per_complex: 10 → Increase to 20-40 for difficult cases
- More samples = better coverage but longer runtime
Inference Steps:
inference_steps: 20 → Increase to 25-30 for higher accuracy
- More steps = potentially better quality but slower
Temperature Parameters (control diversity):
temp_sampling_tor: 7.04 → Increase for flexible ligands (8-10)
temp_sampling_tor: 7.04 → Decrease for rigid ligands (5-6)
- Higher temperature = more diverse poses
Presets Available in Template:
- High Accuracy: More samples + steps, lower temperature
- Fast Screening: Fewer samples, faster
- Flexible Ligands: Increased torsion temperature
- Rigid Ligands: Decreased torsion temperature
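For scripted runs, parameter overrides can be assembled into a command line programmatically. The sketch below assumes that each configuration key is also accepted as a command-line flag of the same name; this document only shows that for --samples_per_complex and --batch_size, so treat other keys as unverified. The helper name build_inference_command is hypothetical.

```python
def build_inference_command(csv_path, out_dir,
                            config="default_inference_args.yaml", **overrides):
    """Build the argument list for a batch inference run with optional overrides."""
    cmd = ["python", "-m", "inference",
           "--config", config,
           "--protein_ligand_csv", csv_path,
           "--out_dir", out_dir]
    # Assumption: each override key maps to a CLI flag with the same name.
    for key, value in sorted(overrides.items()):
        cmd += [f"--{key}", str(value)]
    return cmd


if __name__ == "__main__":
    command = build_inference_command("input.csv", "results/",
                                      samples_per_complex=20, batch_size=10)
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the DiffDock directory.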
For complete parameter reference: Read references/parameters_reference.md using the Read tool
Advanced Techniques
Ensemble Docking (Protein Flexibility)
For proteins with known flexibility, dock to multiple conformations:
import pandas as pd

# Alternative conformations of the same protein
conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"  # SMILES of the ligand (aspirin)

# One row per conformation; the same ligand is docked to each
data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations),  # empty because PDB files are used
}
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
Run docking with increased sampling:
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv ensemble_input.csv \
--samples_per_complex 20 \
--out_dir results/ensemble/
Integration with Scoring Functions
DiffDock generates poses; combine with other tools for affinity:
GNINA (Fast neural network scoring):
for pose in results/*.sdf; do
gnina -r protein.pdb -l "$pose" --score_only
done
MM/GBSA (More accurate, slower):
Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization
Free Energy Calculations (Most accurate):
Use OpenMM + OpenFE or GROMACS for FEP/TI calculations
Recommended Workflow:
- DiffDock → Generate poses with confidence scores
- Visual inspection → Check structural plausibility
- GNINA or MM/GBSA → Rescore and rank by affinity
- Experimental validation → Biochemical assays
Limitations and Scope
DiffDock IS Designed For:
- Small molecule ligands (typically 100-1000 Da)
- Drug-like organic compounds
- Small peptides (<20 residues)
- Single or multi-chain proteins
DiffDock IS NOT Designed For:
- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
- Large peptides (>20 residues) → Use alternative methods
- Covalent docking → Use specialized covalent docking tools
- Binding affinity prediction → Combine with scoring functions
- Membrane proteins → Not specifically trained, use with caution
For complete limitations: Read references/confidence_and_limitations.md using the Read tool
Troubleshooting
Common Issues
Issue: Low confidence scores across all predictions
- Cause: Large/unusual ligands, unclear binding site, protein flexibility
- Solution: Increase samples_per_complex (20-40), try ensemble docking, validate protein structure
Issue: Out of memory errors
- Cause: GPU memory insufficient for batch size
- Solution: Reduce the batch size (e.g., --batch_size 2) or process fewer complexes at once
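Processing fewer complexes at once can be done by splitting the input CSV into smaller files and running inference on each in turn. A minimal sketch, in which the function name and the chunk_NNN.csv naming are hypothetical:

```python
import csv
from pathlib import Path


def split_batch_csv(input_path, chunk_size, out_dir):
    """Write chunk_000.csv, chunk_001.csv, ... with at most chunk_size data rows each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(input_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)
    paths = []
    for start in range(0, len(rows), chunk_size):
        path = out_dir / f"chunk_{start // chunk_size:03d}.csv"
        with open(path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)  # every chunk keeps the original header
            writer.writerows(rows[start:start + chunk_size])
        paths.append(path)
    return paths
```

Each returned path can then be passed to --protein_ligand_csv in a separate run, ideally with a separate --out_dir per chunk.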
Issue: Slow performance
- Cause: Running on CPU instead of GPU
- Solution: Verify CUDA with python -c "import torch; print(torch.cuda.is_available())", use GPU
Issue: Unrealistic binding poses
- Cause: Poor protein preparation, ligand too large, wrong binding site
- Solu