gget
Overview
gget is a command-line bioinformatics tool and Python package providing unified access to 20+ genomic databases and analysis methods. It covers gene information, sequence analysis, protein structures, expression data, and disease associations through a consistent interface. All gget modules work both as command-line tools and as Python functions.
Important: The databases queried by gget are continuously updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary.
Installation
Install gget in a clean virtual environment to avoid conflicts:
uv pip install gget
uv pip install --upgrade gget
import gget
Quick Start
Basic usage pattern for all modules:
gget <module> [arguments] [options]
gget.module(arguments, options)
Most modules return:
- Command-line: JSON (default) or CSV with the -csv flag
- Python: DataFrame or dictionary
Common flags across modules:
-o/--out: Save results to file
-q/--quiet: Suppress progress information
-csv: Return CSV format (command-line only)
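Because these flags behave the same way across modules, a command line can be assembled programmatically. The following is a minimal sketch, not part of gget: `build_gget_command` is a hypothetical helper that only builds the argument list from the usage pattern and flags listed above.

```python
from typing import List, Optional, Sequence


def build_gget_command(
    module: str,
    arguments: Sequence[str],
    out: Optional[str] = None,
    quiet: bool = False,
    csv: bool = False,
) -> List[str]:
    """Assemble `gget <module> [arguments] [options]` as an argument list.

    Hypothetical helper for illustration; it does not run gget.
    """
    command = ["gget", module, *arguments]
    if out is not None:
        command += ["-o", out]  # save results to a file
    if quiet:
        command.append("-q")  # suppress progress information
    if csv:
        command.append("-csv")  # CSV instead of the default JSON
    return command


command = build_gget_command(
    "search", ["-s", "human", "gaba"], out="results.csv", quiet=True, csv=True
)
print(" ".join(command))
```

On a machine where gget is installed, the resulting list could be passed to `subprocess.run(command, check=True)`.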
Module Categories
1. Reference & Gene Information
gget ref - Reference Genome Downloads
Retrieve download links and metadata for Ensembl reference genomes.
Parameters:
species: Genus_species format (e.g., 'homo_sapiens', 'mus_musculus'). Shortcuts: 'human', 'mouse'
-w/--which: Specify return types (gtf, cdna, dna, cds, cdrna, pep). Default: all
-r/--release: Ensembl release number (default: latest)
-l/--list_species: List available vertebrate species
-liv/--list_iv_species: List available invertebrate species
-ftp: Return only FTP links
-d/--download: Download files (requires curl)
Examples:
gget ref --list_species
gget ref homo_sapiens
gget ref -w gtf -d mouse
gget.ref("homo_sapiens")
gget.ref("mus_musculus", which="gtf", download=True)
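When species names come from user input or spreadsheets, it can help to normalize them to the Genus_species form before calling gget ref. This is a small sketch under stated assumptions: `normalize_species` is a hypothetical helper, and the alias table contains only the two shortcuts named in the parameter list above.

```python
# Only the shortcuts documented above; no other aliases are assumed.
SPECIES_SHORTCUTS = {"human": "homo_sapiens", "mouse": "mus_musculus"}


def normalize_species(name: str) -> str:
    """Convert e.g. 'Homo sapiens' or 'human' to 'homo_sapiens'.

    Hypothetical helper: lowercases the name, joins words with
    underscores, then expands a known shortcut if one matches.
    """
    cleaned = "_".join(name.strip().lower().split())
    return SPECIES_SHORTCUTS.get(cleaned, cleaned)


for raw in ["Homo sapiens", " Mouse ", "danio_rerio"]:
    print(repr(raw), "->", normalize_species(raw))
```

The normalized string can then be used as the species argument, for example `gget.ref(normalize_species("Homo sapiens"))`.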
gget search - Gene Search
Locate genes by name or description across species.
Parameters:
searchwords: One or more search terms (case-insensitive)
-s/--species: Target species (e.g., 'homo_sapiens', 'mouse')
-r/--release: Ensembl release number
-t/--id_type: Return 'gene' (default) or 'transcript'
-ao/--andor: 'or' (default) finds ANY searchword; 'and' requires ALL
-l/--limit: Maximum results to return
Returns: ensembl_id, gene_name, ensembl_description, ext_ref_description, biotype, URL
Examples:
gget search -s human gaba gamma-aminobutyric
gget search -s mouse -ao and pax7 transcription
gget.search(["gaba", "gamma-aminobutyric"], species="homo_sapiens")
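The difference between the 'or' and 'and' settings can be illustrated locally. The sketch below is not how gget search is implemented; `matches` is a hypothetical function that applies the same case-insensitive ANY/ALL logic to a single description string, which is also a placeholder.

```python
from typing import Iterable


def matches(text: str, searchwords: Iterable[str], andor: str = "or") -> bool:
    """Return True if the text contains ANY ('or') or ALL ('and') searchwords."""
    lowered = text.lower()
    hits = [word.lower() in lowered for word in searchwords]
    if andor == "or":
        return any(hits)
    if andor == "and":
        return all(hits)
    raise ValueError("andor must be 'or' or 'and'")


# Placeholder description, not a real database record.
description = "paired box 7 transcription factor"
words = ["pax7", "transcription"]

print(matches(description, words, andor="or"))   # True: 'transcription' is present
print(matches(description, words, andor="and"))  # False: 'pax7' is absent
```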
gget info - Gene/Transcript Information
Retrieve comprehensive gene and transcript metadata from Ensembl, UniProt, and NCBI.
Parameters:
ens_ids: One or more Ensembl IDs (also supports WormBase, Flybase IDs). Limit: ~1000 IDs
-n/--ncbi: Disable NCBI data retrieval
-u/--uniprot: Disable UniProt data retrieval
-pdb: Include PDB identifiers (increases runtime)
Returns: UniProt ID, NCBI gene ID, primary gene name, synonyms, protein names, descriptions, biotype, canonical transcript
Examples:
gget info ENSG00000034713 ENSG00000104853 ENSG00000170296
gget info ENSG00000034713 -pdb
gget.info(["ENSG00000034713", "ENSG00000104853"], pdb=True)
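Since gget info accepts roughly 1000 IDs per call, longer ID lists need to be split into batches. A minimal sketch follows, assuming a batch size of 1000 (the limit is approximate, so a smaller size may be safer); `chunked` is a hypothetical helper and the IDs are generated placeholders, not real genes.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Placeholder identifiers in Ensembl-like format, for illustration only.
ids = [f"ENSG{n:011d}" for n in range(2500)]
batches = list(chunked(ids))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

Each batch could then be passed to `gget.info(batch)` and the per-batch results combined afterwards.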
gget seq - Sequence Retrieval
Fetch nucleotide or amino acid sequences for genes and transcripts.
Parameters:
ens_ids: One or more Ensembl identifiers
-t/--translate: Fetch amino acid sequences instead of nucleotide
-iso/--isoforms: Return all transcript variants (gene IDs only)
Returns: FASTA format sequences
Examples:
gget seq ENSG00000034713 ENSG00000104853
gget seq -t -iso ENSG00000034713
gget.seq(["ENSG00000034713"], translate=True, isoforms=True)
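FASTA output saved with -o can be loaded back into Python with a few lines of code. The parser below is a generic sketch, not a gget function: `parse_fasta` is a hypothetical helper, and the headers and sequences in the example are placeholders rather than real gget seq output.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Placeholder FASTA text for illustration.
example = ">record_one\nATGGCC\nTTAA\n>record_two\nATGC\n"
records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```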
2. Sequence Analysis & Alignment
gget blast - BLAST Searches
BLAST nucleotide or amino acid sequences against standard databases.
Parameters:
sequence: Sequence string or path to FASTA/.txt file
-p/--program: blastn, blastp, blastx, tblastn, tblastx (auto-detected)
-db/--database:
- Nucleotide: nt, refseq_rna, pdbnt
- Protein: nr, swissprot, pdbaa, refseq_protein
-l/--limit: Max hits (default: 50)
-e/--expect: E-value cutoff (default: 10.0)
-lcf/--low_comp_filt: Enable low complexity filtering
-mbo/--megablast_off: Disable MegaBLAST (blastn only)
Examples:
gget blast MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
gget blast sequence.fasta -db swissprot -l 10
gget.blast("MKWMFK...", database="swissprot", limit=10)
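The program is auto-detected from the input sequence. How gget does this internally is not described here, so the following is only an assumed heuristic to show the idea: `guess_blast_program` is a hypothetical function that treats a sequence made up solely of nucleotide letters as DNA/RNA and everything else as protein.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_blast_program(sequence: str) -> str:
    """Guess 'blastn' for nucleotide input and 'blastp' for amino acid input.

    Assumed heuristic for illustration; a protein made up only of
    A, C, G, T, U or N residues would be misclassified as nucleotide.
    """
    letters = {char for char in sequence.upper() if not char.isspace()}
    if not letters:
        raise ValueError("sequence is empty")
    return "blastn" if letters <= NUCLEOTIDE_LETTERS else "blastp"


print(guess_blast_program("ATCGATCGATCGATCG"))  # blastn
print(guess_blast_program("MKWMFKEDHSLEHRCV"))  # blastp
```

When the guess would be ambiguous, passing the program explicitly with -p/--program avoids relying on detection.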
gget blat - BLAT Searches
Locate genomic positions of sequences using UCSC BLAT.
Parameters:
sequence: Sequence string or path to FASTA/.txt file
-st/--seqtype: 'DNA', 'protein', 'translated%20RNA', 'translated%20DNA' (auto-detected)
-a/--assembly: Target assembly (default: 'human'/hg38; options: 'mouse'/mm39, 'zebrafinch'/taeGut2, etc.)
Returns: genome, query size, alignment positions, matches, mismatches, alignment percentage
Examples:
gget blat ATCGATCGATCGATCG
gget blat -a mm39 ATCGATCGATCGATCG
gget.blat("ATCGATCGATCGATCG", assembly="mouse")
gget muscle - Multiple Sequence Alignment
Align multiple nucleotide or amino acid sequences using Muscle5.
Parameters:
fasta: Sequences or path to FASTA/.txt file
-s5/--super5: Use Super5 algorithm for faster processing (large datasets)
Returns: Aligned sequences in ClustalW format or aligned FASTA (.afa)
Examples:
gget muscle sequences.fasta -o aligned.afa
gget muscle large_dataset.fasta -s5
gget.muscle("sequences.fasta", save=True)
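Sequences held in memory can be written to a FASTA file before alignment. This is a generic sketch: `write_fasta` is a hypothetical helper, the record names and sequences are placeholders, and the 60-character line width is an arbitrary choice.

```python
import tempfile
from pathlib import Path
from typing import Mapping


def write_fasta(records: Mapping[str, str], path: Path, width: int = 60) -> Path:
    """Write name -> sequence pairs as FASTA, wrapping lines at `width`."""
    with path.open("w") as handle:
        for name, sequence in records.items():
            handle.write(f">{name}\n")
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")
    return path


# Placeholder sequences written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = write_fasta(
        {"seq1": "ATCGATCG", "seq2": "ATCGTTCG"}, Path(tmp) / "sequences.fasta"
    )
    content = fasta_path.read_text()

print(content)
```

A file written this way can be supplied as the fasta argument, as in `gget muscle sequences.fasta`.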
gget diamond - Local Sequence Alignment
Perform fast local protein or translated DNA alignment using DIAMOND.
Parameters:
query: Sequences (string/list) or FASTA file path
-ref/--reference: Reference sequences (string/list) or FASTA file path (required)
--sensitivity: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive (default), ultra-sensitive
--threads: CPU threads (default: 1)
--diamond_db: Save database for reuse
--translated: Enable nucleotide-to-amino acid alignment
Returns: Identity percentage, sequence lengths, match positions, gap openings, E-values, bit scores
Examples:
gget diamond GGETISAWESQME -ref reference.fasta --threads 4
gget diamond query.fasta -ref ref.fasta --diamond_db my_db.dmnd
gget.diamond("GGETISAWESQME", reference="reference.fasta", threads=4)
3. Structural & Protein Analysis
gget pdb - Protein Structures
Query RCSB Protein Data Bank for structure and metadata.
Parameters:
pdb_id: PDB identifier (e.g., '7S7U')
-r/--resource: Data type (pdb, entry, pubmed, assembly, entity types)
-i/--identifier: Assembly, entity, or chain ID
Returns: PDB format (structures) or JSON (metadata)
Examples:
gget pdb 7S7U -o 7S7U.pdb
gget pdb 7S7U -r entry
gget.pdb("7S7U", save=True)
gget alphafold - Protein Structure Prediction
Predict 3D protein structures using a simplified version of AlphaFold2.
Setup Required:
uv pip install openmm
gget setup alphafold
Parameters:
sequence: Amino acid sequence (string), multiple sequences (list), or FASTA file. Multiple sequences trigger multimer modeling
-mr/--multimer_recycles: Recycling iterations (default: 3; recommend 20 for accuracy)
-mfm/--multimer_for_monomer: Apply multimer model to single proteins
-r/--relax: AMBER relaxation for top-ranked model
plot: Python-only; generate interactive 3D visualization (default: True)
show_sidechains: Python-only; include side chains (default: True)
Returns: PDB structure file, JSON alignment error data, optional 3D visualization
Examples:
gget alphafold MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
gget alphafold sequence1.fasta -mr 20 -r
gget.alphafold("MKWMFK...", plot=True, show_sidechains=True)
gget.alphafold(["sequence1", "sequence2"], multimer_recycles=20)
gget elm - Eukaryotic Linear Motifs
Predict Eukaryotic Linear Motifs in protein sequences.
Setup Required:
gget setup elm
Parameters:
sequence: Amino acid sequence or UniProt accession
-u/--uniprot: Indicates that sequence is a UniProt accession
-e/--expand: Include protein names, organisms, references
-s/--sensitivity: DIAMOND alignment sensitivity (default: "very-sensitive")
-t/--threads: Number of threads (default: 1)
Returns: Two outputs:
- ortholog_df: Linear motifs from orthologous proteins
- regex_df: Motifs directly matched in input sequence
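The idea behind regex_df, motifs matched directly in the input sequence, can be sketched with Python's `re` module. This is an illustration only and does not reproduce gget elm: `find_motif` is a hypothetical helper, and both the pattern and the sequence are toy values, not entries from the ELM database.

```python
import re
from typing import List, Tuple


def find_motif(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [
        (match.start(), match.end(), match.group())
        for match in re.finditer(pattern, sequence)
    ]


# Toy pattern (hypothetical, not a real ELM class): R, any two residues, then S or T.
TOY_PATTERN = r"R..[ST]"
hits = find_motif("MKRAASLLRPKT", TOY_PATTERN)
print(hits)  # [(2, 6, 'RAAS'), (8, 12, 'RPKT')]
```

Positions are 0-based with an exclusive end, following Python's slicing convention.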
Examples: