| name | ncbi-sequence-fetch |
| description | Retrieve protein and nucleotide sequences from NCBI databases using E-utilities. Supports direct accession lookup, CDS translation, gene+organism search, locus lookup, PubMed-linked sequences, patent protein extraction, and organism+length fallback search. Use when you need to fetch biological sequences by accession, gene name, locus tag, PubMed ID, or patent number. |
NCBI Sequence Fetch
Prerequisites
- uv: Read the uv skill and follow its Setup instructions to ensure
  uv is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
  this skill directory then (1) prominently notify the user to check the terms
  at https://www.ncbi.nlm.nih.gov/ and
  https://www.ncbi.nlm.nih.gov/home/about/policies/, then (2) create the file
  recording the notification text and timestamp.
- .env file: Make sure the .env file exists in your home directory.
  Create one if it does not exist.
- NCBI_API_KEY (optional): Raises the NCBI rate limit from 3 to 10
  requests/second. The skill works without it, but a key is recommended if the
  user plans many queries or encounters a 429 error. The user can obtain one
  for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/.
  If the variable is missing from .env, do NOT ask the user to paste it into
  the chat (this would leak the key into the agent's context). Instead, give
  the user this command, substituting ENV_FILE with the resolved literal
  path to the .env file:
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
The scripts load credentials automatically via dotenv. NEVER read,
print, or inspect the .env file or its variables (e.g. no cat, grep,
echo, printenv, or os.environ.get on keys). Credentials must stay out
of the agent's context.
Core Rules
- Use the Wrapper: ALWAYS execute the provided helper scripts to query the
  database rather than accessing the database directly. The scripts
  enforce NCBI's rate limit automatically.
- API Key Support: If NCBI_API_KEY is set in the user's environment, the
  rate limit rises automatically from 3 to 10 requests/second.
- Notification: Whenever this skill is used, state that in the
  output.
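To make the rate-limit rule concrete, here is a minimal sketch of a request throttle. It is not the helper script's actual implementation: the class name `Throttle` and the `has_api_key` flag are hypothetical, and only the 3 and 10 requests/second figures come from this document. The flag is passed in by the caller, so the sketch never reads the key itself.

```python
import time


class Throttle:
    """Space calls so that at most 3 (or 10, with an API key) start per second."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # Rates taken from this skill's prerequisites: 3/s without a key, 10/s with one.
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request may be sent; return the seconds slept."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        # Reserve the slot for the request that is about to be sent.
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)
```

A caller would invoke `wait()` immediately before each request. Because the wrapper already does the equivalent, this is only useful for understanding why the scripts may pause between calls.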
Overview
Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for
retrieving protein and nucleotide sequences. Provides 10 subcommands covering
the full range of sequence retrieval workflows:
- fetch-protein: Direct protein accession lookup (GenPept, RefSeq)
- fetch-nucleotide: Direct nucleotide accession lookup
- cds-translate: Fetch CDS and translate to protein (3 methods)
- search: Free-text search of any NCBI database
- elink: Follow cross-database links (PubMed→Protein, etc.)
- gene-protein: Search protein by gene name + organism
- locus-protein: Search protein by locus tag + organism
- pubmed-proteins: Find proteins linked to a PubMed article
- patent-search: Extract protein sequences from patents
- organism-length: Last-resort search by organism + exact AA length
Utility Scripts
scripts/ncbi_fetch.py: Single script with subcommands.
All subcommands write structured JSON output. Use --output FILE to save to a
file, or omit it to print to stdout. A human-readable summary is always printed
to stdout.
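As an illustration of consuming that JSON, the sketch below converts a parsed payload into FASTA text. It assumes the layout described under Interpreting Results (a `results` array whose entries carry `header` and `sequence`); the function name `results_to_fasta` and the fallback header `result_N` are hypothetical, and the real output may contain additional fields.

```python
def results_to_fasta(payload, width=60):
    """Render the `results` array of a parsed JSON payload as FASTA text."""
    records = []
    for i, result in enumerate(payload.get("results", []), start=1):
        sequence = result.get("sequence", "")
        if not sequence:
            continue  # skip entries that carry no sequence data
        # Fall back to a placeholder name when a result has no header (assumption).
        header = (result.get("header") or f"result_{i}").lstrip(">")
        lines = [sequence[j:j + width] for j in range(0, len(sequence), width)]
        records.append(">" + header + "\n" + "\n".join(lines))
    return "\n".join(records) + ("\n" if records else "")
```

After saving output with `--output FILE`, load the file with `json.load` and pass the resulting dictionary to `results_to_fasta`.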
1. Fetch Protein by Accession
Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.). Several
accessions can be passed in one call.
uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1
2. Fetch Nucleotide by Accession
Fetches nucleotide FASTA from NCBI by accession.
uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json
3. CDS Translate
Fetches a CDS/nucleotide accession and translates it to a protein sequence. Tries
three approaches in order:
1. NCBI's pre-translated CDS protein (fasta_cds_aa)
2. GenBank XML CDS annotation translations
3. Raw nucleotide → 6-frame ORF finding
uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043
If the accession is a genomic record (not mRNA/CDS), the tool will report
is_genomic: true so you can fall back to a homology-based approach instead.
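For readers unfamiliar with the third fallback, the following is a minimal sketch of 6-frame ORF finding. It assumes the standard genetic code (NCBI translation table 1) and ATG-initiated ORFs. The script's own rules for start codons, minimum length, and tie-breaking are not documented here, so every name in the block is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA in frame 0; codons with ambiguous bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna):
    """Yield (strand, frame, protein) for each Met-initiated ORF in all six frames."""
    dna = dna.upper().replace("U", "T")
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse_complement)):
        for frame in range(3):
            # Stop codons translate to "*", so splitting on it isolates open segments.
            # The last segment may run to the end of the sequence without a stop.
            for segment in translate(seq[frame:]).split("*"):
                start = segment.find("M")
                if start != -1:
                    yield strand, frame, segment[start:]


def best_orf(dna, target_length=None):
    """Return the longest ORF, or the one closest to target_length if given."""
    orfs = [protein for _, _, protein in six_frame_orfs(dna)]
    if not orfs:
        return None
    if target_length is None:
        return max(orfs, key=len)
    return min(orfs, key=lambda p: abs(len(p) - target_length))
```

Passing `target_length` mirrors the intent of `--target-length`: when several ORFs exist, the one nearest the expected protein length is preferred over the longest.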
4. Search Any Database
Free-text search using Entrez query syntax. Supports all NCBI databases.
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
--database protein --retmax 5 --fetch-sequences
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
--database nuccore --retmax 10
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
--database protein --fetch-sequences
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
--database protein --fetch-sequences --retmax 50
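The queries above all follow the same `value[Field]` pattern joined with boolean operators. A small helper can assemble such strings; this is a sketch, with the hypothetical names `entrez_term` and `build_query`, and it only covers simple same-operator combinations.

```python
def entrez_term(value, field):
    """Format one field-qualified term, quoting values that contain spaces."""
    value = str(value).strip()
    if " " in value and not (value.startswith('"') and value.endswith('"')):
        value = f'"{value}"'
    return f"{value}[{field}]"


def build_query(*terms, operator="AND"):
    """Join (value, field) pairs into a single Entrez query string."""
    return f" {operator} ".join(entrez_term(value, field) for value, field in terms)


# Reproduces the last example query above:
query = build_query(("Oryza sativa", "Organism"), (1043, "SLEN"))
```

The resulting string is what would be passed as the positional argument to the `search` subcommand.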
5. Cross-Database Links (elink)
Follow NCBI's cross-database links (e.g., PubMed article → linked proteins).
uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
--fetch-sequences -o /tmp/linked.json
6. Gene + Organism Search
Searches for protein sequences by gene name and organism. Searches NCBI Protein
with [Gene Name] and [Organism] qualifiers.
uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
--target-length 1043 -o /tmp/result.json
7. Locus Tag Search
Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS
translations from GenBank XML when direct protein hits aren't available.
uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
--organism "Nicotiana benthamiana" -o /tmp/result.json
8. PubMed-Linked Proteins
Finds protein sequences linked to a PubMed article. Searches NCBI Protein by
PMID, follows elink PubMed→Protein, and extracts CDS translations from linked
Nuccore records.
uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
-o /tmp/result.json
9. Patent Sequence Search
Two modes:
By patent number: fetches all protein sequences from a specific patent:
uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json
By keywords: searches NCBI Protein with the patent[Properties] filter:
uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json
[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1
is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher
SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting
the primary protein of interest.
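The convention above can be applied mechanically when a patent returns many sequences. The sketch below assumes result headers contain text of the form "Sequence N from patent ...", which is how NCBI commonly titles patent-derived protein records. That header format and the function names are assumptions, not something the script guarantees.

```python
import re

# Assumed header shape: "... Sequence 2 from patent US ..." (not guaranteed by the script).
SEQ_NO = re.compile(r"Sequence\s+(\d+)\s+from\s+patent", re.IGNORECASE)


def patent_seq_number(header):
    """Extract the SEQ ID NO from a header, or None if the pattern is absent."""
    match = SEQ_NO.search(header or "")
    return int(match.group(1)) if match else None


def pick_primary_patent_protein(results, preferred=2):
    """Prefer the result with SEQ ID NO `preferred`, else the lowest-numbered one."""
    numbered = [(patent_seq_number(r.get("header")), r) for r in results]
    numbered = [(n, r) for n, r in numbered if n is not None]
    if not numbered:
        # No recognizable numbering: fall back to the first result, if any.
        return results[0] if results else None
    for n, r in numbered:
        if n == preferred:
            return r
    return min(numbered, key=lambda pair: pair[0])[1]
```

Treat the pick as a starting point only, since the convention is described as typical rather than universal. Confirm the choice against the expected length or the patent text.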
10. Organism + Length Search
Last-resort search when only organism and expected protein length are known.
Uses NCBI's [SLEN] filter for exact length matching.
uv run scripts/ncbi_fetch.py organism-length \
--organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
-o /tmp/result.json
[!NOTE] This often returns multiple candidates. Use the JSON output headers to
identify the correct protein.
Workflow
Standard Sequence Retrieval Cascade
When trying to find a protein sequence, follow this priority order:
1. Direct accession → fetch-protein with GenPept/RefSeq accession
2. CDS translation → cds-translate with nucleotide/CDS accession
3. PubMed-linked → pubmed-proteins with PMID + gene name
4. Locus lookup → locus-protein with locus tag + organism
5. Gene + organism → gene-protein with gene name + organism
6. Patent search → patent-search with patent number or keywords
7. Organism + length → organism-length as last resort
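The cascade can be sketched as a planner that turns whatever identifiers are known into an ordered list of commands. It uses only subcommands and flags shown in this document. The function name `plan_cascade` and its parameter names are hypothetical, and it builds argument lists without executing anything.

```python
def plan_cascade(protein_acc=None, nucleotide_acc=None, pmid=None, gene=None,
                 locus_tag=None, organism=None, patent_number=None, length=None):
    """Return argv lists for ncbi_fetch.py, ordered by the cascade priority."""
    base = ["uv", "run", "scripts/ncbi_fetch.py"]
    steps = []
    if protein_acc:
        steps.append(base + ["fetch-protein", protein_acc])
    if nucleotide_acc:
        steps.append(base + ["cds-translate", nucleotide_acc])
    if pmid and gene:
        steps.append(base + ["pubmed-proteins", str(pmid), "--identifier", gene])
    if locus_tag and organism:
        steps.append(base + ["locus-protein", locus_tag, "--organism", organism])
    if gene and organism:
        steps.append(base + ["gene-protein", gene, "--organism", organism])
    if patent_number:
        steps.append(base + ["patent-search", "--patent-number", patent_number])
    if organism and length:
        steps.append(base + ["organism-length", "--organism", organism,
                             "--length", str(length)])
    return steps
```

One reasonable way to use the plan is to run the steps in order and stop at the first one whose JSON output has a non-empty `results` array.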
Interpreting Results
- All subcommands return JSON with a results array
- Each result has sequence (AA string), length, and header/metadata
- When multiple results are returned, select by:
  - Closest match to expected length (target_length)
  - Header relevance (matching gene name, "disease resistance" keywords)
  - Source priority (RefSeq > GenPept > patent)
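These three criteria can be combined into a single sort key. The sketch below is one possible reading, not the script's own logic. It assumes each result has `header` and `length` fields and infers the source from the header text (a RefSeq-style prefix such as NP_ or XP_, or the word "patent"), which is an assumption about header content.

```python
import re

REFSEQ = re.compile(r"\b[NXWY]P_\d+")  # RefSeq protein accession prefixes


def source_rank(header):
    """0 = RefSeq, 1 = GenPept (default), 2 = patent; lower is better."""
    if REFSEQ.search(header):
        return 0
    if "patent" in header.lower():
        return 2
    return 1


def rank_results(results, target_length=None, keywords=()):
    """Sort by length distance, then keyword hits in the header, then source."""
    def key(result):
        header = result.get("header", "")
        length = result.get("length", len(result.get("sequence", "")))
        distance = abs(length - target_length) if target_length is not None else 0
        hits = sum(1 for k in keywords if k.lower() in header.lower())
        return (distance, -hits, source_rank(header))
    return sorted(results, key=key)
```

The first element of the returned list is the preferred candidate. Ties across all three criteria keep their original order because `sorted` is stable.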
Reference
- NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
- Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
- Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
- Common accession formats:
  - XP_ / NP_: NCBI RefSeq protein
  - AAA to AZZ + digits: GenPept (translated GenBank)
  - MK, MN, HQ, etc. + digits: GenBank nucleotide
  - ENSG, ENST, ENSP: Ensembl (use ensembl-database skill instead)
  - Q, P, O + digits: UniProt (use uniprot-database skill instead)
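To route an identifier to the right subcommand or skill, the formats above can be approximated with regular expressions. This is a simplified sketch: the patterns are approximations of the published accession formats, the UniProt pattern covers only the O/P/Q form listed here, and the function and label names are hypothetical.

```python
import re

# Order matters: UniProt is tested before GenBank nucleotide because an
# identifier like P12345 would otherwise match the 1-letter + 5-digit pattern.
PATTERNS = [
    ("refseq_protein", re.compile(r"^[NX]P_\d+(\.\d+)?$")),
    ("ensembl", re.compile(r"^ENS[A-Z]*[GTP]\d{11}(\.\d+)?$")),
    ("uniprot", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
    ("genpept", re.compile(r"^[A-Z]{3}\d{5}(\d{2})?(\.\d+)?$")),
    ("genbank_nucleotide", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6}(\d{2})?)(\.\d+)?$")),
]


def classify_accession(identifier):
    """Return a coarse label for an identifier, or "unknown" if nothing matches."""
    candidate = identifier.strip().upper()
    for label, pattern in PATTERNS:
        if pattern.match(candidate):
            return label
    return "unknown"
```

A label of `refseq_protein` or `genpept` suggests fetch-protein, `genbank_nucleotide` suggests fetch-nucleotide or cds-translate, and `ensembl` or `uniprot` point to the other skills named above. Anything `unknown`, such as a locus tag, is better handled by the search-based subcommands.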