| name | protein-sequence-similarity-search |
| description | Searches for homologous protein sequences using MMseqs2 (fast, default) or BLAST (comprehensive, fallback). Trigger this whenever the user provides a protein sequence or FASTA file and asks to find homologues, sequence matches, or wants to infer protein function based on sequence similarity, but not when the user wants to infer protein function based on structural similarity. |
Prerequisites
-
uv: Read the uv skill and follow its Setup instructions to ensure
uv is installed and on PATH.
-
User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
this skill directory then (1) prominently notify the user to check the terms
at https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and
https://colabfold.com, then (2) create the file recording the notification
text and timestamp.
-
.env file: Make sure the .env file exists in your home directory.
Create one if it does not exist.
-
USER_EMAIL (optional but recommended): Recommended by the EBI for
BLAST job tracking, but the skill works without it. If the variable is
missing from .env, do NOT ask the user to paste it into the chat (this
would leak the value into the agent's context). Instead, give the user this
command, substituting ENV_FILE with the resolved literal path to the
.env file:
printf "Enter contact email: " && read email && echo "USER_EMAIL=$email" >> "ENV_FILE" && echo "Saved."
The scripts load credentials automatically via dotenv. NEVER read,
print, or inspect the .env file or its variables (e.g. no cat, grep,
echo, printenv, or os.environ.get on keys). Credentials must stay out
of the agent's context.
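The User Notification prerequisite above can be sketched as a small helper. This is a minimal illustration, not part of the skill's scripts: the function name `ensure_license_notification`, the `notify` callback, and the exact wording of `NOTICE` are assumptions; only the file name and the two URLs come from the text above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed wording: the skill only requires that the notice point at these URLs.
NOTICE = (
    "Please check the terms of use at "
    "https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast and https://colabfold.com "
    "before using this skill."
)


def ensure_license_notification(skill_dir: Path, notify=print) -> bool:
    """Notify once, then record the notice. Returns True if a notice was shown."""
    marker = Path(skill_dir) / "LICENSE_NOTIFICATION.txt"
    if marker.exists():
        return False  # the user was already notified on an earlier run
    notify(NOTICE)  # step (1): show the notice prominently
    timestamp = datetime.now(timezone.utc).isoformat()
    # step (2): record the notification text and timestamp
    marker.write_text(f"{timestamp}\n{NOTICE}\n", encoding="utf-8")
    return True
```

Calling `ensure_license_notification(Path("."))` a second time in the same directory does nothing, because the marker file already exists.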
Goal
Take a user-provided amino acid sequence (or a path to a .fasta file), search
for sequence homologues using the fastest available method, generate a
Markdown-formatted table of the top hits, interpret key alignment metrics,
summarize the inferred protein functions, and save results locally for future
programmatic analysis.
Core Rules
- Strict Validation: For BLAST, only use database codes listed in the
table below.
- No Hallucinations: If a script throws an error or returns no hits,
inform the user clearly. Do NOT invent sequence homologues.
- Do Not Parse Output Files: Do not parse the JSON, a3m, or any other raw
output files. Rely on the generated .md file for your summary. The JSON
and other outputs are for subsequent tool use only.
- Always State the Method: Every report must clearly state whether the
search used the quick MMseqs2 (ColabFold API) or the slower EBI BLAST
method.
- Notification: If this skill is used, say so in the output. Explicitly
state which program (MMseqs2 or EBI BLAST) and which sequence databases
were used.
Search Method Selection
Choose the search method based on the user's request:
If the user says "quick search" or "fast search", requests no specific
method (a general homologue search), or if you are unsure: Run MMseqs2 (fast,
default) using mmseqs2_search.py
If MMseqs2 fails (exit code 2: RATELIMIT or API error) or the user explicitly
requests "BLAST" or a specific BLAST database (e.g. uniprotkb_swissprot,
pdb, uniprotkb_human): Run BLAST using uniprot_blast.py
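The selection rule above can be expressed as a short decision function. This is a sketch under stated assumptions: `choose_search_method` and `BLAST_DB_HINTS` are hypothetical names, and matching on substrings of the request is a simplification of what an agent would actually judge from context. The bare code `pdb` is deliberately left out of the hints because the word is ambiguous in free text.

```python
from typing import Optional

# Assumed hints: prefixes of the BLAST database codes listed in this skill.
BLAST_DB_HINTS = ("uniprotkb_", "uniref")


def choose_search_method(request: str, mmseqs2_exit_code: Optional[int] = None) -> str:
    """Return "mmseqs2" or "blast" following the selection rule above."""
    text = request.lower()
    if mmseqs2_exit_code == 2:
        return "blast"  # MMseqs2 failed (rate limit or API error): fall back
    if "blast" in text or any(hint in text for hint in BLAST_DB_HINTS):
        return "blast"  # explicit BLAST request or a named BLAST database
    return "mmseqs2"  # quick/fast search, no preference, or unsure
```

For example, `choose_search_method("quick search for homologues")` selects MMseqs2, while the same request with `mmseqs2_exit_code=2` selects BLAST.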
Instructions
-
Identify the query from the user. It can be a raw sequence string (e.g.,
"MKVLY...") or a path to a local file (e.g., "./data/sequence.fasta").
-
Determine the search method using the list above.
Path A: MMseqs2 Search (Default)
-
Generate File Names: Generate descriptive output file names based on the
input (e.g., proteinA_mmseqs2.json and proteinA_mmseqs2.md).
-
Execute the MMseqs2 script:
uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>
uv run scripts/mmseqs2_search.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --include-mgnify
-
The script will query the ColabFold MMseqs2 API and poll for completion.
This is typically fast (under 2 minutes).
-
If the script exits with code 2 (API failure, rate limit), automatically
fall back to BLAST (Path B below). Inform the user: "MMseqs2 search failed,
falling back to BLAST."
-
Read the Results: Open and read the generated .md file.
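The Path A steps above (file naming, command construction, and the exit-code check) can be sketched as follows. The helper names `output_names`, `mmseqs2_command`, and `should_fall_back_to_blast` are hypothetical; the script path and flags are the ones shown in the commands above.

```python
from typing import List, Tuple

MMSEQS2_SCRIPT = "scripts/mmseqs2_search.py"
FALLBACK_EXIT_CODE = 2  # API failure or rate limit, per the step above


def output_names(label: str, method: str) -> Tuple[str, str]:
    """Build descriptive .md and .json file names, e.g. proteinA_mmseqs2.md."""
    return f"{label}_{method}.md", f"{label}_{method}.json"


def mmseqs2_command(query: str, label: str, include_mgnify: bool = False) -> List[str]:
    """Assemble the argument list for the MMseqs2 script."""
    md_name, json_name = output_names(label, "mmseqs2")
    cmd = ["uv", "run", MMSEQS2_SCRIPT, query, "-o", md_name, "-j", json_name]
    if include_mgnify:
        cmd.append("--include-mgnify")
    return cmd


def should_fall_back_to_blast(returncode: int) -> bool:
    """True when the MMseqs2 script signalled an API failure or rate limit."""
    return returncode == FALLBACK_EXIT_CODE
```

A caller would pass the list to `subprocess.run(...)` and check `should_fall_back_to_blast(result.returncode)` before moving on to Path B. Passing an argument list rather than a shell string avoids quoting problems with file paths.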
Path B: BLAST Search (Explicit or Fallback)
-
Database Selection & Validation: Determine the most appropriate
database(s) based on the user's prompt.
- Consult the Available BLAST Databases table below.
- If the user specifies a taxonomic group (e.g., "Find homologues in
microbes"), select the corresponding Database Code (e.g.,
uniprotkb_bacteria).
- If the user explicitly requests curated hits, use uniprotkb_swissprot.
- If no specific database is requested, do not specify --databases.
- Validation: Ensure the database code exactly matches an entry in the
table. If the user requests a database not on the list, do not
proceed and provide the allowed list.
-
Generate File Names: (e.g., proteinA_ebi_blast.json and
proteinA_ebi_blast.md).
-
The EBI API expects a contact email address in the request header. The
script takes it from the USER_EMAIL environment variable when it is set
(see Prerequisites); the search still runs without it.
-
Execute the BLAST script:
uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json>
uv run scripts/uniprot_blast.py <SEQUENCE_OR_FILE> -o <generated-filename.md> -j <generated-filename.json> --databases <db1,db2>
-
The script will query the EBI BLAST API and poll the server. Note: This
can take up to 15 minutes; wait patiently.
-
Read the Results: Open and read the generated .md file.
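The database validation and command construction in Path B can be sketched as below. The names `ALLOWED_DATABASES` and `blast_command` are hypothetical; the codes are copied from the Available BLAST Databases list in this skill, and the script path and flags are those shown in the commands above.

```python
from typing import List, Optional, Sequence

# Database codes copied from the Available BLAST Databases list in this skill.
ALLOWED_DATABASES = frozenset({
    "uniprotkb", "uniprotkb_swissprot", "uniprotkb_swissprotsv",
    "uniprotkb_reference_proteomes", "uniprotkb_trembl",
    "uniprotkb_refprotswissprot", "uniprotkb_archaea", "uniprotkb_arthropoda",
    "uniprotkb_bacteria", "uniprotkb_complete_microbial_proteomes",
    "uniprotkb_eukaryota", "uniprotkb_fungi", "uniprotkb_human",
    "uniprotkb_mammals", "uniprotkb_nematoda", "uniprotkb_rodents",
    "uniprotkb_vertebrates", "uniprotkb_viridiplantae", "uniprotkb_viruses",
    "uniprotkb_enzyme", "uniprotkb_covid19", "uniref100", "uniref90",
    "uniref50", "pdb",
})


def blast_command(query: str, label: str,
                  databases: Optional[Sequence[str]] = None) -> List[str]:
    """Assemble the BLAST script arguments, rejecting unlisted database codes."""
    cmd = ["uv", "run", "scripts/uniprot_blast.py", query,
           "-o", f"{label}_ebi_blast.md", "-j", f"{label}_ebi_blast.json"]
    if databases:
        unknown = sorted(set(databases) - ALLOWED_DATABASES)
        if unknown:
            # Do not proceed: report the bad codes together with the allowed list.
            raise ValueError(
                f"Unsupported database(s): {', '.join(unknown)}. "
                f"Allowed: {', '.join(sorted(ALLOWED_DATABASES))}"
            )
        cmd += ["--databases", ",".join(databases)]
    return cmd
```

When no database is requested, the `--databases` flag is omitted entirely, matching the rule above.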
Common Steps (Both Methods)
- Interpret the Metrics: Summarize the top 3 to 5 sequence homologues.
Assess match quality using:
- Q-Cov (Query Coverage): High percentages mean the match covers most
of the query sequence.
- E-value: Lower is better. Very small E-values (e.g., 1e-50) mean a
match of this quality is extremely unlikely to occur by chance.
- Seq Identity: Provides evolutionary context (highly conserved vs.
distant homologue).
- Perform Functional Analysis:
- If the results table includes protein descriptions, analyze them
directly: report specific protein names/functions of the top homologues
and summarize the variety of functions, domains, or protein families
found.
- If the results contain only UniProt accession IDs without descriptions
(common with MMseqs2), look up the protein names and functions for the
top 3 to 5 hits using the uniprot-database skill or other appropriate
methods before summarizing.
- Inform the user of both newly created files (.json and .md) and their
locations.
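The metric interpretation described above can be illustrated with a small helper that turns one hit's values, as read from the generated .md table, into a short phrase. Everything here is an assumption made for illustration: the function name `describe_hit`, the percent units, and in particular the cut-offs (80% coverage, 1e-3 and 1e-50 for E-values, 30% and 70% identity), which are rough rules of thumb and not thresholds defined by this skill.

```python
def describe_hit(q_cov: float, e_value: float, identity: float) -> str:
    """Summarize one hit. q_cov and identity are assumed to be percentages."""
    # Assumed cut-off: 80% or more of the query is covered by the alignment.
    coverage = ("covers most of the query" if q_cov >= 80
                else "covers only part of the query")
    # Assumed cut-offs for statistical significance.
    if e_value <= 1e-50:
        significance = "highly significant"
    elif e_value <= 1e-3:
        significance = "significant"
    else:
        significance = "weak"
    # Assumed cut-offs for evolutionary context.
    if identity >= 70:
        relation = "highly conserved"
    elif identity >= 30:
        relation = "moderately conserved"
    else:
        relation = "distant homologue"
    return f"{relation}; {significance}; {coverage}"
```

A hit with 95% coverage, an E-value of 1e-60, and 85% identity would be described as highly conserved and highly significant, while a hit with 40% coverage, an E-value of 0.5, and 22% identity would be flagged as a weak, partial match to a distant homologue.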
Available BLAST Databases
| Database Code | Name | Description |
| --- | --- | --- |
| uniprotkb | UniProt Knowledgebase (The UniProt Knowledgebase includes UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) | The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. Search UniProtKB to retrieve "everything that is known" about a particular sequence |
| uniprotkb_swissprot | UniProtKB/Swiss-Prot (The manually annotated section of UniProtKB) | The manually curated subsection of the UniProt Knowledgebase |
| uniprotkb_swissprotsv | UniProtKB/Swiss-Prot isoforms (The manually annotated isoforms of UniProtKB/Swiss-Prot) | The isoform sequences for the manually curated subsection of the UniProt Knowledgebase |
| uniprotkb_reference_proteomes | UniProtKB Reference Proteomes | Taxonomic subset of the UniProtKB Reference Proteomes |
| uniprotkb_trembl | UniProtKB/TrEMBL (The automatically annotated section of UniProtKB) | Subsection of the UniProt Knowledgebase derived from ENA Sequence (formerly EMBL-Bank) coding sequence translations with annotation produced by an automated process |
| uniprotkb_refprotswissprot | UniProtKB Reference Proteomes plus Swiss-Prot | UniProtKB Reference Proteomes plus Swiss-Prot |
| uniprotkb_archaea | UniProtKB Archaea | Taxonomic subset of the UniProt Knowledgebase for archaea |
| uniprotkb_arthropoda | UniProtKB Arthropoda | Taxonomic subset of the UniProt Knowledgebase for arthropoda |
| uniprotkb_bacteria | UniProtKB Bacteria | Taxonomic subset of the UniProt Knowledgebase for bacteria |
| uniprotkb_complete_microbial_proteomes | UniProtKB Complete Microbial Proteomes | Taxonomic subset of the UniProt Knowledgebase for complete microbial proteomes |
| uniprotkb_eukaryota | UniProtKB Eukaryota | Taxonomic subset of the UniProt Knowledgebase for eukaryota |
| uniprotkb_fungi | UniProtKB Fungi | Taxonomic subset of the UniProt Knowledgebase for fungi |
| uniprotkb_human | UniProtKB Human | Taxonomic subset of the UniProt Knowledgebase for human |
| uniprotkb_mammals | UniProtKB Mammals | Taxonomic subset of the UniProt Knowledgebase for mammals |
| uniprotkb_nematoda | UniProtKB Nematoda | Taxonomic subset of the UniProt Knowledgebase for nematoda |
| uniprotkb_rodents | UniProtKB Rodents | Taxonomic subset of the UniProt Knowledgebase for rodents |
| uniprotkb_vertebrates | UniProtKB Vertebrates | Taxonomic subset of the UniProt Knowledgebase for vertebrates |
| uniprotkb_viridiplantae | UniProtKB Viridiplantae | Taxonomic subset of the UniProt Knowledgebase for viridiplantae |
| uniprotkb_viruses | UniProtKB Viruses | Taxonomic subset of the UniProt Knowledgebase for viruses |
| uniprotkb_enzyme | UniProtKB Enzyme | Taxonomic subset of the UniProt Knowledgebase for enzymes |
| uniprotkb_covid19 | UniProtKB COVID-19 | Taxonomic subset of the UniProt Knowledgebase for COVID-19 |
| uniref100 | UniProt Clusters 100% (UniRef100) | The UniProt Reference Clusters (UniRef) containing sequences which are 100% identical. |
| uniref90 | UniProt Clusters 90% (UniRef90) | The UniProt Reference Clusters (UniRef) containing sequences which are 90% identical. |
| uniref50 | UniProt Clusters 50% (UniRef50) | The UniProt Reference Clusters (UniRef) containing sequences which are 50% identical. |
| pdb | Protein Structure Sequences (PDBe protein structure sequences) | Protein sequences from structures described in the Brookhaven Protein Data Bank (PDB) |