VirBench is a benchmark of 120 manually verified viral sequence retrieval queries spanning 40 pathogens, created by Laura Luebbert's team at Anthropic and collaborators. It tests whether AI agents can programmatically reproduce NCBI Virus web-interface filters, scoring each result against a ground-truth count. The queries are drawn from surveillance, diagnostic assay design, and training-data construction.

How did AI agents perform on VirBench without gget virus?

Mean accuracy ranged from 16.9% (Claude Sonnet 4) to 91.3% (GPT-5.5) across Biomni OSS, Edison Analysis, Claude, and GPT agents. The same model often returned different counts on identical prompts across three runs, which is unacceptable for scientific dataset construction, where the bar is effectively 100%.

What is gget virus and how much did it improve agent accuracy?

gget virus is a deterministic command-line tool that reproduces NCBI Virus filtering across the REST, Datasets, and E-utilities APIs. When agents used it, accuracy rose to at least 90% for all systems and peaked at 99.7% for GPT-5.5, and run-to-run variability was largely eliminated, according to the arXiv preprint (2606.06749).

Why do coding agents advance faster than biological agents?

Software infrastructure offers version control, documented APIs, package managers, and testable outputs (e.g., a patch that passes CI). Biological databases like NCBI Virus were designed for browser workflows with implicit conventions, scattered APIs, and filtering logic that lives only in web UIs.

What is the Karpathy connection in Anthropic's biology agents post?

Andrej Karpathy reported that vibe-coding a web app was easy, but auth, payments, and deployment required a week of clicking through browser dashboards. Anthropic uses this as a parallel to virologists manually reproducing complex NCBI Virus filters: both are environments built for humans, not agents.

How do I install and use gget virus?

Install gget via pip (pip install gget), then run gget virus with a taxon name or ID plus filter flags. Example: gget virus 'Zaire ebolavirus' --host human --geo_location Africa. See the Pachter Lab docs at pachterlab.github.io/gget/en/virus.html.

Anthropic VirBench: Why Biological Agents Need | explainx.ai Blog

On June 8, 2026, Anthropic published "Paving the way for agents in biology"—an essay by Laura Luebbert (Broad Institute, FutureHouse) arguing that biological data infrastructure must be redesigned for agents, not just human browser clicks.

The case study is sharp: task state-of-the-art scientific agents (Claude, GPT, Biomni, Edison Analysis) with retrieving viral sequences from NCBI Virus, the database behind outbreak surveillance, diagnostic assay design, and protein model training data. Even frontier models failed reproducibility tests. Once the team added gget virus, a deterministic retrieval layer built with NCBI collaborators, accuracy rose to at least 90% for every agent and reached 99.7% for the best one.

The lesson extends far beyond virology: agents need boring, reliable tools underneath creative reasoning—the same pattern loop engineering and harness engineering teach for coding agents.

TL;DR

Question	Answer
What was tested?	VirBench — 120 queries, 40 pathogens, manually verified ground-truth counts.
Worst agent accuracy (alone)?	16.9% mean (Claude Sonnet 4). Best: 91.3% (GPT-5.5)—still not enough for science.
With gget virus?	≥90% all agents; 99.7% peak (GPT-5.5). Variability largely gone.
Why it matters now?	May 2026 Bundibugyo Ebola outbreak in DRC—genomic answers depend on correct sequence retrieval first.
Broader lesson?	Build deterministic execution layers; let models hypothesize, not reinvent pagination.
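
The "reinvent pagination" point in the last row can be made concrete. The following is a minimal sketch, assuming a hypothetical `fetch_page(token)` callable that returns a list of accession strings and a next-page token (`None` on the last page). It is not the NCBI API and not how gget virus is implemented; it only illustrates why paging, deduplication, and ordering belong in fixed code rather than in model-written one-off scripts.

```python
def fetch_all(fetch_page):
    """Collect every record from a paginated source, deterministically.

    fetch_page is a hypothetical callable: token -> (records, next_token).
    """
    seen = set()
    token = None
    while True:
        records, token = fetch_page(token)
        seen.update(records)      # deduplicate records repeated across pages
        if token is None:         # no further pages
            break
    return sorted(seen)           # stable order regardless of page order


# Fake three-page source standing in for a real API (illustrative data only).
PAGES = {
    None: (["MK3", "MK1"], "p2"),
    "p2": (["MK2", "MK1"], "p3"),
    "p3": (["MK4"], None),
}

accessions = fetch_all(lambda token: PAGES[token])
print(len(accessions), accessions)
```

Because the loop always walks to the last page and returns a sorted, deduplicated list, two runs over the same source give the same count, which is the property the benchmark checks.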

Property	Detail
Queries	120 realistic viral sequence retrievals
Pathogens	40, from broad family searches to accession lookups
Filters per query	1–9 simultaneous (median 6); up to 16 filter types
Expected counts	0 to 3,226 sequences (median 22)
Ground truth	Manually verified via NCBI Virus web interface
Use cases	Surveillance, diagnostic assay design, protein model training data
Contributors	58 queries from Sabeti Lab diagnostics team

Agent	Mean accuracy	Stability (σ=μ threshold)
Claude Sonnet 4	16.9%	Low
Biomni OSS	22.5%	Low
Edison Analysis	40.0%	Moderate
GPT-5.2-pro	67.1%	Moderate
Claude Opus 4.7	83.2%	0.93
GPT-5.5	91.3%	1.00
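
To make the two columns above more tangible, here is a small sketch of how accuracy and run-to-run stability could be scored for queries that are each run three times. The scoring rule is an assumption: a run counts as correct only if its count exactly matches the ground truth, and a query counts as stable only if all runs agree. The stability measure used in the table is not defined in this excerpt and may differ. The counts below are made up for illustration and are not VirBench data.

```python
from statistics import mean


def score_query(run_counts, expected):
    """Return (accuracy, reproducible) for one query.

    accuracy: fraction of runs whose count exactly matches `expected`.
    reproducible: True when every run returned the same count.
    """
    accuracy = mean(1.0 if c == expected else 0.0 for c in run_counts)
    reproducible = len(set(run_counts)) == 1
    return accuracy, reproducible


# Hypothetical (run counts, expected count) pairs; not real benchmark results.
examples = {
    "query_a": ([22, 22, 22], 22),  # correct and stable
    "query_b": ([22, 19, 25], 22),  # right once, different every run
    "query_c": ([30, 30, 30], 22),  # stable but wrong
}

results = {
    name: score_query(runs, expected)
    for name, (runs, expected) in examples.items()
}
mean_accuracy = mean(acc for acc, _ in results.values())
stable_fraction = mean(1.0 if rep else 0.0 for _, rep in results.values())

print(f"mean accuracy: {mean_accuracy:.1%}, stable queries: {stable_fraction:.1%}")
```

The third example shows why both numbers are needed: an agent can be perfectly consistent and still wrong, so stability alone says nothing about correctness.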

bash

pip install gget

# Example: Zaire ebolavirus with filters
gget virus "Zaire ebolavirus" \
  --host human \
  --geo_location Africa \
  --collection_date_after 2014-01-01 \
  --collection_date_before 2014-06-20 \
  --min_seq_length 15200 \
  --max_n 1900
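
When an agent or pipeline issues this command, it can help to build the argument list from structured values instead of pasting strings together. The sketch below is one way to do that. The helper name `build_gget_virus_command` is hypothetical, and the only flags used are the ones shown in the example above; check the gget documentation for the flags your installed version accepts.

```python
import shlex


def build_gget_virus_command(taxon, **filters):
    """Build an argv list for `gget virus` from a taxon and filter values.

    Each keyword becomes a `--flag value` pair, in the order given.
    """
    argv = ["gget", "virus", taxon]
    for flag, value in filters.items():
        argv += [f"--{flag}", str(value)]
    return argv


argv = build_gget_virus_command(
    "Zaire ebolavirus",
    host="human",
    geo_location="Africa",
    collection_date_after="2014-01-01",
    collection_date_before="2014-06-20",
    min_seq_length=15200,
    max_n=1900,
)

# Print a shell-safe version of the command for logging or review.
print(shlex.join(argv))

# To execute it (requires gget installed and network access):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with taxon names that contain spaces, and logging the exact command makes a retrieval easy to rerun later.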

Agent	Accuracy without gget	Accuracy with gget
Claude Sonnet 4	16.9%	92.8%
Biomni OSS	22.5%	90.0%
Edison Analysis	40.0%	93.1%
GPT-5.2-pro	67.1%	98.9%
Claude Opus 4.7	83.2%	98.3%
GPT-5.5	91.3%	99.7%
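
The table's main effect is easier to see as arithmetic. This short sketch uses only the numbers in the table above to compute each agent's gain in percentage points and the gap between the best and worst agent in each condition.

```python
# (accuracy without gget, accuracy with gget), copied from the table above.
scores = {
    "Claude Sonnet 4": (16.9, 92.8),
    "Biomni OSS": (22.5, 90.0),
    "Edison Analysis": (40.0, 93.1),
    "GPT-5.2-pro": (67.1, 98.9),
    "Claude Opus 4.7": (83.2, 98.3),
    "GPT-5.5": (91.3, 99.7),
}

# Gain in percentage points for each agent.
gains = {
    agent: round(with_gget - without, 1)
    for agent, (without, with_gget) in scores.items()
}

# Gap between the best and worst agent in each condition.
without_values = [without for without, _ in scores.values()]
with_values = [with_gget for _, with_gget in scores.values()]
spread_without = round(max(without_values) - min(without_values), 1)
spread_with = round(max(with_values) - min(with_values), 1)

for agent, gain in gains.items():
    print(f"{agent}: +{gain} points")
print(f"best-to-worst gap: {spread_without} -> {spread_with} points")
```

The gap between the best and worst agent shrinks from 74.4 points to 9.7 points, and the weakest agent on its own (Claude Sonnet 4) gains 75.9 points. That is the sense in which model choice mattered less once retrieval was deterministic.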

System	Role
ToolUniverse	Tool aggregation for biomedical agents
Edison Scientific Robin	Research agent with tool harness
Biomni	General-purpose biomedical agent
gget virus	Deterministic viral sequence retrieval

Anthropic VirBench: Why Biological Agents Need Deterministic Tools Like gget virus (2026)

The hill town problem: biology wasn't built for agents

Karpathy's "click tax" — the same pain in software

Case study: NCBI Virus and the May 2026 Ebola outbreak

Bundibugyo virus, DRC, May 2026

VirBench: 120 queries, ground-truth counts, three runs each

Benchmark design

Agents tested

What happened when agents tried alone

Accuracy without gget virus

The reproducibility problem

When wrong retrieval changes biology

Failure modes

gget virus: the deterministic layer

What it coordinates

Install and basic usage

Results with gget virus: model choice mattered less

Remaining errors

The highway under the hill town

Broader ecosystem

Will better models make tools obsolete?

Implications for agent builders

1. Separate reasoning from retrieval

2. Test at 100%, not "pretty good"

3. Build agent-accessible interfaces

4. Connectors and MCP for science

5. Cheaper models + right tool > frontier model alone

Summary