Anthropic VirBench: Why Biological Agents Need Deterministic Tools Like gget virus (2026)
Anthropic VirBench: coding agents beat biology agents until you add gget virus. Deterministic NCBI retrieval raised the weakest agent's accuracy from 16.9% to 92.8% and the strongest agent's from 91.3% to 99.7% on viral queries.
On June 8, 2026, Anthropic published "Paving the way for agents in biology"—an essay by Laura Luebbert (Broad Institute, FutureHouse) arguing that biological data infrastructure must be redesigned for agents, not just human browser clicks.
The case study is sharp: task state-of-the-art scientific agents (Claude, GPT, Biomni, Edison Analysis) with retrieving viral sequences from NCBI Virus—the database behind outbreak surveillance, diagnostic assay design, and protein model training data. Even frontier models failed reproducibility tests. Accuracy jumped to nearly 100% once the team added gget virus, a deterministic retrieval layer built with NCBI collaborators.
The lesson extends far beyond virology: agents need boring, reliable tools underneath creative reasoning—the same pattern loop engineering and harness engineering teach for coding agents.
The hill town problem: biology wasn't built for agents
Laura Luebbert opens with an analogy: using AI agents on today's biological data is like driving through an old Italian hill town designed before cars—beautiful, thoughtful, but full of narrow winding streets (idiosyncratic file formats, scattered databases, one-off scripts).
Software, by contrast, was built for cars:
Paved roads → version control
Clear lanes → documented APIs
Standardized signals → package managers
Fast start-to-finish travel → testable outputs (a GitHub patch that passes CI)
Coding agents advanced quickly because the infrastructure matches what agents need. Biological agents lag because retrieval and validation layers are brittle, heterogeneous, and process-dependent, and because biology offers few simple, verifiable rewards comparable to "tests pass."
The bottleneck is not only reasoning. It is the absence of widespread deterministic execution layers for querying biological data. A scientist can express intent ("find all human kinases with this domain and pull their structures"), but agents lack a dependable, repeatable path to the databases.
In biology, small retrieval errors have severe downstream consequences:
Wrong genome build → invalid coordinates
Mixing RefSeq and GenBank unintentionally
Treating partial genomes as complete
Confusing segment names in segmented viruses
Missing records due to inconsistent metadata fields
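Several of these errors can be caught mechanically before any analysis runs. The following is a minimal sketch, assuming each retrieved record is a dict with hypothetical fields `accession`, `source_db`, `completeness`, and `segment`. These field names and the placeholder accessions are illustrative only and are not NCBI's schema.

```python
from collections import Counter

# Hypothetical record layout; field names are illustrative, not NCBI's schema.
REQUIRED_FIELDS = ("accession", "source_db", "completeness")


def audit_records(records, expect_complete=True, allowed_segments=None):
    """Return a list of human-readable problems found in a retrieved record set."""
    problems = []

    for rec in records:
        label = rec.get("accession") or "<no accession>"

        # Missing metadata: a record lacking a field cannot be filtered reliably.
        missing = [field for field in REQUIRED_FIELDS if not rec.get(field)]
        if missing:
            problems.append(f"{label}: missing fields {missing}")

        # Partial genomes treated as complete.
        completeness = rec.get("completeness")
        if expect_complete and completeness and completeness != "complete":
            problems.append(f"{label}: completeness is {completeness!r}")

        # Unexpected segment names in segmented viruses.
        segment = rec.get("segment")
        if allowed_segments is not None and segment and segment not in allowed_segments:
            problems.append(f"{label}: unexpected segment {segment!r}")

    # Mixed sources: RefSeq and GenBank in one set is often unintentional.
    sources = Counter(rec["source_db"] for rec in records if rec.get("source_db"))
    if len(sources) > 1:
        problems.append(f"mixed source databases: {dict(sources)}")

    return problems


# Placeholder records, invented for illustration.
example_records = [
    {"accession": "A1", "source_db": "RefSeq", "completeness": "complete", "segment": "L"},
    {"accession": "A2", "source_db": "GenBank", "completeness": "partial", "segment": "S"},
    {"accession": "A3", "source_db": "GenBank", "completeness": "complete", "segment": "X"},
]
example_problems = audit_records(example_records, allowed_segments={"L", "M", "S"})
for problem in example_problems:
    print(problem)
```

A check like this does not fix a bad retrieval, but it turns a silent error into a visible one before the data reaches a pipeline.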
It does not matter how powerful the model is if the route depends on local knowledge hidden in a web UI.
Karpathy's "click tax" — the same pain in software
This mismatch is not unique to biology. Luebbert cites Andrej Karpathy's talk on software in the AI era: he vibe-coded a small web app quickly, then lost a week on authentication, payments, and deployment—clicking through browser dashboards.
"The code was the easiest part! Most of the work was in the browser, clicking things."
Documentation kept saying "go to this URL, click this dropdown." Karpathy's conclusion: nobody should have to do this—we must build for agents.
Anthropic's virology case study is the biological version of that complaint. Long before agents, computational biologists built partial fixes—Biopython, BioPerl, Entrez Direct, BioMart, gget—to move data out of browsers into scriptable workflows. But biological data still lives in a messy network of roads, each with its own identifiers, conventions, and degree of programmatic access.
Case study: NCBI Virus and the May 2026 Ebola outbreak
NCBI Virus aggregates viral sequence records from GenBank, RefSeq, and the international INSDC ecosystem (NCBI, ENA, DDBJ), including Pathoplexus—behind a searchable web interface.
Virology labs pass around long lists of complex filters that users manually reproduce in the browser. Exactly the workflow Karpathy described—except the stakes are public health.
Bundibugyo virus, DRC, May 2026
On May 14, 2026, INRB Kinshasa analyzed 13 blood samples and, the next day, confirmed Bundibugyo virus disease in eight of them. An Ebola outbreak was declared. By May 29, WHO reported 1,000+ confirmed and suspected cases and 200+ deaths in the DRC.
Researchers generated the first near-complete outbreak genomes, establishing a new spillover event. Public health officials need immediate answers:
How different is this virus from prior Ebola viruses?
Do existing diagnostics still detect it?
Will existing therapeutics still protect patients?
Answering these requires comparing new genomes against historical Ebola records in NCBI Virus and Pathoplexus. The first steps should be automatable. Instead, they often involve manual browser filtering, hoping the dataset is complete and correct.
Much of NCBI Virus's filtering logic lives only in the web interface. A seasoned virologist can pull SARS-CoV-2 surface glycoprotein sequences from 2025 in a few clicks. Doing the same programmatically can require a multi-hundred-line script that glues together the REST, Datasets, and E-utilities APIs: paginating, reconciling identifiers, downloading hundreds of gigabytes, then filtering locally.
Even when APIs exist, agents struggle when:
API filtering ≠ web UI semantics
Metadata fields are poorly documented
Identifiers differ across sources
"The right answer" depends on expert conventions machines must infer
VirBench: 120 queries, ground-truth counts, three runs each
The benchmark evaluated six agents:
Claude Sonnet 4 and Claude Opus 4.7 (Anthropic Messages API)
GPT-5.2-pro and GPT-5.5 (OpenAI Responses API, web search + code execution)
Biomni OSS v0.0.8 (Claude Sonnet 4 backend)
Edison Analysis (Edison client SDK)
Each query ran three independent times per agent to test reproducibility.
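A scoring loop of this shape can be sketched as follows. This is an illustration under stated assumptions, not VirBench's actual harness: it treats a run as correct only when the returned count exactly equals the ground-truth count, and the query IDs and the second query's counts are invented placeholders.

```python
def score_agent(results, ground_truth):
    """Score one agent.

    results:      {query_id: [count_run1, count_run2, count_run3]}
    ground_truth: {query_id: expected_count}
    Returns (accuracy over all runs, fraction of queries whose runs all agree).
    """
    correct = 0
    total_runs = 0
    consistent_queries = 0

    for query_id, expected in ground_truth.items():
        runs = results.get(query_id, [])
        total_runs += len(runs)
        correct += sum(1 for count in runs if count == expected)
        # A query is "consistent" when every run returned the same count,
        # whether or not that count is right.
        if runs and len(set(runs)) == 1:
            consistent_queries += 1

    accuracy = correct / total_runs if total_runs else 0.0
    agreement = consistent_queries / len(ground_truth) if ground_truth else 0.0
    return accuracy, agreement


# Placeholder data: "q1" mirrors an inconsistent agent, "q2" a consistent one.
example_truth = {"q1": 266, "q2": 40}
example_results = {"q1": [106, 15, 5], "q2": [40, 40, 40]}
example_accuracy, example_agreement = score_agent(example_results, example_truth)
print(f"accuracy={example_accuracy:.2f}, agreement={example_agreement:.2f}")
```

Tracking agreement separately from accuracy matters because an agent can be consistently wrong or inconsistently right, and the two failure types call for different fixes.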
What happened when agents tried alone
Performance varied widely—and even the best model was not reliably good enough for dataset construction, where the effective bar is 100%.
Accuracy without gget virus

| Agent | Mean accuracy | Stability (σ=μ threshold) |
| --- | --- | --- |
| Claude Sonnet 4 | 16.9% | Low |
| Biomni OSS | 22.5% | Low |
| Edison Analysis | 40.0% | Moderate |
| GPT-5.2-pro | 67.1% | Moderate |
| Claude Opus 4.7 | 83.2% | 0.93 |
| GPT-5.5 | 91.3% | 1.00 |
Newer frontier models improved substantially—but residual errors remain consequential. A missing or incorrect record can determine whether a diagnostic assay appears to cover circulating diversity, or whether an outbreak is inferred to have started weeks earlier or later.
The reproducibility problem
The same model often returned different answers on identical prompts. For the example Ebolavirus query, Claude Sonnet 4 returned:
Run 1: 106 sequences (expected: 266)
Run 2: 15 sequences
Run 3: 5 sequences
That undermines both accuracy and reproducibility—requirements for any scientific workflow.
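The spread across those three runs can be quantified directly. The sketch below uses only the counts quoted above. The coefficient of variation (standard deviation divided by mean) is one common way to express run-to-run instability; whether it matches the benchmark's own stability measure is an assumption, since this article does not define that measure.

```python
from statistics import mean, pstdev

expected = 266
runs = [106, 15, 5]

# Fraction of the expected count that each run returned. This is not true
# recall, because the counts alone do not say which sequences were returned.
fraction_of_expected = [count / expected for count in runs]

mu = mean(runs)
sigma = pstdev(runs)  # population standard deviation across the three runs
cv = sigma / mu       # coefficient of variation

for run_number, (count, fraction) in enumerate(zip(runs, fraction_of_expected), start=1):
    print(f"Run {run_number}: {count} of {expected} sequences ({fraction:.1%})")
print(f"mean={mu:.1f}, std={sigma:.1f}, coefficient of variation={cv:.2f}")
```

Here the coefficient of variation comes out above 1, meaning the standard deviation across runs is larger than the mean count itself.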
When wrong retrieval changes biology
Anthropic illustrated downstream impact with two analyses:
Phylogenetic trees (TMRCA): A manually curated NCBI dataset inferred a January 2014 time to most recent common ancestor for the 2014 West African Ebola epidemic—consistent with prior literature. Agent-retrieved sets produced trees pushing TMRCA to 1922, or shifting it to April 2014 by missing Guinea sequences—changing inferred outbreak timing.
Therapeutic epitopes: For antibody candidates maftivimab and MBP134, three Sonnet 4 runs produced three different impressions of mutation variability in target regions—because underlying sequence sets were incomplete or wrong.
Failure modes
Agents often understood the task but lacked machine-actionable execution:
Under-counted when pagination stopped early (Influenza A, HIV-1, SARS-CoV-2)
Over-counted when filters were applied incorrectly
Struggled with metadata fields whose meaning depends on context (e.g., geographic info stored in virusName rather than location)
Answers could look plausible while being wrong—especially dangerous because sequence retrieval is usually step one in a long pipeline.
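The early-stop pagination failure is straightforward to guard against when the backend reports a total. Below is a minimal sketch, assuming a hypothetical `fetch_page(token)` callable that returns a dict with `records`, `next_token`, and `total_count` keys. These names are illustrative and do not correspond to a specific NCBI endpoint; the fake backend exists only to make the example runnable.

```python
def fetch_all(fetch_page, max_pages=10_000):
    """Follow pagination tokens to the end and refuse to return a truncated set."""
    records = []
    token = None
    total = None

    for _ in range(max_pages):
        page = fetch_page(token)
        records.extend(page["records"])
        total = page.get("total_count", total)
        token = page.get("next_token")
        if not token:
            break  # last page reached
    else:
        # The loop ran out of pages without reaching the end.
        raise RuntimeError(f"gave up after {max_pages} pages; result would be truncated")

    if total is not None and len(records) != total:
        raise RuntimeError(f"retrieved {len(records)} records but server reported {total}")
    return records


def make_fake_backend(items, page_size):
    """Stand-in for a paginated API, used only for this example."""
    def fetch_page(token):
        start = int(token or 0)
        end = start + page_size
        return {
            "records": items[start:end],
            "next_token": str(end) if end < len(items) else None,
            "total_count": len(items),
        }
    return fetch_page


all_records = fetch_all(make_fake_backend(list(range(266)), page_size=50))
print(len(all_records))
```

The design choice is to raise rather than return a partial list: an under-count that looks plausible is more dangerous than a loud failure.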
gget virus: the deterministic layer
The team developed gget virus in collaboration with NCBI researchers—not a simple API wrapper, but a system that reproduces NCBI Virus web-interface behavior across fragmented backends.
What it coordinates
NCBI Datasets REST API — lightweight metadata
NCBI Datasets CLI — cached bulk packages for SARS-CoV-2 and Influenza A
E-utilities — GenBank records for protein-level filters
Local filtering — when web UI semantics aren't exposed programmatically
Batching and retry logic — comprehensive retrieval without arbitrary cutoffs
Standardized outputs + logs — auditable, human- and machine-readable
The preprint reports >98% data transfer reduction for representative high-volume queries by applying metadata constraints before sequence download.
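The idea of applying metadata constraints before downloading sequences can be sketched as below. This is not gget virus's implementation. The record fields, placeholder accessions, and byte sizes are invented so the arithmetic is visible.

```python
def select_accessions(metadata, **constraints):
    """Keep accessions whose metadata matches every constraint exactly."""
    return [
        row["accession"]
        for row in metadata
        if all(row.get(key) == value for key, value in constraints.items())
    ]


def transfer_reduction(metadata, selected):
    """Fraction of sequence bytes that never need downloading after filtering first."""
    keep = set(selected)
    total = sum(row["size_bytes"] for row in metadata)
    kept = sum(row["size_bytes"] for row in metadata if row["accession"] in keep)
    return 1 - kept / total if total else 0.0


# Hypothetical lightweight metadata; only matching accessions would be fetched.
example_metadata = [
    {"accession": "X1", "host": "Homo sapiens", "year": 2025, "size_bytes": 30_000},
    {"accession": "X2", "host": "Homo sapiens", "year": 2024, "size_bytes": 30_000},
    {"accession": "X3", "host": "Sus scrofa", "year": 2025, "size_bytes": 30_000},
    {"accession": "X4", "host": "Homo sapiens", "year": 2025, "size_bytes": 30_000},
]
example_selected = select_accessions(example_metadata, host="Homo sapiens", year=2025)
example_reduction = transfer_reduction(example_metadata, example_selected)
print(example_selected, f"{example_reduction:.0%} of sequence bytes avoided")
```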
Written by Ferdous Nasri; developed with Sarah Gurev, Patrick Varilly, Krithik Ramesh, Nuala A. O'Leary, Jonah Cool, Bernhard Y. Renard, Pardis Sabeti, and Laura Luebbert.
Results with gget virus: model choice mattered less
When agents were instructed to use gget virus, the picture changed dramatically:
| Agent | Accuracy without gget | Accuracy with gget |
| --- | --- | --- |
| Claude Sonnet 4 | 16.9% | 92.8% |
| Biomni OSS | 22.5% | 90.0% |
| Edison Analysis | 40.0% | 93.1% |
| GPT-5.2-pro | 67.1% | 98.9% |
| Claude Opus 4.7 | 83.2% | 98.3% |
| GPT-5.5 | 91.3% | 99.7% |
Run-to-run variability was largely eliminated (stability 0.92–1.00). The performance gap between models narrowed dramatically. Adding a deterministic retrieval layer made model choice much less important: Claude Sonnet 4 with gget virus reached 92.8%, ahead of the 91.3% that GPT-5.5 managed while fighting messy APIs alone.
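The narrowing can be checked from the two accuracy columns reported above. The snippet only does arithmetic on those figures.

```python
without_gget = {
    "Claude Sonnet 4": 16.9,
    "Biomni OSS": 22.5,
    "Edison Analysis": 40.0,
    "GPT-5.2-pro": 67.1,
    "Claude Opus 4.7": 83.2,
    "GPT-5.5": 91.3,
}
with_gget = {
    "Claude Sonnet 4": 92.8,
    "Biomni OSS": 90.0,
    "Edison Analysis": 93.1,
    "GPT-5.2-pro": 98.9,
    "Claude Opus 4.7": 98.3,
    "GPT-5.5": 99.7,
}

# Gap between the best and worst agent, in percentage points.
spread_without = max(without_gget.values()) - min(without_gget.values())
spread_with = max(with_gget.values()) - min(with_gget.values())

# Per-agent improvement, in percentage points.
gains = {agent: round(with_gget[agent] - without_gget[agent], 1) for agent in without_gget}

print(f"best-to-worst spread without gget: {spread_without:.1f} points")
print(f"best-to-worst spread with gget:    {spread_with:.1f} points")
for agent, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{agent}: +{gain:.1f} points")
```

The spread between the best and worst agent falls from 74.4 to 9.7 percentage points, and the largest gains go to the agents that started lowest.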
One notable run: GPT-5.5 independently discovered and used gget virus on one query despite not being prompted to, producing the only correct answer for that question among 360 runs.
Remaining errors
Residual failures shifted from "can't access data reliably" to "agent misused the tool":
Incorrect local filtering after download
Partial processing of large FASTA files
Reverting to alternative APIs despite instructions
Wrong parameters on gget calls
The retrieval layer worked; agent invocation and output preservation still need guardrails—echoing @mosyaseen's loop-engineering point: you need something that can say no.
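One such guardrail is a post-retrieval check that refuses to pass results downstream when the agent's outputs disagree with each other. The sketch below is an assumption about what that could look like and is not part of gget virus: it counts FASTA headers and compares them with the number of metadata rows and with the count the agent reported. The sequences and row contents are placeholders.

```python
def count_fasta_records(fasta_text):
    """Count sequences by counting header lines, which start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def check_retrieval(fasta_text, metadata_rows, reported_count):
    """Return (ok, reasons); ok is False when any of the three counts disagree."""
    n_sequences = count_fasta_records(fasta_text)
    n_metadata = len(metadata_rows)
    reasons = []

    if n_sequences != n_metadata:
        reasons.append(
            f"FASTA has {n_sequences} records but metadata has {n_metadata} rows"
        )
    if reported_count != n_sequences:
        reasons.append(
            f"agent reported {reported_count} sequences but FASTA has {n_sequences}"
        )
    return (not reasons, reasons)


# Placeholder data simulating a partially processed FASTA file.
example_fasta = ">seq1\nACGT\n>seq2\nGGCC\n"
example_rows = [{"accession": "seq1"}, {"accession": "seq2"}, {"accession": "seq3"}]
example_ok, example_reasons = check_retrieval(example_fasta, example_rows, reported_count=3)
print(example_ok)
for reason in example_reasons:
    print(reason)
```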
The highway under the hill town
Anthropic returns to the city analogy: gget virus is a highway tunnel under pedestrian infrastructure—on-ramps, interchanges, exit numbers tied to known mile markers.
Karpathy's prescription applies directly: "make [genomic data] accessible to agents."
Creative work—hypothesis generation, experimental design, mechanism reasoning—should stay with models. The layer underneath must be boringly reliable:
Gene identifiers
Schemas
Retrieval logic
Coordinate systems
Metadata conventions
Data access paths
Broader ecosystem
gget virus joins a growing set of context engines for scientific agents.
The design question: where does determinism belong, and how do you build it so agents can invoke it without brittle post-processing?
As Nils Homer noted (cited in Anthropic's footnotes): "AI assistants need to work with your code, your outputs, and your analysis logic"—so agents can inspect how data was retrieved, not just what was returned.
Will better models make tools obsolete?
Anthropic addresses the obvious objection: if you extrapolate model curves, agents might eventually navigate messy portals alone.
Maybe. But even if an agent can fight through a confusing bioinformatics workflow, that does not mean it should every time:
Too expensive (token burn on pagination)
Too slow (multi-hour API gluing)
Too hard to audit (no retrieval logs)
Too hard to trust (plausible wrong counts)
Even if today's harnesses become obsolete, the lesson for database maintainers holds: design for agents as scaled users, with explicit filtering semantics, stable identifiers, machine-readable logs, and deterministic endpoints.
This parallels software-agent discourse from the same week: Peter Steinberger's loop tweet argued that engineers should design loops and skills rather than re-prompt agents from scratch every time. Biology's version: design deterministic retrieval tools rather than let each agent reinvent NCBI pagination.
Implications for agent builders
1. Separate reasoning from retrieval
Let the model plan and interpret. Let deterministic tools fetch, filter, and log. VirBench shows that a model doing its own retrieval still misses the bar for dataset construction, even at 91% mean accuracy.
2. Test at 100%, not "pretty good"
For dataset construction, 91% is not passing. Benchmark with ground-truth counts, multiple runs, and downstream analyses (phylogenetics, epitope mapping)—not just "did it return something."
3. Build agent-accessible interfaces
Biological databases need:
Filtering semantics matching web UIs
Documented metadata fields with examples
Pagination that cannot silently truncate
Logs showing how results were produced
Stable identifiers across sources
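A log entry showing how a result was produced might carry the query, the filters as applied, the record count, and a checksum of the returned accessions. The structure below is a hypothetical sketch, not a format defined by NCBI or gget virus; the tool name, version, and accessions are placeholders.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_retrieval_log(query, filters, accessions, tool, tool_version):
    """Build a machine-readable record of one retrieval."""
    ordered = sorted(accessions)
    # Hash the sorted accession list so two runs can be compared at a glance,
    # independent of the order in which records arrived.
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "tool_version": tool_version,
        "query": query,
        "filters": dict(sorted(filters.items())),
        "record_count": len(ordered),
        "accessions_sha256": digest,
    }


example_log = make_retrieval_log(
    query="Ebolavirus",
    filters={"host": "Homo sapiens", "completeness": "complete"},
    accessions=["ACC2", "ACC1"],
    tool="example-retriever",
    tool_version="0.0.0",
)
print(json.dumps(example_log, indent=2))
```

Two runs of the same query that produce different `accessions_sha256` values are an immediate reproducibility signal, without anyone having to diff the sequence files.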
4. Connectors and MCP for science
Software teams solve this with MCP servers and Claude connectors. Life sciences needs the same pattern: thin, deterministic tool surfaces agents call instead of browsing.
5. Cheaper models + right tool > frontier model alone
VirBench's most practical finding: gget virus democratized accuracy. Reliable science should not require the newest or most expensive model—or insider knowledge of which model handles which database best.
Anthropic's June 8, 2026 biology agents essay makes a precise claim: coding agents outran biological agents because software infrastructure was built for programmatic access, and biology wasn't.
With gget virus: ≥90% accuracy for every agent, a 99.7% peak, and run-to-run variability largely gone
Wrong retrieval changed phylogenetic outbreak dates and therapeutic epitope conclusions
The prescription is not "wait for smarter models." It is to build deterministic execution layers that are boring, auditable, and repeatable, and to let agents be creative on top.
For outbreak response in the DRC, diagnostic assay design, and protein model training data, that infrastructure is not a nice-to-have. It is the difference between a plausible-looking wrong answer and science you can trust.
Published June 9, 2026. VirBench metrics and outbreak statistics from Anthropic's June 8, 2026 post and Nasri et al. arXiv:2606.06749—verify against upstream before citing in research or public-health contexts.
Related: Long-Read Genome Sequencing and Rare Disease Diagnosis. The same data-quality bottleneck that VirBench identified in viral queries applies to variant interpretation in genomics: AI models are only as accurate as the curated databases they access.