On June 8, 2026, Anthropic published "Paving the way for agents in biology"—an essay by Laura Luebbert (Broad Institute, FutureHouse) arguing that biological data infrastructure must be redesigned for agents, not just human browser clicks.
The case study is sharp: task state-of-the-art scientific agents (Claude, GPT, Biomni, Edison Analysis) with retrieving viral sequences from NCBI Virus—the database behind outbreak surveillance, diagnostic assay design, and protein model training data. Even frontier models failed reproducibility tests. Accuracy jumped to nearly 100% once the team added gget virus, a deterministic retrieval layer built with NCBI collaborators.
The lesson extends far beyond virology: agents need boring, reliable tools underneath creative reasoning—the same pattern loop engineering and harness engineering teach for coding agents.
TL;DR
| Question | Answer |
|---|---|
| What was tested? | VirBench — 120 queries, 40 pathogens, manually verified ground-truth counts. |
| Agent accuracy alone? | Worst: 16.9% mean (Claude Sonnet 4). Best: 91.3% (GPT-5.5), still not enough for science. |
| With gget virus? | ≥90% all agents; 99.7% peak (GPT-5.5). Variability largely gone. |
| Why it matters now? | May 2026 Bundibugyo Ebola outbreak in DRC—genomic answers depend on correct sequence retrieval first. |
| Broader lesson? | Build deterministic execution layers; let models hypothesize, not reinvent pagination. |
| Full paper? | arXiv:2606.06749 — Nasri et al., 2026. |
The hill town problem: biology wasn't built for agents
Laura Luebbert opens with an analogy: using AI agents on today's biological data is like driving through an old Italian hill town designed before cars—beautiful, thoughtful, but full of narrow winding streets (idiosyncratic file formats, scattered databases, one-off scripts).
Software, by contrast, was built for cars:
- Paved roads → version control
- Clear lanes → documented APIs
- Standardized signals → package managers
- Fast start-to-finish travel → testable outputs (a GitHub patch that passes CI)
Coding agents advanced quickly because the infrastructure matches agent needs. Biological agents lag because retrieval and validation layers are brittle, heterogeneous, and process-dependent, and because biology offers few simple, verifiable rewards comparable to "tests pass" in software.
The bottleneck is not only reasoning. It is the absence of widespread deterministic execution layers for querying biological data. A scientist can express intent ("find all human kinases with this domain and pull their structures"), but agents lack a dependable, repeatable path to the databases.
In biology, small retrieval errors have severe downstream consequences:
- Wrong genome build → invalid coordinates
- Mixing RefSeq and GenBank unintentionally
- Treating partial genomes as complete
- Confusing segment names in segmented viruses
- Missing records due to inconsistent metadata fields
It does not matter how powerful the model is if the route depends on local knowledge hidden in a web UI.
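One of the errors listed above, unintentionally mixing RefSeq and GenBank records, can be caught mechanically. The sketch below is illustrative only and is not taken from the essay or from gget: it relies on the convention that RefSeq accessions start with a two-letter prefix and an underscore (such as NC_), while GenBank accessions contain no underscore. The function names are hypothetical.

```python
import re
from collections import Counter

# RefSeq accessions begin with two capital letters and an underscore
# (for example NC_002549.1); GenBank/INSDC accessions have no underscore.
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def source_of(accession: str) -> str:
    """Label an accession as 'RefSeq' or 'GenBank' from its format alone."""
    return "RefSeq" if REFSEQ_PREFIX.match(accession) else "GenBank"


def check_single_source(accessions: list[str]) -> Counter:
    """Count accessions per source and raise if the set mixes both."""
    counts = Counter(source_of(acc) for acc in accessions)
    if len(counts) > 1:
        raise ValueError(f"Mixed record sources: {dict(counts)}")
    return counts
```

A check like this belongs right after retrieval, before any alignment or tree building, so that a mixed dataset fails loudly instead of quietly duplicating genomes.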
Karpathy's "click tax" — the same pain in software
This mismatch is not unique to biology. Luebbert cites Andrej Karpathy's talk on software in the AI era: he vibe-coded a small web app quickly, then lost a week on authentication, payments, and deployment—clicking through browser dashboards.
"The code was the easiest part! Most of the work was in the browser, clicking things."
Documentation kept saying "go to this URL, click this dropdown." Karpathy's conclusion: nobody should have to do this—we must build for agents.
Anthropic's virology case study is the biological version of that complaint. Long before agents, computational biologists built partial fixes—Biopython, BioPerl, Entrez Direct, BioMart, gget—to move data out of browsers into scriptable workflows. But biological data still lives in a messy network of roads, each with its own identifiers, conventions, and degree of programmatic access.
Case study: NCBI Virus and the May 2026 Ebola outbreak
NCBI Virus aggregates viral sequence records from GenBank, RefSeq, and the international INSDC ecosystem (NCBI, ENA, DDBJ), including Pathoplexus—behind a searchable web interface.
Virology labs pass around long lists of complex filters that users manually reproduce in the browser. Exactly the workflow Karpathy described—except the stakes are public health.
Bundibugyo virus, DRC, May 2026
On May 14, 2026, INRB Kinshasa analyzed 13 blood samples and confirmed Bundibugyo virus disease in eight of them the next day. An Ebola outbreak was declared. By May 29, WHO reported 1,000+ confirmed and suspected cases and 200+ deaths in the DRC.
Researchers generated the first near-complete outbreak genomes, establishing a new spillover event. Public health officials need immediate answers:
- How different is this virus from prior Ebola viruses?
- Do existing diagnostics still detect it?
- Will existing therapeutics still protect patients?
Answering these requires comparing new genomes against historical Ebola records in NCBI Virus and Pathoplexus. The first steps should be automatable. Instead, they often involve manual browser filtering, hoping the dataset is complete and correct.
Much of NCBI Virus's filtering logic lives only in the web interface. A seasoned virologist might need only a few clicks to pull SARS-CoV-2 surface glycoprotein sequences from 2025. Done programmatically, the same request can require a multi-hundred-line script that glues together the REST, Datasets, and E-utilities APIs: paginating, reconciling identifiers, downloading hundreds of gigabytes, then filtering locally.
Even when APIs exist, agents struggle when:
- API filtering ≠ web UI semantics
- Metadata fields are poorly documented
- Identifiers differ across sources
- "The right answer" depends on expert conventions machines must infer
VirBench: 120 queries, ground-truth counts, three runs each
To measure the gap, Luebbert's team built VirBench—documented in the preprint "Deterministic access to global viral sequence data enables robust agentic scientific discovery" (Nasri et al., 2026).
Benchmark design
| Property | Detail |
|---|---|
| Queries | 120 realistic viral sequence retrievals |
| Pathogens | 40, from broad family searches to accession lookups |
| Filters per query | 1–9 simultaneous (median 6); up to 16 filter types |
| Expected counts | 0 to 3,226 sequences (median 22) |
| Ground truth | Manually verified via NCBI Virus web interface |
| Use cases | Surveillance, diagnostic assay design, protein model training data |
| Contributors | 58 queries from Sabeti Lab diagnostics team |
Example query (Ebolavirus):
Retrieve viral sequences from NCBI for TaxID 3052462 (Orthoebolavirus zairense (ZEBOV)) with: host organism human; geographic location Africa; collected 01/01/2014–06/20/2014; minimum sequence length 15,200 bases; maximum 1,900 ambiguous characters (N's); exclude lab-passaged samples.
Agents tested
Evaluated February 26, 2026:
- Claude Sonnet 4 and Claude Opus 4.7 (Anthropic Messages API)
- GPT-5.2-pro and GPT-5.5 (OpenAI Responses API, web search + code execution)
- Biomni OSS v0.0.8 (Claude Sonnet 4 backend)
- Edison Analysis (Edison client SDK)
Each query ran three independent times per agent to test reproducibility.
What happened when agents tried alone
Performance varied widely—and even the best model was not reliably good enough for dataset construction, where the effective bar is 100%.
Accuracy without gget virus
| Agent | Mean accuracy | Stability (σ=μ threshold) |
|---|---|---|
| Claude Sonnet 4 | 16.9% | Low |
| Biomni OSS | 22.5% | Low |
| Edison Analysis | 40.0% | Moderate |
| GPT-5.2-pro | 67.1% | Moderate |
| Claude Opus 4.7 | 83.2% | 0.93 |
| GPT-5.5 | 91.3% | 1.00 |
Newer frontier models improved substantially—but residual errors remain consequential. A missing or incorrect record can determine whether a diagnostic assay appears to cover circulating diversity, or whether an outbreak is inferred to have started weeks earlier or later.
The reproducibility problem
The same model often returned different answers on identical prompts. For the example Ebolavirus query, Claude Sonnet 4 returned:
- Run 1: 106 sequences (expected: 266)
- Run 2: 15 sequences
- Run 3: 5 sequences
That undermines both accuracy and reproducibility—requirements for any scientific workflow.
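To make the reproducibility check concrete, here is a minimal sketch that scores repeated runs of one query against a ground-truth count, using the three Sonnet 4 counts quoted above. The metric names are assumptions chosen for illustration; they are not the accuracy or stability definitions used in the preprint.

```python
from typing import Optional


def score_runs(counts: list[int], expected: int) -> dict:
    """Summarize repeated runs of one query against a ground-truth count."""
    exact = [count == expected for count in counts]
    # Fraction of the expected count returned by each run; undefined when
    # the ground truth is zero sequences.
    fractions: Optional[list[float]] = None
    if expected > 0:
        fractions = [round(count / expected, 3) for count in counts]
    return {
        "exact_match_rate": sum(exact) / len(counts),
        "all_runs_agree": len(set(counts)) == 1,
        "fraction_of_expected": fractions,
    }


# Counts from the example Ebolavirus query (expected: 266 sequences).
sonnet_runs = [106, 15, 5]
summary = score_runs(sonnet_runs, expected=266)
print(summary)
```

Separating "did every run match the ground truth" from "did the runs agree with each other" matters because an agent can be consistently wrong, or right once and wrong twice.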
When wrong retrieval changes biology
Anthropic illustrated downstream impact with two analyses:
Phylogenetic trees (TMRCA): A manually curated NCBI dataset inferred a January 2014 time to most recent common ancestor for the 2014 West African Ebola epidemic—consistent with prior literature. Agent-retrieved sets produced trees pushing TMRCA to 1922, or shifting it to April 2014 by missing Guinea sequences—changing inferred outbreak timing.
Therapeutic epitopes: For antibody candidates maftivimab and MBP134, three Sonnet 4 runs produced three different impressions of mutation variability in target regions—because underlying sequence sets were incomplete or wrong.
Failure modes
Agents often understood the task but lacked machine-actionable execution:
- Under-counted when pagination stopped early (Influenza A, HIV-1, SARS-CoV-2)
- Over-counted when filters were applied incorrectly
- Struggled with metadata fields whose meaning depends on context (e.g., geographic info stored in virusName rather than location)
- Performance degraded beyond 3–4 simultaneous filters
Answers could look plausible while being wrong—especially dangerous because sequence retrieval is usually step one in a long pipeline.
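The under-counting failure is the easiest to guard against in code. The following sketch shows a pagination loop that refuses to return a result unless the number of records collected equals the total the backend reports. The page shape (records, next_token, total_count) and the fake backend are hypothetical stand-ins, not the NCBI API.

```python
from typing import Callable, Optional

# Hypothetical page shape:
# {"records": [...], "next_token": str or None, "total_count": int}
Page = dict


def fetch_all(fetch_page: Callable[[Optional[str]], Page]) -> list[str]:
    """Follow next_token until exhausted, then verify nothing was dropped."""
    records: list[str] = []
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        records.extend(page["records"])
        total = page["total_count"]
        token = page.get("next_token")
        if token is None:
            break
    if len(records) != total:
        raise RuntimeError(
            f"Truncated retrieval: got {len(records)} of {total} records"
        )
    return records


def make_fake_backend(all_records: list[str], page_size: int):
    """Build an in-memory page fetcher so the loop can be exercised offline."""
    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token else 0
        end = start + page_size
        next_token = str(end) if end < len(all_records) else None
        return {
            "records": all_records[start:end],
            "next_token": next_token,
            "total_count": len(all_records),
        }
    return fetch_page
```

The design choice is that a short result is an error, not an answer: the caller never sees a partial list that looks complete.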
gget virus: the deterministic layer
The team developed gget virus in collaboration with NCBI researchers—not a simple API wrapper, but a system that reproduces NCBI Virus web-interface behavior across fragmented backends.
What it coordinates
- NCBI Datasets REST API — lightweight metadata
- NCBI Datasets CLI — cached bulk packages for SARS-CoV-2 and Influenza A
- E-utilities — GenBank records for protein-level filters
- Local filtering — when web UI semantics aren't exposed programmatically
- Batching and retry logic — comprehensive retrieval without arbitrary cutoffs
- Standardized outputs + logs — auditable, human- and machine-readable
The preprint reports >98% data transfer reduction for representative high-volume queries by applying metadata constraints before sequence download.
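The idea of applying metadata constraints before downloading sequences can be sketched in a few lines. This is not gget's implementation: the metadata field names and the accessions below are hypothetical, and the filter values mirror the example Ebolavirus query from the benchmark.

```python
from datetime import date


def passes_filters(meta: dict, *, host: str, min_length: int, max_n: int,
                   collected_after: date, collected_before: date) -> bool:
    """Decide from lightweight metadata alone whether a record is wanted."""
    return (
        meta["host"] == host
        and meta["length"] >= min_length
        and meta["n_count"] <= max_n
        and collected_after <= meta["collection_date"] <= collected_before
    )


def select_accessions(metadata: list[dict], **filters) -> list[str]:
    """Return only the accessions whose sequences need to be downloaded."""
    return [m["accession"] for m in metadata if passes_filters(m, **filters)]


# Hypothetical metadata rows; real records carry many more fields.
metadata = [
    {"accession": "EX0001", "host": "human", "length": 18959,
     "n_count": 12, "collection_date": date(2014, 3, 17)},
    {"accession": "EX0002", "host": "human", "length": 9000,
     "n_count": 0, "collection_date": date(2014, 4, 2)},
    {"accession": "EX0003", "host": "bat", "length": 18900,
     "n_count": 5, "collection_date": date(2014, 5, 1)},
]
selected = select_accessions(
    metadata, host="human", min_length=15200, max_n=1900,
    collected_after=date(2014, 1, 1), collected_before=date(2014, 6, 20),
)
print(selected)
```

Only the accessions in selected would then be fetched as sequences, which is where the transfer savings come from when most records fail the filters.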
Install and basic usage
pip install gget
# Example: Zaire ebolavirus with filters
gget virus "Zaire ebolavirus" \
--host human \
--geo_location Africa \
--collection_date_after 2014-01-01 \
--collection_date_before 2014-06-20 \
--min_seq_length 15200 \
--max_n 1900
Documentation: gget virus module (Pachter Lab)
Written by Ferdous Nasri; developed with Sarah Gurev, Patrick Varilly, Krithik Ramesh, Nuala A. O'Leary, Jonah Cool, Bernhard Y. Renard, Pardis Sabeti, and Laura Luebbert.
Results with gget virus: model choice mattered less
When agents were instructed to use gget virus, the picture changed dramatically:
| Agent | Accuracy without gget | Accuracy with gget |
|---|---|---|
| Claude Sonnet 4 | 16.9% | 92.8% |
| Biomni OSS | 22.5% | 90.0% |
| Edison Analysis | 40.0% | 93.1% |
| GPT-5.2-pro | 67.1% | 98.9% |
| Claude Opus 4.7 | 83.2% | 98.3% |
| GPT-5.5 | 91.3% | 99.7% |
Run-to-run variability was largely eliminated (stability 0.92–1.00), and the performance gap between models narrowed dramatically. Adding a deterministic retrieval layer made model choice much less important: Claude Sonnet 4 with gget virus (92.8%) edged out GPT-5.5 working alone (91.3%), so a cheaper model with the right tool can beat an expensive model fighting messy APIs by itself.
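The narrowing can be read straight off the table. This short calculation uses only the accuracy figures reported above and computes each agent's gain in percentage points, plus the gap between the best and worst agent with and without the tool.

```python
# (accuracy without gget virus, accuracy with gget virus), in percent,
# copied from the table above.
accuracy = {
    "Claude Sonnet 4": (16.9, 92.8),
    "Biomni OSS": (22.5, 90.0),
    "Edison Analysis": (40.0, 93.1),
    "GPT-5.2-pro": (67.1, 98.9),
    "Claude Opus 4.7": (83.2, 98.3),
    "GPT-5.5": (91.3, 99.7),
}

# Gain per agent, in percentage points.
gains = {agent: round(w - wo, 1) for agent, (wo, w) in accuracy.items()}

without = [wo for wo, _ in accuracy.values()]
with_tool = [w for _, w in accuracy.values()]

# Best-to-worst gap across agents, before and after adding the tool.
spread_without = round(max(without) - min(without), 1)
spread_with = round(max(with_tool) - min(with_tool), 1)

for agent, gain in gains.items():
    print(f"{agent}: +{gain} points")
print(f"Spread without tool: {spread_without} points")
print(f"Spread with tool: {spread_with} points")
```

The best-to-worst gap shrinks from 74.4 points to 9.7 points, which is the quantitative form of "model choice mattered less."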
One notable run: GPT-5.5 independently discovered and used gget virus on one query despite not being prompted to; it was the only correct answer for that question among 360 runs.
Remaining errors
Residual failures shifted from "can't access data reliably" to "agent misused the tool":
- Incorrect local filtering after download
- Partial processing of large FASTA files
- Reverting to alternative APIs despite instructions
- Wrong parameters on gget calls
The retrieval layer worked; agent invocation and output preservation still need guardrails—echoing @mosyaseen's loop-engineering point: you need something that can say no.
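One possible guardrail for the "partial processing of large FASTA files" failure is to compare the number of records an agent actually carried forward with the count the retrieval step reported. The sketch below is an assumption about how such a check could look, not something described in the preprint; it counts FASTA header lines, which start with ">".

```python
import io
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count sequences by counting header lines, which begin with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def assert_complete(lines: Iterable[str], expected_count: int) -> int:
    """Raise if the FASTA content holds fewer or more records than expected."""
    found = count_fasta_records(lines)
    if found != expected_count:
        raise ValueError(
            f"FASTA has {found} records, retrieval reported {expected_count}"
        )
    return found


# Tiny in-memory example; a real check would read the downloaded file.
example = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGGCC\n")
print(assert_complete(example, expected_count=2))
```

Run after every agent step that touches the sequence file, a check like this turns silent truncation into an explicit failure.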
The highway under the hill town
Anthropic returns to the city analogy: gget virus is a highway tunnel under pedestrian infrastructure—on-ramps, interchanges, exit numbers tied to known mile markers.
Karpathy's prescription applies directly: "make [genomic data] accessible to agents."
Creative work—hypothesis generation, experimental design, mechanism reasoning—should stay with models. The layer underneath must be boringly reliable:
- Gene identifiers
- Schemas
- Retrieval logic
- Coordinate systems
- Metadata conventions
- Data access paths
Broader ecosystem
gget virus joins a growing set of context engines for scientific agents:
| System | Role |
|---|---|
| ToolUniverse | Tool aggregation for biomedical agents |
| Edison Scientific Robin | Research agent with tool harness |
| Biomni | General-purpose biomedical agent |
| gget virus | Deterministic viral sequence retrieval |
The design question: where does determinism belong, and how do you build it so agents can invoke it without brittle post-processing?
As Nils Homer noted (cited in Anthropic's footnotes): "AI assistants need to work with your code, your outputs, and your analysis logic"—so agents can inspect how data was retrieved, not just what was returned.
Will better models make tools obsolete?
Anthropic addresses the obvious objection: if you extrapolate model curves, agents might eventually navigate messy portals alone.
Maybe. But even if an agent can fight through a confusing bioinformatics workflow, that does not mean it should every time:
- Too expensive (token burn on pagination)
- Too slow (multi-hour API gluing)
- Too hard to audit (no retrieval logs)
- Too hard to trust (plausible wrong counts)
If today's harnesses become obsolete, the lesson for database maintainers holds: design for agents as scaled users—explicit filtering semantics, stable identifiers, machine-readable logs, deterministic endpoints.
This parallels software agent discourse from the same week: Peter Steinberger's loop tweet argued engineers should design loops and skills, not re-prompt agents from scratch every time. Biology's version: design deterministic retrieval tools instead of letting each agent reinvent NCBI pagination.
Implications for agent builders
1. Separate reasoning from retrieval
Let the model plan and interpret. Let deterministic tools fetch, filter, and log. VirBench shows that reasoning without a deterministic retrieval layer falls short of the bar for dataset construction even at 91% mean accuracy.
2. Test at 100%, not "pretty good"
For dataset construction, 91% is not passing. Benchmark with ground-truth counts, multiple runs, and downstream analyses (phylogenetics, epitope mapping)—not just "did it return something."
3. Build agent-accessible interfaces
Biological databases need:
- Filtering semantics matching web UIs
- Documented metadata fields with examples
- Pagination that cannot silently truncate
- Logs showing how results were produced
- Stable identifiers across sources
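As an illustration of "logs showing how results were produced," the sketch below builds a machine-readable record of a retrieval: the query, the filters, the record count, and a hash of the sorted accession list so two runs can be compared exactly. The field names are hypothetical and do not reflect the log format gget virus writes.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_retrieval_log(query: str, filters: dict, accessions: list[str],
                        tool_version: str) -> dict:
    """Describe how a result set was produced, in a JSON-serializable dict."""
    ordered = sorted(accessions)
    # Hash the sorted accessions so the fingerprint ignores return order.
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return {
        "query": query,
        "filters": filters,
        "tool_version": tool_version,
        "record_count": len(ordered),
        "accessions_sha256": digest,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


log = build_retrieval_log(
    query="Zaire ebolavirus",
    filters={"host": "human", "min_seq_length": 15200},
    accessions=["EX0002", "EX0001"],  # hypothetical accessions
    tool_version="example-0.1",
)
print(json.dumps(log, indent=2))
```

Two runs that return the same records yield the same accessions_sha256, so reproducibility becomes a string comparison rather than a manual diff.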
4. Connectors and MCP for science
Software teams solve this with MCP servers and Claude connectors. Life sciences needs the same pattern: thin, deterministic tool surfaces agents call instead of browsing.
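A thin tool surface can be as small as a declared parameter schema plus a validator that rejects malformed calls before anything touches a database. The sketch below is hypothetical: the tool name, its parameters, and the input_schema key are assumptions modelled on the JSON Schema style used by tool-calling interfaces, not an existing MCP server or the gget interface.

```python
# Hypothetical tool definition in JSON Schema style.
VIRUS_TOOL = {
    "name": "retrieve_viral_sequences",
    "description": "Deterministically retrieve viral sequences by filter.",
    "input_schema": {
        "type": "object",
        "properties": {
            "virus": {"type": "string"},
            "host": {"type": "string"},
            "min_seq_length": {"type": "integer"},
            "max_n": {"type": "integer"},
        },
        "required": ["virus"],
        "additionalProperties": False,
    },
}

PYTHON_TYPES = {"string": str, "integer": int}


def validate_call(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    schema = VIRUS_TOOL["input_schema"]
    problems = []
    for name in schema["required"]:
        if name not in arguments:
            problems.append(f"missing required parameter: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown parameter: {name}")
        elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems


print(validate_call({"virus": "Zaire ebolavirus", "min_seq_length": 15200}))
print(validate_call({"host": "human", "geo": "Africa"}))
```

Rejecting unknown parameters outright, instead of ignoring them, addresses the "wrong parameters on gget calls" failure: a misspelled filter becomes an error the agent must fix, not a silently broader query.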
5. Cheaper models + right tool > frontier model alone
VirBench's most practical finding: gget virus democratized accuracy. Reliable science should not require the newest or most expensive model—or insider knowledge of which model handles which database best.
Related reading
ExplainX guides
- Loop engineering: design loops that prompt agents
- Agent harness engineering: seven planes
- Anthropic harness engineering for coding
- What is MCP? Complete guide
- Claude connectors and MCP servers
Primary sources
- Anthropic Science: Paving the way for agents in biology (June 8, 2026)
- arXiv preprint: Deterministic access to global viral sequence data (2606.06749)
- gget virus documentation (Pachter Lab)
- Elliot Hershberg: How Software in the Life Sciences Actually Works — cited in Anthropic footnotes on fragmented bioinformatics tooling
Summary
Anthropic's June 8, 2026 biology agents essay makes a precise claim: coding agents outran biological agents because software infrastructure was built for programmatic access, and biology wasn't.
VirBench proved it with numbers:
- 120 queries, 40 pathogens, agents alone: 16.9%–91.3% accuracy with dangerous run-to-run variance
- gget virus added: ≥90% for every agent, 99.7% peak, variability largely gone
- Wrong retrieval changed phylogenetic outbreak dates and therapeutic epitope conclusions
The prescription is not "wait for smarter models." It is build deterministic execution layers—boring, auditable, repeatable—and let agents be creative on top.
For outbreak response in the DRC, diagnostic assay design, and protein model training data, that infrastructure is not a nice-to-have. It is the difference between a plausible-looking wrong answer and science you can trust.
Published June 9, 2026. VirBench metrics and outbreak statistics from Anthropic's June 8, 2026 post and Nasri et al. arXiv:2606.06749—verify against upstream before citing in research or public-health contexts.