LifeSciBench: OpenAI's 750-Task Benchmark for GPT-Rosalind in Biotech

OpenAI's LifeSciBench (June 17, 2026) benchmarks AI on real biotech workflows: 750 expert tasks, 173 scientists, and a 36.1% pass rate for GPT-Rosalind versus 25.7% for GPT-5.5. This post covers Tacit Labs, rubric grading, the artifact-heavy gaps, and the implications for pharma R&D.

6 min read · Yash Thakker
Tags: OpenAI, GPT-Rosalind, Life Sciences, AI Benchmark, Biotech

On June 17, 2026, OpenAI published LifeSciBench, a benchmark built with 173 practicing scientists to answer a question that glossier AI evals skip: can this model do the messy work of drug discovery, not just recite biology?

The same week, OpenAI also released Deployment Simulation for pre-launch safety and positioned GPT-Rosalind for the life sciences. LifeSciBench is the scoreboard for that bet.

Headline result: GPT-Rosalind reaches a 36.1% strict pass rate versus 25.7% for GPT-5.5. That is meaningful progress, but nearly two-thirds of tasks remain unsolved.


TL;DR

Metric | Value
Tasks | 750 expert-authored
Scientists | 173 contributors, 453 reviewers
Artifacts | 1,062 files (figures, PDFs, sequences, etc.)
Rubric criteria | 19,020 (~25 per task)
GPT-Rosalind pass | 36.1%
GPT-5.5 pass | 25.7%
Tacit Labs | Nicole Fitzgerald; AI × biology applied lab
Paper | openai.com/index/introducing-life-sci-bench

Why Life Science Needs Its Own Benchmark

Most science benchmarks test isolated skills:

  • Multiple-choice biology
  • Single-step predictions
  • Clean reference answers

Real biotech work looks like:

  • Interpreting incomplete Phase 1/2 data for an FDA Type B meeting
  • Reconciling conflicting assay readouts
  • Designing CRISPR donors under operational constraints
  • Explaining uncertainty to a skeptical reviewer

LifeSciBench tasks read like emails to a senior scientist collaborator: each has a prompt, context artifacts, and a free-response answer that is graded against a rubric.


Seven Workflows Measured

OpenAI grouped industry survey responses into seven recurring workflows:

Workflow | Example demand
Evidence handling | Extract, reconcile, audit papers and records
Analysis | Quantitative interpretation with caveats
Design, optimization & prediction | Experiments, constructs, assays
Scientific reasoning | Mechanism, hypothesis, conflict resolution
Validation & operations | Lab execution, troubleshooting
Translation | Bench → bedside, regulatory framing
Scientific communication | Expert-facing writeups

79% of tasks need multiple reasoning steps (4 on average), and 53% require artifacts rather than prompt text alone.


Dataset Construction (Why Experts Trust It)

Rigor signals:

  • 173 task authors with PhD training + biotech/pharma experience
  • ~6 automated review cycles per task (avg)
  • ≥2 expert review rounds
  • ≥90% reviewer agreement in-domain
  • 453 independent validators (97% hold PhD+)

Reviewer agreement on benchmark quality: 96%+ in all categories (real-world relevance, reasoning test, grounding, usefulness).

This is closer to contract research organization (CRO) review than crowd-sourced QA.


Grading: 19,020 Rubric Criteria

Pass threshold: 70% rubric score per task.

Science rarely reduces to one correct string. Rubrics score:

  • Correct claims (+points)
  • Missing assay limitations (points lost, an implicit failure)
  • Wrong evidence weighting
  • Format expected by regulators or PI review

Partial credit matters: on ~14% of tasks, models earn ≥50% of the rubric points yet still fall short of the pass threshold. Those answers are useful but not deployable on their own.
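To make the threshold logic concrete, here is a minimal Python sketch of rubric grading. Only the 70% pass threshold and the 50% partial-credit band come from the post. The `Criterion` fields, the point values, and the "earned points over available points" scoring rule are assumptions for illustration, not OpenAI's published schema or grader.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70     # pass threshold stated in the post
PARTIAL_THRESHOLD = 0.50  # lower edge of the "useful but not deployable" band


@dataclass
class Criterion:
    """One rubric line. Field names are hypothetical, not OpenAI's schema."""
    description: str
    points: float  # points available for this criterion
    met: bool      # grader's judgment for this answer


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of available points earned (assumed scoring rule)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def classify(score: float) -> str:
    """Map a rubric score to pass / partial / fail."""
    if score >= PASS_THRESHOLD:
        return "pass"
    if score >= PARTIAL_THRESHOLD:
        return "partial"
    return "fail"


# Illustrative rubric with made-up point values.
example = [
    Criterion("Flags assay specificity concern", 3, True),
    Criterion("Questions the quantification standard", 3, True),
    Criterion("Notes external-control bias", 2, False),
    Criterion("Addresses surrogate endpoint validity", 2, False),
]
score = rubric_score(example)
print(f"{score:.2f} -> {classify(score)}")  # prints: 0.60 -> partial
```

In this sketch an unmet criterion simply earns nothing, so an answer that gets the headline claims right but skips the caveats lands in the partial band rather than passing.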


Example Task Flavor (DMD Gene Therapy)

LifeSciBench publishes one example task: a critique of a Duchenne muscular dystrophy accelerated-approval package for a micro-dystrophin AAV9 therapy, with Western blot, immunofluorescence, and NSAA (North Star Ambulatory Assessment) functional data.

A strong answer flags:

  • Assay specificity (MANEX1A epitope sharing)
  • Invalid standards (138 kDa vs full-length dystrophin)
  • Revertant fiber confounding
  • External control bias on NSAA
  • Surrogate endpoint validity

GPT-Rosalind-style outputs must pressure-test the package like a skeptical FDA reviewer rather than summarize the press release.


Results: Where GPT-Rosalind Wins and Loses

Overall

Model | Pass rate | Notes
GPT-Rosalind | 36.1% | Life-science-tuned
GPT-5.5 | 25.7% | General frontier

Strongest workflows (Rosalind gains)

Workflow | GPT-5.5 | GPT-Rosalind
Scientific Communication | 56.3% | 71.1% (n=9, small sample)
Translation | 36.8% | 57.7%
Expert-useful outputs | 29.1% | 44.7%
Uncertainty handling | 29.3% | 44.8%

Models do best when tasks have clear evidence boundaries and need structured judgment.
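The n=9 caveat on Scientific Communication can be quantified with a rough confidence interval. The sketch below applies the standard Wilson score interval to the reported 71.1% rate. It assumes the nine tasks behave like independent pass/fail trials, which is an assumption: the post does not say how per-workflow rates are aggregated, and 71.1% is not a whole-number fraction of 9.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be in [0, 1]")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Reported GPT-Rosalind rate on Scientific Communication, n=9 tasks.
low, high = wilson_interval(0.711, 9)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly 39% to 90%, so the Scientific Communication lead is suggestive rather than settled.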

Weakest areas

Challenge | GPT-Rosalind pass
Design / Optimization / Prediction | ~30.7%
Analysis | ~30.3%
With artifacts/URLs | 28.1% (vs 45.1% text-only)
Exact numeric outputs | 14.8%
Sequence/structure outputs | 24.0%
Construct generation | 27.3%

The artifact gap is the story: models struggle to read complex figures and large sequence files and to synthesize them into decisions, and those artifacts are exactly what wet labs produce daily.
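A quick arithmetic check shows how much the artifact split drives the headline number. The sketch below blends the two reported GPT-Rosalind pass rates by task share. It assumes that the 53% of tasks that "require artifacts" is the same population as the "with artifacts/URLs" row, which the post does not state explicitly.

```python
# Figures quoted in the post; the mapping between them is an assumption.
ARTIFACT_SHARE = 0.53        # share of tasks that require artifacts
PASS_WITH_ARTIFACTS = 0.281  # GPT-Rosalind pass rate, artifact/URL tasks
PASS_TEXT_ONLY = 0.451       # GPT-Rosalind pass rate, text-only tasks
REPORTED_OVERALL = 0.361     # GPT-Rosalind overall pass rate


def blended_pass_rate(share_a: float, rate_a: float, rate_b: float) -> float:
    """Weighted average of two subgroup pass rates."""
    return share_a * rate_a + (1 - share_a) * rate_b


blended = blended_pass_rate(ARTIFACT_SHARE, PASS_WITH_ARTIFACTS, PASS_TEXT_ONLY)
gap_points = (PASS_TEXT_ONLY - PASS_WITH_ARTIFACTS) * 100

print(f"blended {blended:.1%} vs reported {REPORTED_OVERALL:.1%}")
print(f"artifact gap: {gap_points:.1f} points")
```

Under that assumption the blend lands on the reported 36.1%, and the 17-point gap between text-only and artifact tasks accounts for most of the distance between the overall rate and the text-only rate.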


GPT-Rosalind and Tacit Labs

OpenAI pairs the benchmark with GPT-Rosalind, a life-sciences-oriented model line (see also the Rosalind Biodefense product threads from May 2026).

Nicole Fitzgerald (@ninklefitz), formerly of Microsoft Research and Databricks Mosaic AI, announced Tacit Labs, an applied research lab for AI and autonomous biotech tooling, the same day.

LifeSciBench measures models; Tacit Labs builds systems that might sit in real R&D workflows. The two are complementary, not redundant.


Benchmark Politics on X

@scaling01 noted that OpenAI compares against xAI and Google but not Anthropic, and framed this as a shift in the competitive narrative.

Community context (per Wired reporting): Anthropic restricts API use for competitor benchmarking, which makes cross-vendor LifeSciBench tables asymmetric.

@teortaxesTex criticized the inclusion of Grok in the charts. Methodology debates will continue as labs pick favorable comparators.


Relation to Deployment Simulation

Released one day apart (June 16–17):

Tool | Question
Deployment Simulation | How will the model behave in ChatGPT traffic?
LifeSciBench | Can the model do PhD-level biotech tasks?

Together they cover operational safety forecasting and domain capability measurement: OpenAI's pre-release stack for high-stakes verticals.

Contrast with ALE (agent autonomy) and Fable 5 cyber evals (security politics).


Limitations (OpenAI's)

  • Not live lab validation: tasks are self-contained
  • No multi-week iterative science
  • Specialty coverage is incomplete
  • Exact-output tasks are brittle (formatting vs science)
  • Benchmark ≠ discovery impact

Next step: deployment studies in real research programs.


Who Should Care

Audience | Takeaway
Biotech / pharma AI teams | Rubric-heavy evals match how QA and regulatory think
Model labs | Artifact-heavy multimodal science remains wide open
Investors | 36% pass = far from automating drug discovery
Developers | GPT-Rosalind API access via OpenAI contributor program

OpenAI invites scientist contributors and GPT-Rosalind access requests via the announcement page.


Summary

LifeSciBench is the most serious public attempt yet to grade AI on industry-shaped biology work (FDA skepticism, assay traps, translation) rather than textbook drills.

GPT-Rosalind leads GPT-5.5 by 10.4 points on pass rate (36.1% vs 25.7%) but still fails most tasks. The remaining gap to production lies in artifacts, exact constructs, and live iteration.

Tacit Labs signals that OpenAI is not stopping at benchmarks: the goal is tools inside labs, not just chatbots that read Nature abstracts.


Related Reading

Benchmark statistics cited from OpenAI LifeSciBench announcement (June 17, 2026).
