What is OpenAI LifeSciBench?

LifeSciBench is an expert-written benchmark announced June 17, 2026, with 750 tasks across seven biological research workflows and seven domains. It measures whether AI can support real biotech/pharma work—evidence handling, experimental design, translation, communication—not just biology trivia. Developed with 173 PhD-level scientists; graded with 19,020 rubric criteria.

How does GPT-Rosalind perform on LifeSciBench?

GPT-Rosalind achieves a 36.1% strict pass rate (70% rubric threshold) overall, versus 25.7% for GPT-5.5. The strongest gains appear in the Scientific Communication and Translation workflows. The weakest areas include Design/Optimization/Prediction (~31% pass) and artifact-heavy tasks (28.1% pass with files vs 45.1% text-only).

What is Tacit Labs?

Tacit Labs is a new company founded by ex-Microsoft Research scientist Nicole Fitzgerald (@ninklefitz), announced alongside LifeSciBench. It focuses on applied research at the intersection of AI and biology, building tools for autonomous biotech lab workflows that complement OpenAI's GPT-Rosalind push.

What makes LifeSciBench different from other science benchmarks?

Tasks mirror requests to a knowledgeable collaborator: scientific prompts, attached artifacts (figures, PDFs, sequences, structures), free-response answers, and granular rubrics averaging 25 criteria per task. 79% require multiple reasoning steps; 53% need interpreting attached files—not prompt text alone.

Can Anthropic models be benchmarked on LifeSciBench?

OpenAI built LifeSciBench with industry scientists; third-party comparisons to Claude or other labs' models depend on API access policies. Community discussion on X notes that Anthropic restricts API use for benchmarking competitors, which affects cross-vendor leaderboard narratives.

What are LifeSciBench limitations?

Self-contained tasks do not capture iterative lab research that unfolds over weeks, and strong benchmark scores do not prove downstream discovery impact. Models still fail most tasks that demand exact numeric, sequence/structure, or construct outputs (roughly 15–27% pass), which are critical for CRISPR donors and siRNA design. Deployment validation in live R&D settings is the stated next step.

LifeSciBench: GPT-Rosalind Life Science AI Benchmark | explainx.ai Blog

On June 17, 2026, OpenAI published LifeSciBench — a benchmark built with 173 practicing scientists to answer a question glossier AI evals skip: Can this model do the messy work of drug discovery—not just recite biology?

The same week brought Deployment Simulation for pre-launch safety and GPT-Rosalind positioning in life sciences. LifeSciBench is the scoreboard for that bet.

Headline result: GPT-Rosalind hits 36.1% strict pass rate vs GPT-5.5 at 25.7% — meaningful progress, with most tasks still unsolved.

TL;DR

Metric	Value
Tasks	750 expert-authored
Scientists	173 contributors, 453 reviewers
Artifacts	1,062 files (figures, PDFs, sequences, etc.)
Rubric criteria	19,020 (~25 per task)
GPT-Rosalind pass	36.1%
GPT-5.5 pass	25.7%
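
As a quick sanity check on the dataset figures above, the per-task averages can be recomputed from the quoted totals. This is only arithmetic on the numbers in the table; the variable names are illustrative, and the artifact average is a mean over all 750 tasks rather than a statement about how files are actually distributed.

```python
# Totals quoted in the TL;DR table above.
TASKS = 750
RUBRIC_CRITERIA = 19_020
ARTIFACT_FILES = 1_062

# Mean rubric criteria per task; the article rounds this to "~25".
criteria_per_task = RUBRIC_CRITERIA / TASKS

# Mean attached files per task, averaged over ALL tasks (an assumption of
# this sketch: the source does not say how files are spread across tasks).
artifacts_per_task = ARTIFACT_FILES / TASKS

print(f"criteria per task: {criteria_per_task:.2f}")
print(f"artifact files per task: {artifacts_per_task:.2f}")
```

The rubric average works out to slightly above 25, consistent with the "~25 per task" figure.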

Workflow	Example demand
Evidence handling	Extract, reconcile, audit papers and records
Analysis	Quantitative interpretation with caveats
Design, optimization & prediction	Experiments, constructs, assays
Scientific reasoning	Mechanism, hypothesis, conflict resolution
Validation & operations	Lab execution, troubleshooting
Translation	Bench → bedside, regulatory framing
Scientific communication	Expert-facing writeups

Model	Pass rate	Notes
GPT-Rosalind	36.1%	Life-science-tuned
GPT-5.5	25.7%	General frontier
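
The pass rates above are strict pass rates at a 70% rubric threshold. A minimal sketch of how such a metric could be computed follows. It assumes each rubric criterion is a simple met/not-met check with equal weight, which the source does not confirm; the function names and toy data are hypothetical and are not OpenAI's grader.

```python
from typing import Sequence

PASS_THRESHOLD = 0.70  # rubric threshold quoted in the article


def task_passes(criteria_met: Sequence[bool],
                threshold: float = PASS_THRESHOLD) -> bool:
    """Return True if the share of satisfied criteria reaches the threshold.

    Assumption: criteria are unweighted booleans. The real grader may
    weight criteria differently.
    """
    if not criteria_met:
        raise ValueError("a task needs at least one rubric criterion")
    return sum(criteria_met) / len(criteria_met) >= threshold


def strict_pass_rate(tasks: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks whose rubric score clears the threshold."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(task_passes(t) for t in tasks) / len(tasks)


# Toy data: four made-up tasks with 25 criteria each.
toy_tasks = [
    [True] * 18 + [False] * 7,   # 18/25 = 0.72, passes
    [True] * 17 + [False] * 8,   # 17/25 = 0.68, fails
    [True] * 25,                 # 25/25 = 1.00, passes
    [True] * 10 + [False] * 15,  # 10/25 = 0.40, fails
]

toy_rate = strict_pass_rate(toy_tasks)
print(f"strict pass rate: {toy_rate:.1%}")
```

Under this reading, a task that satisfies 68% of its criteria counts as a full failure, which is why a strict pass rate can sit well below the average rubric score.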

Workflow or capability	GPT-5.5	GPT-Rosalind
Scientific Communication	56.3%	71.1% (n=9 — small)
Translation	36.8%	57.7%
Expert-useful outputs	29.1%	44.7%
Uncertainty handling	29.3%	44.8%
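
To make the gains in the table above concrete, the sketch below computes the percentage-point difference for each row, and then illustrates why the n=9 caveat matters by putting a 95% Wilson score interval around the 71.1% figure. It assumes n=9 is the number of tasks in the Scientific Communication workflow and treats them as independent trials; both are assumptions of this example, not claims from the source.

```python
import math

# (GPT-5.5, GPT-Rosalind) pass rates in percent, copied from the table above.
rates = {
    "Scientific Communication": (56.3, 71.1),
    "Translation": (36.8, 57.7),
    "Expert-useful outputs": (29.1, 44.7),
    "Uncertainty handling": (29.3, 44.8),
}

# Absolute gain in percentage points for each row.
gains = {name: round(rosalind - base, 1)
         for name, (base, rosalind) in rates.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: n=9 independent tasks behind the 71.1% figure.
low, high = wilson_interval(0.711, 9)
print(f"95% interval for 71.1% at n=9: {low:.1%} to {high:.1%}")
```

With so few tasks the interval is wide enough to contain GPT-5.5's 56.3%, so the Scientific Communication gain is best read as suggestive rather than settled.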

Challenge	GPT-Rosalind pass
Design / Optimization / Prediction	~30.7%
Analysis	~30.3%
With artifacts/URLs	28.1% (vs 45.1% text-only)
Exact numeric outputs	14.8%
Sequence/structure outputs	24.0%
Construct generation	27.3%

Tool	Question
Deployment Simulation	How will the model behave in ChatGPT traffic?
LifeSciBench	Can the model do PhD-level biotech tasks?

Audience	Takeaway
Biotech / pharma AI teams	Rubric-heavy evals match how QA and regulatory think
Model labs	Artifact-heavy multimodal science remains wide open
Investors	36% pass = far from automating drug discovery
Developers	GPT-Rosalind API access via OpenAI contributor program

LifeSciBench: OpenAI's 750-Task Benchmark for GPT-Rosalind in Biotech

TL;DR

Why Life Science Needs Its Own Benchmark

Seven Workflows Measured

Dataset Construction (Why Experts Trust It)

Grading: 19,020 Rubric Criteria

Example Task Flavor (DMD Gene Therapy)

Results: Where GPT-Rosalind Wins and Loses

Overall

Strongest workflows (Rosalind gains)

Weakest areas

GPT-Rosalind and Tacit Labs

Benchmark Politics on X

Relation to Deployment Simulation

Limitations (OpenAI's)

Who Should Care

Summary
