What is GeneBench-Pro?

GeneBench-Pro is a research-level benchmark from OpenAI (announced June 30, 2026) that tests whether AI agents can handle judgment-heavy computational biology — messy data, QC decisions, choosing analysis paths, and iterative refinement. It expands GeneBench with 129 problems across genomics, quantitative biology, and translational medicine. Each problem takes human experts an estimated 20–40 hours.

How well does GPT-5.6 Sol score on GeneBench-Pro?

GPT-5.6 Sol achieves a 28.7% pass rate at the highest reasoning level and 31.5% with Pro mode enabled, up from below 5% for GPT-5 when OpenAI began building the original GeneBench. At the lowest reasoning level, GPT-5.6 Sol scores in the single digits, which suggests that test-time compute matters heavily on these problems.

How is GeneBench-Pro different from typical biology benchmarks?

Most benchmarks pair curated data with a well-defined, routine analysis. GeneBench-Pro tests end-to-end scientific analysis: EDA, QC, modeling choice, diagnostics, and revision when the data contradict assumptions. OpenAI calls this chain of judgment calls "research taste."

Why does GeneBench-Pro use synthetic data?

Synthetic generation with known causal structure lets OpenAI tune difficulty, ensure that reasonable analytical choices still pass the numerical check, verify via ablation that wrong paths fail, and grade deterministically. That avoids the arbitrary author preferences that plague benchmarks built on messy historical datasets.
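To make the ablation idea concrete, here is a toy sketch in Python. It is not OpenAI's generator: the planted effect size, the batch shift, the tolerance, and every function name are illustrative assumptions. It plants a known treatment effect alongside a batch confounder, then checks that an analysis ignoring batch misses the target while a batch-stratified analysis lands inside the tolerance.

```python
import random

def simulate(n=4000, true_effect=2.0, seed=0):
    """Generate synthetic samples with a planted batch confounder (toy assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        batch = rng.random() < 0.5
        # Treated samples are over-represented in batch 1: this is the confounding.
        treated = rng.random() < (0.8 if batch else 0.2)
        # Batch adds a technical shift of +3 that has nothing to do with treatment.
        y = true_effect * float(treated) + 3.0 * float(batch) + rng.gauss(0, 1)
        rows.append((batch, treated, y))
    return rows

def mean(xs):
    return sum(xs) / len(xs)

def naive_effect(rows):
    """Wrong path: compare treated vs control while ignoring batch."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)

def batch_adjusted_effect(rows):
    """Reasonable path: estimate within each batch, then weight by batch size."""
    total = 0.0
    for b in (False, True):
        sub = [r for r in rows if r[0] == b]
        treated = [y for _, t, y in sub if t]
        control = [y for _, t, y in sub if not t]
        total += (mean(treated) - mean(control)) * len(sub) / len(rows)
    return total

def passes(estimate, target, tol=0.25):
    """Deterministic check: is the estimate within tol of the planted truth?"""
    return abs(estimate - target) <= tol

rows = simulate()
print("naive passes:   ", passes(naive_effect(rows), 2.0))
print("adjusted passes:", passes(batch_adjusted_effect(rows), 2.0))
```

Because the truth is planted, the check needs no human judgment about which answer "looks right": the confounded estimate drifts well away from the planted effect, and the stratified one does not.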

Is GeneBench-Pro open source?

OpenAI open-sourced 10 representative questions on Hugging Face with an interactive web interface. A 50-question subset will go to Artificial Analysis for independent third-party benchmarking. The full 129-problem suite and paper are linked from the official announcement.

How does GeneBench-Pro relate to LifeSciBench?

LifeSciBench (June 17, 2026) tests industry biotech workflows — FDA meetings, assay design, regulatory critique — with rubric-graded free responses. GeneBench-Pro tests computational biology research judgment with deterministic numerical grading on synthetic genomics problems. Both probe life-science AI beyond textbook QA.

GeneBench-Pro: OpenAI Computational Biology Benchmark + GPT-5.6 Sol Scores | explainx.ai Blog

Scientific data rarely arrive with instructions. A computational biologist must decide whether a signal is biology or batch noise, whether the cohort supports the estimand, and when to abandon a first analysis plan. Recalling facts and running a canned pipeline are not the same skill.

On June 30, 2026, OpenAI introduced GeneBench-Pro — a research-level benchmark built to measure that harder layer: judgment-heavy analysis in computational biology. Greg Brockman amplified it July 1: problems that would take a human expert 20–40 hours each, with GPT-5.6 Sol as the current frontier result.

This post breaks down what GeneBench-Pro tests, how its problems are built, where models fail, and why the scores matter amid the hype around GPT-5.6 availability.

Update — July 11, 2026: OpenAI's $50K Bio Bounty stress-tests biosafety safeguards on GPT-5.6 the same week capability benchmarks rise — capability and red-team moving together.

TL;DR

Typical benchmark scope	Real end-to-end analysis
Curated, cleaned dataset	Raw assay + phenotype + clinical context
Execute well-defined routine	EDA, QC, preprocessing — outliers? batch effects?
Result	Modeling + diagnostics + refinement loop
	Conclusion that drives translational or clinical decision

Domain	n	Sub-domains (examples)
Population genetics	21	Admixture & aDNA, history & genealogies, selection & mutation
Clinical, PGx & diagnostics	26	Variant interpretation, pharmacogenomics, prenatal/clinical risk
Statistical genetics	17	Association & correction, causal mapping
Quantitative genetics	17	Trait architecture, family/transmission effects, genomic selection
Regulatory omics	17	Regulatory QTLs, transcriptome structure, spatial/chromatin context
Cancer genomics	10	Somatic genomics, liquid biopsy
Functional genomics	9	Functional genomics workflows
Proteomics	7	Proteomics and biomarkers
Microbial genomics	3	Metagenomic genomics
Forensic genetics	2	Forensic genetics
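As a quick sanity check, the per-domain counts in the table above can be summed to confirm they match the headline figure of 129 problems across 10 domains. The snippet below simply transcribes the table; the variable names are illustrative.

```python
# Problem counts per domain, transcribed from the table above.
domain_counts = {
    "Population genetics": 21,
    "Clinical, PGx & diagnostics": 26,
    "Statistical genetics": 17,
    "Quantitative genetics": 17,
    "Regulatory omics": 17,
    "Cancer genomics": 10,
    "Functional genomics": 9,
    "Proteomics": 7,
    "Microbial genomics": 3,
    "Forensic genetics": 2,
}

total = sum(domain_counts.values())
largest = max(domain_counts, key=domain_counts.get)

print(f"{len(domain_counts)} domains, {total} problems")
print(f"largest domain: {largest} ({domain_counts[largest] / total:.1%} of the suite)")
```

The distribution is uneven: the three smallest domains together hold only 12 problems, so per-domain pass rates there will be noisy.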

json

{
  "answer": {
    "therapy_class_code": 1,
    "benefit_rd_pp": 12.4,
    "toxicity_dropout_risk_pp": 8.1,
    "net_clinical_utility_pp": 9.6
  },
  "reasoning": "Marginal structural Cox model; excluded prevalent users; 90-day efficacy lag..."
}
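A deterministic grader for an answer in this shape could look roughly like the Python sketch below. The target values, the per-field tolerances, and the `grade` function are hypothetical; the post does not publish the real targets or the pass criteria, so treat this as one plausible reading of "deterministic numerical targets."

```python
import json

def grade(submission_json, targets):
    """Return True only if every graded field is within its tolerance.

    `targets` maps field name -> (target value, absolute tolerance).
    A tolerance of 0 demands an exact match, e.g. for categorical codes.
    """
    try:
        answer = json.loads(submission_json)["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    if not isinstance(answer, dict):
        return False
    for field, (target, tol) in targets.items():
        value = answer.get(field)
        # Reject missing fields and non-numeric values (bool is excluded too).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if abs(value - target) > tol:
            return False
    return True

# Hypothetical targets and tolerances, for illustration only.
TARGETS = {
    "therapy_class_code": (1, 0),
    "benefit_rd_pp": (12.0, 1.0),
    "toxicity_dropout_risk_pp": (8.0, 1.0),
    "net_clinical_utility_pp": (9.5, 1.0),
}

submission = json.dumps({
    "answer": {
        "therapy_class_code": 1,
        "benefit_rd_pp": 12.4,
        "toxicity_dropout_risk_pp": 8.1,
        "net_clinical_utility_pp": 9.6,
    },
    "reasoning": "Marginal structural Cox model; excluded prevalent users; 90-day efficacy lag...",
})

print(grade(submission, TARGETS))
```

Note that the `reasoning` field is ignored in this sketch: only the numbers are checked, which is what makes the result reproducible from run to run.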

Model / setting	GeneBench-Pro pass rate
GPT-5.6 Sol (highest reasoning)	28.7%
GPT-5.6 Sol + Pro mode	31.5%
GPT-5.6 Sol (lowest reasoning)	Single digits
GPT-5 (when GeneBench work began)	Below 5%
GPT-5.2 (high reasoning comparison)	~5× fewer solves than GPT-5.6 Sol at high reasoning, using more tokens

Failure mode	Example
Data QC blindness	Ancestry swaps, C>T bias in ancient DNA, batch artifacts — agents "aren't cautious enough" (Lex Flagel, Gencove)
Wrong tool, right vibe	Conventional Cox model used when marginal structural models are needed to handle treatment-confounder feedback
Partial progress	Makes observations but does not integrate them into a revised plan (a novice pattern, versus expert reframing)
Solver contract sensitivity	Prompt wording changes which analyses appear permissible (Cyrillus Tan, NYGC)

	GeneBench-Pro	LifeSciBench
Focus	Computational biology research judgment	Industry biotech workflows
Tasks	129 synthetic genomics/QTL/PGx analyses	750 expert-written FDA, assay, regulatory tasks
Grading	Deterministic numerical targets	Rubric on free-response answers
Hero metric	GPT-5.6 Sol ~31.5%	GPT-Rosalind ~36% (different task distribution)
Open data	10 problems on Hugging Face	Paper + contributor program

Announced	June 30, 2026 — OpenAI Research
Scope	129 problems · 10 domains · 21 sub-domains
Skill tested	"Research taste" — QC, path choice, iteration, decision-ready conclusions
Human effort	20–40 hours per problem (~$4k–8k at $200/hr reviewer estimate)
Best score	GPT-5.6 Sol: 28.7% (highest reasoning) · 31.5% with Pro mode
Baseline progress	Original GeneBench: GPT-5 below 5% when benchmark work began
Data design	Synthetic with known causal structure — deterministic grading
Open release	10 questions on Hugging Face + web UI; 50-question subset → Artificial Analysis
Agent environment	Isolated workspace — Python, scientific stack, PLINK 2.0, genomics libs
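The effort and score figures above lend themselves to some back-of-the-envelope arithmetic. The Python sketch below uses only numbers quoted in this post. The suite-wide totals and the "implied solves" are my own extrapolations, not figures from OpenAI, and the solve counts assume the pass rate is a simple fraction of the 129 problems, which may not hold if it is averaged over multiple attempts.

```python
N_PROBLEMS = 129
HOURS = (20, 40)   # expert hours per problem, low and high estimates
RATE_USD = 200     # reviewer hourly rate quoted above

# Per-problem cost range: hours x rate.
cost_per_problem = tuple(h * RATE_USD for h in HOURS)

# Extrapolated to the whole suite (an assumption, not a published figure).
suite_hours = tuple(h * N_PROBLEMS for h in HOURS)
suite_cost = tuple(c * N_PROBLEMS for c in cost_per_problem)

def implied_solves(pass_rate_pct, n=N_PROBLEMS):
    """Problems solved if the pass rate were a plain fraction of n."""
    return pass_rate_pct / 100 * n

print("cost per problem (USD):", cost_per_problem)
print("suite expert hours:    ", suite_hours)
print("suite cost (USD):      ", suite_cost)
print("implied solves at 28.7%:", round(implied_solves(28.7)))
print("implied solves at 31.5%:", round(implied_solves(31.5)))
```

Under those assumptions, the per-problem range reproduces the $4k–8k estimate in the table, and the gap between the two headline scores corresponds to only a handful of problems.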

GeneBench-Pro: OpenAI's Research-Level Benchmark for Computational Biology Judgment

TL;DR

The Benchmark Gap in Biology

129 Problems Across Computational Biology

Why Synthetic Data — Not Messy Historical Cohorts

Construction and Validation Workflow

What Agents Actually Receive

Results — GPT-5.6 Sol Leads, Room to Grow

Where Models Fail — The Inferential Loop

GeneBench-Pro vs LifeSciBench

Community Reaction (July 1, 2026)

Why It Matters Beyond the Leaderboard

Try It Yourself

Related Reading