jang1563/bioreview-bench
Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jang1563/bioreview-bench
Resource description:
---
language:
- en
license: other
task_categories:
- text-classification
- text-generation
tags:
- peer-review
- biomedical
- benchmark
- scientific-review
- elife
- plos
- f1000research
- peerj
- nature
- rebuttal
- open-peer-review
pretty_name: "BioReview-Bench"
size_categories:
- 1K<n<10K
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path: "data/default/train.jsonl"
  - split: validation
    path: "data/default/validation.jsonl"
  - split: test
    path: "data/default/test.jsonl"
- config_name: benchmark
  data_files:
  - split: train
    path: "data/benchmark/train.jsonl"
  - split: validation
    path: "data/benchmark/validation.jsonl"
  - split: test
    path: "data/benchmark/test.jsonl"
- config_name: concerns_flat
  data_files:
  - split: train
    path: "data/concerns_flat/train.jsonl"
  - split: validation
    path: "data/concerns_flat/validation.jsonl"
  - split: test
    path: "data/concerns_flat/test.jsonl"
- config_name: elife
  data_files:
  - split: train
    path: "data/elife/train.jsonl"
  - split: validation
    path: "data/elife/validation.jsonl"
  - split: test
    path: "data/elife/test.jsonl"
- config_name: plos
  data_files:
  - split: train
    path: "data/plos/train.jsonl"
  - split: validation
    path: "data/plos/validation.jsonl"
  - split: test
    path: "data/plos/test.jsonl"
- config_name: f1000
  data_files:
  - split: train
    path: "data/f1000/train.jsonl"
  - split: validation
    path: "data/f1000/validation.jsonl"
  - split: test
    path: "data/f1000/test.jsonl"
- config_name: peerj
  data_files:
  - split: train
    path: "data/peerj/train.jsonl"
  - split: validation
    path: "data/peerj/validation.jsonl"
  - split: test
    path: "data/peerj/test.jsonl"
- config_name: nature
  data_files:
  - split: train
    path: "data/nature/train.jsonl"
  - split: validation
    path: "data/nature/validation.jsonl"
  - split: test
    path: "data/nature/test.jsonl"
dataset_info:
- config_name: default
  splits:
  - name: train
    num_examples: 5387
  - name: validation
    num_examples: 953
  - name: test
    num_examples: 600
- config_name: benchmark
  splits:
  - name: train
    num_examples: 5387
  - name: validation
    num_examples: 953
  - name: test
    num_examples: 600
- config_name: concerns_flat
  splits:
  - name: train
    num_examples: 79121
  - name: validation
    num_examples: 14101
  - name: test
    num_examples: 8647
- config_name: elife
  splits:
  - name: train
    num_examples: 1409
  - name: validation
    num_examples: 251
  - name: test
    num_examples: 150
- config_name: plos
  splits:
  - name: train
    num_examples: 1349
  - name: validation
    num_examples: 238
  - name: test
    num_examples: 150
- config_name: f1000
  splits:
  - name: train
    num_examples: 2149
  - name: validation
    num_examples: 380
  - name: test
    num_examples: 150
- config_name: peerj
  splits:
  - name: train
    num_examples: 165
  - name: validation
    num_examples: 29
  - name: test
    num_examples: 50
- config_name: nature
  splits:
  - name: train
    num_examples: 315
  - name: validation
    num_examples: 55
  - name: test
    num_examples: 100
---
# BioReview-Bench
A benchmark and training dataset for AI-assisted biomedical peer review.
- **6,940 articles** with **101,869 reviewer concerns**
- Sources: elife (1810), f1000 (2679), nature (470), peerj (244), plos (1737)
- Concern-level labels: 9 categories, 3 severity levels, 5 author stance types
- License: benchmark metadata CC-BY-NC-4.0 | source content follows per-source terms | code Apache-2.0
## What makes this dataset unique
To our knowledge, no other publicly available dataset provides **structured,
concern-level peer review data** for biomedical papers with:
- Categorised reviewer concerns (design flaw, statistical methodology, etc.)
- Severity labels (major / minor / optional)
- Author response tracking (conceded / rebutted / partial / unclear / no_response)
- Evidence-of-change flags
## Configs
| Config | Total rows | Total concerns |
|--------|-----------|---------------|
| `default` | 6,940 | 101,869 |
| `benchmark` | 6,940 | 93,222 |
| `concerns_flat` | 101,869 | 101,869 |
| `elife` | 1,810 | 11,772 |
| `plos` | 1,737 | 33,160 |
| `f1000` | 2,679 | 45,248 |
| `peerj` | 244 | 5,003 |
| `nature` | 470 | 6,686 |
- **`default`**: Full data — all fields, all sources. Use for analysis and research.
- **`benchmark`**: Task input format for AI review tool evaluation. Train/val include
simplified concerns (text + category + severity). Test split has `concerns=[]` to
prevent label leakage.
- **`concerns_flat`**: One row per concern with article context. Suited to rebuttal
generation training and stance classification. PLOS entries are included; for rebuttal
tasks, keep only rows where `author_stance != "no_response"`.
- **`elife`** / **`plos`** / **`f1000`** / **`peerj`** / **`nature`**: Source-specific subsets of `default`.
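
The `no_response` filter suggested for `concerns_flat` can be sketched as a plain-Python predicate. The helper name `rebuttal_rows` and the sample rows are hypothetical; only the `author_stance` field and its values come from the schema.

```python
def rebuttal_rows(rows):
    """Yield only the rows whose concern received an author response."""
    for row in rows:
        if row["author_stance"] != "no_response":
            yield row


# Hypothetical rows shaped like `concerns_flat` records (IDs are made up).
sample_rows = [
    {"concern_id": "demo:1:R1C1", "author_stance": "conceded"},
    {"concern_id": "demo:2:R1C1", "author_stance": "no_response"},
    {"concern_id": "demo:1:R1C2", "author_stance": "rebutted"},
]

kept_ids = [row["concern_id"] for row in rebuttal_rows(sample_rows)]
print(kept_ids)
```

On a loaded split, the same predicate should work through `Dataset.filter`, for example `ds["train"].filter(lambda row: row["author_stance"] != "no_response")`.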
## Quick start
```python
from datasets import load_dataset

# Full dataset (default config)
ds = load_dataset("jang1563/bioreview-bench")

# Benchmark evaluation: test split has no concerns (your tool generates them)
ds = load_dataset("jang1563/bioreview-bench", "benchmark")
for article in ds["test"]:
    text = article["paper_text_sections"]
    # ... run your review tool, then evaluate with bioreview_bench.evaluate.metrics

# Training a review generation model
ds = load_dataset("jang1563/bioreview-bench", "benchmark")
for article in ds["train"]:
    target_concerns = article["concerns"]  # [{concern_text, category, severity}]

# Rebuttal generation / stance classification
ds = load_dataset("jang1563/bioreview-bench", "concerns_flat")
for row in ds["train"]:
    concern = row["concern_text"]
    response = row["author_response_text"]
    stance = row["author_stance"]  # conceded / rebutted / partial / unclear / no_response

# Source-specific analysis
ds = load_dataset("jang1563/bioreview-bench", "elife")
```
## Schema
### Article fields (default config)
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Article ID (e.g. `elife:84798`) |
| `source` | string | Journal source (`elife`, `plos`, `f1000`, `peerj`, `nature`) |
| `doi` | string | Article DOI |
| `title` | string | Article title |
| `abstract` | string | Abstract text |
| `subjects` | list[string] | Subject areas |
| `published_date` | string | ISO date |
| `paper_text_sections` | dict | Section name → text |
| `decision_letter_raw` | string | Raw peer review text |
| `author_response_raw` | string | Raw author response |
| `concerns` | list[object] | Extracted reviewer concerns |
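
As an illustration of how the article fields might feed a review tool, the sketch below joins `title`, `abstract`, and `paper_text_sections` into one input string. The function name `build_review_input`, the `max_chars` cut-off, and the sample article are assumptions, not part of the dataset or its package. Empty sections are skipped on the assumption that a section absent from an article may load as `None`.

```python
def build_review_input(article, max_chars=20000):
    """Join title, abstract, and non-empty sections into one prompt string."""
    parts = [f"# {article['title']}", article["abstract"]]
    for name, text in article["paper_text_sections"].items():
        if text:  # skip sections that are missing or empty for this article
            parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)[:max_chars]


# Hypothetical article; real records carry more fields (see the table above).
article = {
    "title": "Example title",
    "abstract": "Example abstract.",
    "paper_text_sections": {
        "Introduction": "Intro text.",
        "Methods": None,
        "Results": "Result text.",
    },
}

prompt = build_review_input(article)
print(prompt)
```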
### Concern fields
| Field | Type | Description |
|-------|------|-------------|
| `concern_id` | string | Unique ID (e.g. `elife:84798:R1C3`) |
| `concern_text` | string | Reviewer's concern (10-2000 chars) |
| `category` | string | One of 9 types (see below) |
| `severity` | string | `major` / `minor` / `optional` |
| `author_response_text` | string | Author's response to this concern |
| `author_stance` | string | `conceded` / `rebutted` / `partial` / `unclear` / `no_response` |
| `evidence_of_change` | bool? | Whether author made revisions |
| `resolution_confidence` | float | LLM confidence (0.0-1.0) |
### Concern categories
`design_flaw`, `statistical_methodology`, `missing_experiment`, `figure_issue`,
`prior_art_novelty`, `writing_clarity`, `reagent_method_specificity`,
`interpretation`, `other`
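
The label sets and value ranges listed above can be checked mechanically. The following is a minimal sketch of such a check for concerns from the `default` config; `validate_concern` and the sample records are hypothetical and not part of the `bioreview_bench` package.

```python
CATEGORIES = {
    "design_flaw", "statistical_methodology", "missing_experiment",
    "figure_issue", "prior_art_novelty", "writing_clarity",
    "reagent_method_specificity", "interpretation", "other",
}
SEVERITIES = {"major", "minor", "optional"}
STANCES = {"conceded", "rebutted", "partial", "unclear", "no_response"}


def validate_concern(concern):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    if concern.get("category") not in CATEGORIES:
        problems.append("unknown category")
    if concern.get("severity") not in SEVERITIES:
        problems.append("unknown severity")
    if concern.get("author_stance") not in STANCES:
        problems.append("unknown author_stance")
    if not 10 <= len(concern.get("concern_text") or "") <= 2000:
        problems.append("concern_text length outside 10-2000 chars")
    confidence = concern.get("resolution_confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append("resolution_confidence outside 0.0-1.0")
    return problems


# Hypothetical records for illustration only.
good = {
    "category": "design_flaw",
    "severity": "major",
    "author_stance": "conceded",
    "concern_text": "The control group is missing from the main comparison.",
    "resolution_confidence": 0.9,
}
bad = dict(good, category="typo", concern_text="Too short", resolution_confidence=1.5)

print(validate_concern(good))
print(validate_concern(bad))
```

Concerns in the `benchmark` config carry only text, category, and severity, so the stance check would need to be dropped there.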
## Leaderboard (test split)
| Rank | Tool | Version | Recall | Precision | F1 | Major Recall |
|------|------|---------|--------|-----------|-----|--------------|
| 1 | Haiku-4.5 | claude-haiku-4-5-20251001 | 0.725 | 0.675 | 0.699 | 0.872 |
| 2 | GPT-4o-mini | gpt-4o-mini | 0.684 | 0.703 | 0.694 | 0.840 |
| 3 | Gemini-2.5-Flash | gemini-2.5-flash | 0.665 | 0.709 | 0.686 | 0.832 |
| 4 | BM25 | bm25-specter2 | 0.637 | 0.741 | 0.685 | 0.794 |
| 5 | Gemini-Flash-Lite | gemini-2.5-flash-lite | 0.615 | 0.708 | 0.658 | 0.781 |
| 6 | Llama-3.3-70B | llama-3.3-70b | 0.554 | 0.794 | 0.653 | 0.753 |
> Matching: SPECTER2 cosine similarity, threshold=0.65, Hungarian bipartite matching.
> Figure-issue concerns excluded. 944 scored articles.
> Submit results via [GitHub](https://github.com/jang1563/bioreview-bench).
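
The matching note can be illustrated on toy vectors. This sketch is not the official harness: it uses hand-made 2-D vectors instead of SPECTER2 embeddings, substitutes an exhaustive search for the Hungarian algorithm (it finds an optimal one-to-one assignment too, but is only practical for a handful of concerns), and assumes that pairs below the 0.65 threshold earn no credit during assignment. All function names are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_assignment(gain):
    """Exhaustively find the one-to-one row/column pairing with maximal total gain."""
    n_rows, n_cols = len(gain), len(gain[0])
    transposed = n_rows > n_cols
    if transposed:  # always permute over the larger side
        gain = [list(col) for col in zip(*gain)]
        n_rows, n_cols = n_cols, n_rows
    best_total, best_perm = -1.0, ()
    for perm in permutations(range(n_cols), n_rows):
        total = sum(gain[r][c] for r, c in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    pairs = list(enumerate(best_perm))
    return [(c, r) for r, c in pairs] if transposed else pairs


def score(pred_vecs, gold_vecs, threshold=0.65):
    """Precision, recall, and F1 from thresholded one-to-one matches."""
    if not pred_vecs or not gold_vecs:
        return {"matches": [], "precision": 0.0, "recall": 0.0, "f1": 0.0}
    sim = [[cosine(p, g) for g in gold_vecs] for p in pred_vecs]
    # Assumption: pairs below the threshold earn no credit during assignment.
    gain = [[s if s >= threshold else 0.0 for s in row] for row in sim]
    matches = [(p, g) for p, g in best_assignment(gain) if sim[p][g] >= threshold]
    precision = len(matches) / len(pred_vecs)
    recall = len(matches) / len(gold_vecs)
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return {"matches": matches, "precision": precision, "recall": recall, "f1": f1}


# Toy stand-ins for embeddings: three predicted concerns, two reference concerns.
pred_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gold_vecs = [[1.0, 0.1], [0.1, 1.0]]
result = score(pred_vecs, gold_vecs)
print(result)
```

For realistic numbers of concerns, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time.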
## License
- **Benchmark annotations and packaging metadata**: CC-BY-NC-4.0.
- **Underlying article, review, and author-response content**: source-specific.
Redistribution is not uniform across all sources; follow `LICENSE_MATRIX.md`
in the GitHub repository and the original publisher terms.
- **Code** (Python package, evaluation harness): Apache-2.0.
See the [GitHub repository](https://github.com/jang1563/bioreview-bench) for
full license details.
## Citation
If you use this dataset, please cite:
```bibtex
@misc{bioreview-bench,
title={BioReview-Bench: A Benchmark for AI-Assisted Biomedical Peer Review},
author={Kim, JangKeun},
year={2026},
url={https://huggingface.co/datasets/jang1563/bioreview-bench}
}
```
Provided by: jang1563



