Wilhelm-Foundation/rare-archive-eval-rarearena-rds
收藏Hugging Face2026-03-26 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
config_name: default
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: metadata
struct:
- name: case_id
dtype: string
- name: split
dtype: string
- name: disease_id
dtype: string
- name: patient_category
dtype: string
splits:
- name: test
num_examples: 8562
tags:
- rare-disease
- clinical-diagnostics
- rarearena_eval
- rare-ai-archive
- evaluation
- benchmark
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 1K<n<10K
license: cc-by-nc-sa-4.0
---
# RareArena RDS — Rare Disease Specialists Evaluation Benchmark
8,562 clinical vignettes across 4,000+ rare diseases for evaluating AI diagnostic reasoning. Part of the [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive).
> **Research use only.** This dataset is an evaluation benchmark for AI systems. It is NOT intended for clinical decision-making and should NOT be used as a diagnostic tool.
## Dataset Description
- **Repository:** [Wilhelm-Foundation/rare-archive-eval-rarearena-rds](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds)
- **License:** CC BY-NC-SA 4.0
- **Version:** 0.1.0
- **Part of:** [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive) · [Complete Toolkit Collection](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851)
## Ecosystem Context
This evaluation benchmark measures how well models handle the **diagnostic reasoning patterns** that rare disease specialists use every day. These patterns — which tool to invoke, in what order, for which symptom constellations — are the exact traces captured during [Undiagnosed Patient Hackathons](https://www.nature.com/articles/d41586-026-00302-8) and structured by clinician validators.
RDS vignettes test the core of the agentic system: can the model reason through a clinical presentation and produce a meaningful differential diagnosis? As the ecosystem grows, clinician evaluations in the [ELO Arena](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md) generate preference data that complements this benchmark — revealing not just *what* the model gets right, but *how well* it reasons.
Disease categories in this dataset map to the ontology's clustering scheme, enabling evaluation of [condition-specific model adapters](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1#condition-specific-models) (IEM, Neuromuscular, Connective Tissue, and more) alongside the foundation model.
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `messages` | list | Chat-format messages (system prompt, user vignette, assistant diagnosis) |
| `metadata.case_id` | string | Unique case identifier |
| `metadata.split` | string | Source split (`rds`) |
| `metadata.disease_id` | string | Orphanet disease ID |
| `metadata.patient_category` | string | Patient category (where available) |
### Data Splits
| Split | Records | Purpose |
|-------|---------|---------|
| test | 8,562 | Evaluation benchmark |
### Message Format
Each record follows the OpenAI chat format with three messages:
1. **System**: Expert rare disease diagnostician instruction
2. **User**: Clinical vignette describing patient presentation
3. **Assistant**: Expected diagnostic reasoning and differential
## Dataset Creation
### Source Data
Derived from the [RareArena](https://github.com/zhao-zy15/RareArena) RDS (Rare Disease Specialists) benchmark. Original cases are drawn from published case reports in PubMed Central (PMC), rewritten by GPT-4o to remove identifying information while preserving clinical detail.
### Data Processing
1. Case reports sourced from PMC literature
2. Clinical vignettes generated via GPT-4o rewriting (de-identification)
3. Formatted as OpenAI chat JSONL via `rare-archive-datasets v0.1.0` `parse_case()` (v3 format)
4. Published to HuggingFace Hub
### PHI Status
**no_phi** — All vignettes are GPT-4o rewrites of published case reports. No real patient data is included.
## Intended Use
**Primary use**: Evaluation benchmark for rare disease diagnostic AI models.
- Measure Top-K differential diagnosis accuracy
- Compare model performance across disease categories
- Assess reasoning quality on clinical vignettes
**Part of the ecosystem flywheel**: RDS evaluation results reveal where models need improvement → clinicians provide corrections in the Arena → corrections become new training data → the next model version is evaluated on this same benchmark. The cycle repeats.
**NOT intended for**: Clinical decision-making, patient diagnosis, training data (use [synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients) for SFT training).
## Bias, Risks & Limitations
### Known Biases
| Bias Category | Description | Impact |
|---------------|-------------|--------|
| **Publication bias** | Over-represents diseases with published case reports in PMC | Common rare diseases over-represented; ultra-rare conditions under-represented |
| **Language bias** | English-only dataset | Excludes non-English medical literature and clinical presentations |
| **Geographic bias** | Cases drawn from institutions that publish in English-language journals | Western/academic medical center presentations over-represented |
| **Rewriting artifacts** | GPT-4o rewriting may alter clinical nuance | Some diagnostic subtleties may be lost or standardized |
| **Temporal bias** | Case reports span multiple decades | Older cases may reflect outdated diagnostic criteria or terminology |
### Risks
- **Not for clinical use**: This is an evaluation benchmark, not a diagnostic tool
- **Metric limitations**: Top-K accuracy on curated vignettes does not reflect real-world diagnostic performance
- **Distribution mismatch**: Published case reports are not representative of clinical practice populations
### Limitations
- Monolingual (English only)
- No laboratory test results in RDS vignettes (see [RDC](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) for cases with lab data)
- Disease coverage limited to conditions with published PMC case reports
- No severity or acuity annotations
## Loading the Dataset
### Using HuggingFace Datasets
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rds")
# Access the test split
test = dataset["test"]
print(f"Total cases: {len(test)}")
# Inspect a single case
example = test[0]
for msg in example["messages"]:
print(f"[{msg['role']}] {msg['content'][:100]}...")
print(f"Disease ID: {example['metadata']['disease_id']}")
```
### Using Pandas
```python
import pandas as pd
# Load from parquet (faster)
df = pd.read_parquet(
"hf://datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds/data/test-00000-of-00001.parquet"
)
print(f"Shape: {df.shape}")
print(df.head())
```
### Filtering by Disease
```python
from datasets import load_dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rds", split="test")
# Filter to a specific Orphanet disease ID
gaucher_cases = dataset.filter(lambda x: x["metadata"]["disease_id"] == "355")
print(f"Gaucher disease cases: {len(gaucher_cases)}")
```
## Related Resources
| Resource | Link |
|----------|------|
| **Model** | [rare-archive-qwen-4b-sft-v1](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1) — 4B SFT model trained on rare disease cases |
| **RDC Benchmark** | [rare-archive-eval-rarearena-rdc](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) — 4,376 cases with laboratory test results |
| **Training Data** | [rare-archive-synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients) — 12,984 synthetic vignettes for SFT |
| **Clinical Demo** | [rare-archive-clinical-demo](https://huggingface.co/spaces/Wilhelm-Foundation/rare-archive-clinical-demo) — Interactive demo Space |
| **Collection** | [Rare AI Archive — Complete Toolkit](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) |
| **GitHub** | [Wilhelm-Foundation/rare-archive](https://github.com/Wilhelm-Foundation/rare-archive) |
## Citation
```bibtex
@misc{rarearena2024,
title={RareArena: A Benchmark for Rare Disease Diagnosis},
author={Zhao, Zhiyu and others},
year={2024},
url={https://github.com/zhao-zy15/RareArena}
}
```
*A program of the [Wilhelm Foundation](https://wilhelm.foundation). Built on [Lattice Protocol](https://github.com/LatticeProtocol). No disease is too rare to matter.*
提供机构:
Wilhelm-Foundation



