Wilhelm-Foundation/rare-archive-eval-rarearena-rdc
收藏Hugging Face2026-03-26 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
config_name: default
features:
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: metadata
struct:
- name: case_id
dtype: string
- name: split
dtype: string
- name: disease_id
dtype: string
- name: patient_category
dtype: string
splits:
- name: test
num_examples: 4376
tags:
- rare-disease
- clinical-diagnostics
- rarearena_eval
- rare-ai-archive
- evaluation
- benchmark
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 1K<n<10K
license: cc-by-nc-sa-4.0
---
# RareArena RDC — Rare Disease Cases Evaluation Benchmark
4,376 clinical vignettes **with laboratory test results** across rare diseases for evaluating AI diagnostic reasoning with lab data. Part of the [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive).
> **Research use only.** This dataset is an evaluation benchmark for AI systems. It is NOT intended for clinical decision-making and should NOT be used as a diagnostic tool.
## Dataset Description
- **Repository:** [Wilhelm-Foundation/rare-archive-eval-rarearena-rdc](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc)
- **License:** CC BY-NC-SA 4.0
- **Version:** 0.1.0
- **Part of:** [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive) · [Complete Toolkit Collection](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851)
## How RDC Differs from RDS
| Feature | RDS | RDC |
|---------|-----|-----|
| **Records** | 8,562 | 4,376 |
| **Lab results** | No | **Yes** — vignettes include laboratory and diagnostic test results |
| **Case complexity** | Clinical presentation only | Clinical presentation + interpreted lab data |
| **Use case** | Evaluate clinical reasoning from symptoms | Evaluate reasoning from symptoms **and** test results |
RDC cases are generally more complex — they require the model to integrate laboratory findings with clinical presentation, closer to real-world diagnostic workflows.
## Ecosystem Context
RDC cases add a critical dimension to evaluation: **laboratory test results**. In the full agentic diagnostic system, models learn to invoke clinical tools like ClinVar and gnomAD, then interpret their results alongside clinical presentations. This dataset tests that capability — can the model integrate lab data into its diagnostic reasoning?
This maps directly to Stage 2 of the [4-stage training pipeline](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md), where models learn to invoke tools and interpret real API responses. RDC evaluation measures whether tool-augmented reasoning produces better differentials than clinical presentation alone.
Disease categories in this dataset map to the ontology's clustering scheme, enabling evaluation of [condition-specific model adapters](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1#condition-specific-models) alongside the foundation model.
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `messages` | list | Chat-format messages (system prompt, user vignette + test results, assistant diagnosis) |
| `metadata.case_id` | string | Unique case identifier |
| `metadata.split` | string | Source split (`rdc`) |
| `metadata.disease_id` | string | Orphanet disease ID |
| `metadata.patient_category` | string | Patient category (where available) |
### Data Splits
| Split | Records | Purpose |
|-------|---------|---------|
| test | 4,376 | Evaluation benchmark |
### Message Format
Each record follows the OpenAI chat format:
1. **System**: Expert rare disease diagnostician instruction
2. **User**: Clinical vignette **with laboratory/diagnostic test results**
3. **Assistant**: Expected diagnostic reasoning and differential
## Dataset Creation
### Source Data
Derived from the [RareArena](https://github.com/zhao-zy15/RareArena) RDC (Rare Disease Cases) benchmark. Cases are drawn from published case reports in medical literature, rewritten by GPT-4o. Test results are concatenated with clinical vignettes to form comprehensive presentations.
### Data Processing
1. Case reports sourced from published medical literature
2. Clinical vignettes + test results generated via GPT-4o (de-identification)
3. Test results concatenated to clinical vignettes in the user message
4. Formatted as OpenAI chat JSONL via `rare-archive-datasets v0.1.0` `parse_case()` (v3 format)
### PHI Status
**no_phi** — All vignettes are GPT-4o rewrites of published case reports. No real patient data.
## Intended Use
**Primary use**: Evaluation benchmark for rare disease diagnostic AI models, specifically testing the ability to integrate laboratory findings with clinical reasoning.
- Measure Top-K differential diagnosis accuracy with lab context
- Compare model performance: RDS-only vs RDC (does lab data improve accuracy?)
- Benchmark lab result interpretation in diagnostic reasoning
**NOT intended for**: Clinical decision-making, patient diagnosis, or model training.
**Part of the ecosystem flywheel**: RDC evaluation reveals how well models interpret clinical data — a key input for Arena evaluators who score Tool Usage as one of the [5 quality dimensions](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md#rlhf-feedback-loop).
## Bias, Risks & Limitations
### Known Biases
| Bias Category | Description | Impact |
|---------------|-------------|--------|
| **Publication bias** | Over-represents diseases with published case reports | Common rare diseases over-represented; ultra-rare conditions under-represented |
| **Language bias** | English-only dataset | Non-English medical literature excluded |
| **Geographic bias** | Cases from English-language publishing institutions | Western academic medical centers over-represented |
| **Lab availability bias** | Cases include specific lab tests | Biased toward well-resourced clinical settings with access to specialized testing |
| **Rewriting artifacts** | GPT-4o rewriting may alter clinical nuance | Lab value interpretation may be standardized |
### Risks
- **Not for clinical use**: Evaluation benchmark only
- **Lab interpretation limits**: AI performance on curated lab results does not reflect performance on real clinical lab reports
- **Smaller sample**: 4,376 cases provides less statistical power than RDS for rare disease subgroup analysis
### Limitations
- Monolingual (English only)
- Smaller than RDS (4,376 vs 8,562) — fewer diseases covered
- Lab results are text descriptions, not structured lab values
- No temporal ordering of tests within a case
## Loading the Dataset
### Using HuggingFace Datasets
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rdc")
# Access the test split
test = dataset["test"]
print(f"Total cases: {len(test)}")
# Inspect a case with lab results
example = test[0]
for msg in example["messages"]:
print(f"[{msg['role']}] {msg['content'][:150]}...")
```
### Using Pandas
```python
import pandas as pd
df = pd.read_parquet(
"hf://datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc/data/test-00000-of-00001.parquet"
)
print(f"Shape: {df.shape}")
```
### Comparing RDS vs RDC Performance
```python
from datasets import load_dataset
rds = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rds", split="test")
rdc = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rdc", split="test")
print(f"RDS cases: {len(rds)} (clinical vignettes only)")
print(f"RDC cases: {len(rdc)} (vignettes + lab results)")
# Find overlapping disease IDs
rds_diseases = set(x["metadata"]["disease_id"] for x in rds)
rdc_diseases = set(x["metadata"]["disease_id"] for x in rdc)
overlap = rds_diseases & rdc_diseases
print(f"Shared diseases: {len(overlap)}")
```
## Related Resources
| Resource | Link |
|----------|------|
| **Model** | [rare-archive-qwen-4b-sft-v1](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1) — 4B SFT model trained on rare disease cases |
| **RDS Benchmark** | [rare-archive-eval-rarearena-rds](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds) — 8,562 clinical vignettes (no lab data) |
| **Training Data** | [rare-archive-synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients) — 12,984 synthetic vignettes for SFT |
| **Clinical Demo** | [rare-archive-clinical-demo](https://huggingface.co/spaces/Wilhelm-Foundation/rare-archive-clinical-demo) — Interactive demo Space |
| **Collection** | [Rare AI Archive — Complete Toolkit](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) |
| **GitHub** | [Wilhelm-Foundation/rare-archive](https://github.com/Wilhelm-Foundation/rare-archive) |
## Citation
```bibtex
@misc{rarearena2024,
title={RareArena: A Benchmark for Rare Disease Diagnosis},
author={Zhao, Zhiyu and others},
year={2024},
url={https://github.com/zhao-zy15/RareArena}
}
```
*A program of the [Wilhelm Foundation](https://wilhelm.foundation). Built on [Lattice Protocol](https://github.com/LatticeProtocol). No disease is too rare to matter.*
提供机构:
Wilhelm-Foundation



