Wilhelm-Foundation/rare-archive-eval-rarearena-rds

Name: Wilhelm-Foundation/rare-archive-eval-rarearena-rds
Creator: Wilhelm-Foundation
Published: 2026-03-26 11:21:18
License: CC BY-NC-SA 4.0

Source: Hugging Face · Updated: 2026-03-26 · Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds

Description:

---
dataset_info:
  config_name: default
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: metadata
    struct:
    - name: case_id
      dtype: string
    - name: split
      dtype: string
    - name: disease_id
      dtype: string
    - name: patient_category
      dtype: string
  splits:
  - name: test
    num_examples: 8562
tags:
- rare-disease
- clinical-diagnostics
- rarearena_eval
- rare-ai-archive
- evaluation
- benchmark
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 1K<n<10K
license: cc-by-nc-sa-4.0
---

# RareArena RDS: Rare Disease Specialists Evaluation Benchmark

8,562 clinical vignettes across 4,000+ rare diseases for evaluating AI diagnostic reasoning. Part of the [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive).

> **Research use only.** This dataset is an evaluation benchmark for AI systems. It is NOT intended for clinical decision-making and should NOT be used as a diagnostic tool.

## Dataset Description

- **Repository:** [Wilhelm-Foundation/rare-archive-eval-rarearena-rds](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds)
- **License:** CC BY-NC-SA 4.0
- **Version:** 0.1.0
- **Part of:** [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive) · [Complete Toolkit Collection](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851)

## Ecosystem Context

This evaluation benchmark measures how well models handle the **diagnostic reasoning patterns** that rare disease specialists use every day. These patterns (which tool to invoke, in what order, for which symptom constellations) are the exact traces captured during [Undiagnosed Patient Hackathons](https://www.nature.com/articles/d41586-026-00302-8) and structured by clinician validators. RDS vignettes test the core of the agentic system: can the model reason through a clinical presentation and produce a meaningful differential diagnosis?

As the ecosystem grows, clinician evaluations in the [ELO Arena](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md) generate preference data that complements this benchmark, revealing not just *what* the model gets right, but *how well* it reasons. Disease categories in this dataset map to the ontology's clustering scheme, enabling evaluation of [condition-specific model adapters](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1#condition-specific-models) (IEM, Neuromuscular, Connective Tissue, and more) alongside the foundation model.

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `messages` | list | Chat-format messages (system prompt, user vignette, assistant diagnosis) |
| `metadata.case_id` | string | Unique case identifier |
| `metadata.split` | string | Source split (`rds`) |
| `metadata.disease_id` | string | Orphanet disease ID |
| `metadata.patient_category` | string | Patient category (where available) |

### Data Splits

| Split | Records | Purpose |
|-------|---------|---------|
| test | 8,562 | Evaluation benchmark |

### Message Format

Each record follows the OpenAI chat format with three messages:

1. **System**: Expert rare disease diagnostician instruction
2. **User**: Clinical vignette describing patient presentation
3. **Assistant**: Expected diagnostic reasoning and differential

## Dataset Creation

### Source Data

Derived from the [RareArena](https://github.com/zhao-zy15/RareArena) RDS (Rare Disease Specialists) benchmark. Original cases are drawn from published case reports in PubMed Central (PMC), rewritten by GPT-4o to remove identifying information while preserving clinical detail.

### Data Processing

1. Case reports sourced from PMC literature
2. Clinical vignettes generated via GPT-4o rewriting (de-identification)
3. Formatted as OpenAI chat JSONL via `rare-archive-datasets v0.1.0` `parse_case()` (v3 format)
4. Published to HuggingFace Hub

### PHI Status

**no_phi**: All vignettes are GPT-4o rewrites of published case reports. No real patient data is included.

## Intended Use

**Primary use**: Evaluation benchmark for rare disease diagnostic AI models.

- Measure Top-K differential diagnosis accuracy
- Compare model performance across disease categories
- Assess reasoning quality on clinical vignettes

**Part of the ecosystem flywheel**: RDS evaluation results reveal where models need improvement → clinicians provide corrections in the Arena → corrections become new training data → the next model version is evaluated on this same benchmark. The cycle repeats.

**NOT intended for**: Clinical decision-making, patient diagnosis, training data (use [synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients) for SFT training).

## Bias, Risks & Limitations

### Known Biases

| Bias Category | Description | Impact |
|---------------|-------------|--------|
| **Publication bias** | Over-represents diseases with published case reports in PMC | Common rare diseases over-represented; ultra-rare conditions under-represented |
| **Language bias** | English-only dataset | Excludes non-English medical literature and clinical presentations |
| **Geographic bias** | Cases drawn from institutions that publish in English-language journals | Western/academic medical center presentations over-represented |
| **Rewriting artifacts** | GPT-4o rewriting may alter clinical nuance | Some diagnostic subtleties may be lost or standardized |
| **Temporal bias** | Case reports span multiple decades | Older cases may reflect outdated diagnostic criteria or terminology |

### Risks

- **Not for clinical use**: This is an evaluation benchmark, not a diagnostic tool
- **Metric limitations**: Top-K accuracy on curated vignettes does not reflect real-world diagnostic performance
- **Distribution mismatch**: Published case reports are not representative of clinical practice populations

### Limitations

- Monolingual (English only)
- No laboratory test results in RDS vignettes (see [RDC](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) for cases with lab data)
- Disease coverage limited to conditions with published PMC case reports
- No severity or acuity annotations

## Loading the Dataset

### Using HuggingFace Datasets

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rds")

# Access the test split
test = dataset["test"]
print(f"Total cases: {len(test)}")

# Inspect a single case
example = test[0]
for msg in example["messages"]:
    print(f"[{msg['role']}] {msg['content'][:100]}...")
print(f"Disease ID: {example['metadata']['disease_id']}")
```

### Using Pandas

```python
import pandas as pd

# Load from parquet (faster)
df = pd.read_parquet(
    "hf://datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds/data/test-00000-of-00001.parquet"
)
print(f"Shape: {df.shape}")
print(df.head())
```

### Filtering by Disease

```python
from datasets import load_dataset

dataset = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rds", split="test")

# Filter to a specific Orphanet disease ID
gaucher_cases = dataset.filter(lambda x: x["metadata"]["disease_id"] == "355")
print(f"Gaucher disease cases: {len(gaucher_cases)}")
```

## Related Resources

| Resource | Link |
|----------|------|
| **Model** | [rare-archive-qwen-4b-sft-v1](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1): 4B SFT model trained on rare disease cases |
| **RDC Benchmark** | [rare-archive-eval-rarearena-rdc](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc): 4,376 cases with laboratory test results |
| **Training Data** | [rare-archive-synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients): 12,984 synthetic vignettes for SFT |
| **Clinical Demo** | [rare-archive-clinical-demo](https://huggingface.co/spaces/Wilhelm-Foundation/rare-archive-clinical-demo): Interactive demo Space |
| **Collection** | [Rare AI Archive: Complete Toolkit](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) |
| **GitHub** | [Wilhelm-Foundation/rare-archive](https://github.com/Wilhelm-Foundation/rare-archive) |

## Citation

```bibtex
@misc{rarearena2024,
  title={RareArena: A Benchmark for Rare Disease Diagnosis},
  author={Zhao, Zhiyu and others},
  year={2024},
  url={https://github.com/zhao-zy15/RareArena}
}
```

*A program of the [Wilhelm Foundation](https://wilhelm.foundation). Built on [Lattice Protocol](https://github.com/LatticeProtocol). No disease is too rare to matter.*
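
The card lists Top-K differential diagnosis accuracy and per-category comparison as intended uses but does not say how they are scored. The block below is a minimal sketch under several assumptions, not the benchmark's official scorer. It assumes the model returns a ranked list of disease names, that a single reference diagnosis string has already been extracted from the assistant message, and that a match means case-insensitive string equality after whitespace normalization. All function names (`split_record`, `top_k_hit`, `top_k_accuracy`, `accuracy_by_group`) are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, Sequence


def split_record(record: Dict) -> Dict[str, str]:
    """Split one chat-format record into its prompt parts and reference answer."""
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    return {
        "system": by_role.get("system", ""),
        "user": by_role.get("user", ""),
        # Assumption: the assistant message is used as the reference text.
        "reference": by_role.get("assistant", ""),
    }


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(name.lower().split())


def top_k_hit(ranked: Sequence[str], reference: str, k: int) -> bool:
    """Return True if the reference appears among the first k ranked predictions."""
    target = normalize(reference)
    return any(normalize(candidate) == target for candidate in ranked[:k])


def top_k_accuracy(
    predictions: Sequence[Sequence[str]], references: Sequence[str], k: int
) -> float:
    """Fraction of cases whose reference diagnosis is in the top k predictions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(top_k_hit(p, r, k) for p, r in zip(predictions, references))
    return hits / len(references)


def accuracy_by_group(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    groups: Sequence[str],
    k: int,
) -> Dict[str, float]:
    """Top-k accuracy per group label, e.g. metadata.patient_category."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for ranked, reference, group in zip(predictions, references, groups):
        totals[group] += 1
        hits[group] += int(top_k_hit(ranked, reference, k))
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Toy placeholder values, not taken from the dataset.
    toy_predictions = [["Disease A", "Disease B"], ["Disease C", "Disease D"]]
    toy_references = ["disease a", "Disease D"]
    toy_groups = ["category_1", "category_1"]
    print(top_k_accuracy(toy_predictions, toy_references, k=1))
    print(accuracy_by_group(toy_predictions, toy_references, toy_groups, k=2))
```

Exact string matching is the simplest possible criterion and will undercount synonyms and spelling variants. A real evaluation would more likely compare Orphanet IDs (`metadata.disease_id`) or use a synonym-aware matcher, and since the assistant message holds reasoning plus a differential, extracting a single label from it would need its own step.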

Provider: Wilhelm-Foundation
