Wilhelm-Foundation/rare-archive-eval-rarearena-rdc

Name: Wilhelm-Foundation/rare-archive-eval-rarearena-rdc
Creator: Wilhelm-Foundation
Published: 2026-03-26 11:21:19
License: 暂无描述

Hugging Face2026-03-26 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: config_name: default features: - name: messages list: - name: role dtype: string - name: content dtype: string - name: metadata struct: - name: case_id dtype: string - name: split dtype: string - name: disease_id dtype: string - name: patient_category dtype: string splits: - name: test num_examples: 4376 tags: - rare-disease - clinical-diagnostics - rarearena_eval - rare-ai-archive - evaluation - benchmark task_categories: - text-generation - question-answering language: - en size_categories: - 1K<n<10K license: cc-by-nc-sa-4.0 --- # RareArena RDC — Rare Disease Cases Evaluation Benchmark 4,376 clinical vignettes **with laboratory test results** across rare diseases for evaluating AI diagnostic reasoning with lab data. Part of the [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive). > **Research use only.** This dataset is an evaluation benchmark for AI systems. It is NOT intended for clinical decision-making and should NOT be used as a diagnostic tool. ## Dataset Description - **Repository:** [Wilhelm-Foundation/rare-archive-eval-rarearena-rdc](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) - **License:** CC BY-NC-SA 4.0 - **Version:** 0.1.0 - **Part of:** [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive) · [Complete Toolkit Collection](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) ## How RDC Differs from RDS | Feature | RDS | RDC | |---------|-----|-----| | **Records** | 8,562 | 4,376 | | **Lab results** | No | **Yes** — vignettes include laboratory and diagnostic test results | | **Case complexity** | Clinical presentation only | Clinical presentation + interpreted lab data | | **Use case** | Evaluate clinical reasoning from symptoms | Evaluate reasoning from symptoms **and** test results | RDC cases are generally more complex — they require the model to integrate laboratory findings with clinical presentation, closer to real-world diagnostic workflows. ## Ecosystem Context RDC cases add a critical dimension to evaluation: **laboratory test results**. In the full agentic diagnostic system, models learn to invoke clinical tools like ClinVar and gnomAD, then interpret their results alongside clinical presentations. This dataset tests that capability — can the model integrate lab data into its diagnostic reasoning? This maps directly to Stage 2 of the [4-stage training pipeline](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md), where models learn to invoke tools and interpret real API responses. RDC evaluation measures whether tool-augmented reasoning produces better differentials than clinical presentation alone. Disease categories in this dataset map to the ontology's clustering scheme, enabling evaluation of [condition-specific model adapters](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1#condition-specific-models) alongside the foundation model. ## Dataset Structure ### Data Fields | Field | Type | Description | |-------|------|-------------| | `messages` | list | Chat-format messages (system prompt, user vignette + test results, assistant diagnosis) | | `metadata.case_id` | string | Unique case identifier | | `metadata.split` | string | Source split (`rdc`) | | `metadata.disease_id` | string | Orphanet disease ID | | `metadata.patient_category` | string | Patient category (where available) | ### Data Splits | Split | Records | Purpose | |-------|---------|---------| | test | 4,376 | Evaluation benchmark | ### Message Format Each record follows the OpenAI chat format: 1. **System**: Expert rare disease diagnostician instruction 2. **User**: Clinical vignette **with laboratory/diagnostic test results** 3. **Assistant**: Expected diagnostic reasoning and differential ## Dataset Creation ### Source Data Derived from the [RareArena](https://github.com/zhao-zy15/RareArena) RDC (Rare Disease Cases) benchmark. Cases are drawn from published case reports in medical literature, rewritten by GPT-4o. Test results are concatenated with clinical vignettes to form comprehensive presentations. ### Data Processing 1. Case reports sourced from published medical literature 2. Clinical vignettes + test results generated via GPT-4o (de-identification) 3. Test results concatenated to clinical vignettes in the user message 4. Formatted as OpenAI chat JSONL via `rare-archive-datasets v0.1.0` `parse_case()` (v3 format) ### PHI Status **no_phi** — All vignettes are GPT-4o rewrites of published case reports. No real patient data. ## Intended Use **Primary use**: Evaluation benchmark for rare disease diagnostic AI models, specifically testing the ability to integrate laboratory findings with clinical reasoning. - Measure Top-K differential diagnosis accuracy with lab context - Compare model performance: RDS-only vs RDC (does lab data improve accuracy?) - Benchmark lab result interpretation in diagnostic reasoning **NOT intended for**: Clinical decision-making, patient diagnosis, or model training. **Part of the ecosystem flywheel**: RDC evaluation reveals how well models interpret clinical data — a key input for Arena evaluators who score Tool Usage as one of the [5 quality dimensions](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md#rlhf-feedback-loop). ## Bias, Risks & Limitations ### Known Biases | Bias Category | Description | Impact | |---------------|-------------|--------| | **Publication bias** | Over-represents diseases with published case reports | Common rare diseases over-represented; ultra-rare conditions under-represented | | **Language bias** | English-only dataset | Non-English medical literature excluded | | **Geographic bias** | Cases from English-language publishing institutions | Western academic medical centers over-represented | | **Lab availability bias** | Cases include specific lab tests | Biased toward well-resourced clinical settings with access to specialized testing | | **Rewriting artifacts** | GPT-4o rewriting may alter clinical nuance | Lab value interpretation may be standardized | ### Risks - **Not for clinical use**: Evaluation benchmark only - **Lab interpretation limits**: AI performance on curated lab results does not reflect performance on real clinical lab reports - **Smaller sample**: 4,376 cases provides less statistical power than RDS for rare disease subgroup analysis ### Limitations - Monolingual (English only) - Smaller than RDS (4,376 vs 8,562) — fewer diseases covered - Lab results are text descriptions, not structured lab values - No temporal ordering of tests within a case ## Loading the Dataset ### Using HuggingFace Datasets ```python from datasets import load_dataset # Load the full dataset dataset = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rdc") # Access the test split test = dataset["test"] print(f"Total cases: {len(test)}") # Inspect a case with lab results example = test[0] for msg in example["messages"]: print(f"[{msg['role']}] {msg['content'][:150]}...") ``` ### Using Pandas ```python import pandas as pd df = pd.read_parquet( "hf://datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc/data/test-00000-of-00001.parquet" ) print(f"Shape: {df.shape}") ``` ### Comparing RDS vs RDC Performance ```python from datasets import load_dataset rds = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rds", split="test") rdc = load_dataset("Wilhelm-Foundation/rare-archive-eval-rarearena-rdc", split="test") print(f"RDS cases: {len(rds)} (clinical vignettes only)") print(f"RDC cases: {len(rdc)} (vignettes + lab results)") # Find overlapping disease IDs rds_diseases = set(x["metadata"]["disease_id"] for x in rds) rdc_diseases = set(x["metadata"]["disease_id"] for x in rdc) overlap = rds_diseases & rdc_diseases print(f"Shared diseases: {len(overlap)}") ``` ## Related Resources | Resource | Link | |----------|------| | **Model** | [rare-archive-qwen-4b-sft-v1](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1) — 4B SFT model trained on rare disease cases | | **RDS Benchmark** | [rare-archive-eval-rarearena-rds](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds) — 8,562 clinical vignettes (no lab data) | | **Training Data** | [rare-archive-synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients) — 12,984 synthetic vignettes for SFT | | **Clinical Demo** | [rare-archive-clinical-demo](https://huggingface.co/spaces/Wilhelm-Foundation/rare-archive-clinical-demo) — Interactive demo Space | | **Collection** | [Rare AI Archive — Complete Toolkit](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) | | **GitHub** | [Wilhelm-Foundation/rare-archive](https://github.com/Wilhelm-Foundation/rare-archive) | ## Citation ```bibtex @misc{rarearena2024, title={RareArena: A Benchmark for Rare Disease Diagnosis}, author={Zhao, Zhiyu and others}, year={2024}, url={https://github.com/zhao-zy15/RareArena} } ``` *A program of the [Wilhelm Foundation](https://wilhelm.foundation). Built on [Lattice Protocol](https://github.com/LatticeProtocol). No disease is too rare to matter.*

提供机构：

Wilhelm-Foundation

5,000+

优质数据集

54 个

任务类型

进入经典数据集