WitnessDataFactory/rare_disease-1k

Name: WitnessDataFactory/rare_disease-1k
Creator: WitnessDataFactory
Published: 2026-04-10 18:18:48
License: No description available

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/WitnessDataFactory/rare_disease-1k

Resource description:

---
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
- question-answering
language:
- en
tags:
- medical
- healthcare
- synthetic-data
- nlp
- rare_disease
- clinical-ai
- hipaa-compliant
- medical-nlp
- healthcare-ai
- electronic-health-records
- labeled-data
pretty_name: Rare Disease Medical Dataset (1K Free Sample)
size_categories:
- 1K<n<10K
---

# Rare Disease Medical Dataset: 1,000 Record Free Sample

> **Enterprise-grade synthetic medical data. Zero PHI. 100% HIPAA-compliant.**

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Quality Gate](https://img.shields.io/badge/Trinity_TAS-98.0%-brightgreen)](https://witness-data-factory.onrender.com)
[![PHI Status](https://img.shields.io/badge/PHI-Zero-blue)](https://witness-data-factory.onrender.com)

---

## Quality Metrics

| Metric | Score | Industry Benchmark |
|--------|-------|---------------------|
| **Trinity Consensus Score (TAS)** | 98.0% | 85-92% typical |
| **Inter-Annotator Agreement** | 0.97 | 0.75-0.85 typical |
| **Macro F1** | 0.97 | 0.80-0.90 typical |
| **PHI Present** | None | -- |
| **Generation Method** | 3-LLM Trinity Ensemble | Single model typical |

---

## What's Included (Free)

- **1,000 clinically-structured synthetic rare disease records**
- Full label taxonomy with confidence scores per record
- Trinity consensus scores per record (filter by your own threshold)
- Structured Parquet format (load with Hugging Face `datasets` in one line)
- Zero PHI -- safe for unrestricted research and commercial use

---

## Quick Start

```python
from datasets import load_dataset

# Load free 1K sample
ds = load_dataset("WitnessDataFactory/rare_disease-1k", split="train")
print(ds[0])

# Filter by quality gate
high_quality = ds.filter(lambda x: x["consensus_score"] >= 0.97)
print(f"Records passing 97% gate: {len(high_quality)}")

# Export to pandas
df = ds.to_pandas()
df.to_csv("rare_disease_sample.csv", index=False)
```

---

## Dataset Schema

```json
{
  "record_id": "uuid-v4",
  "domain": "rare_disease",
  "category": "Specific clinical subcategory",
  "note_type": "Clinical note type",
  "patient_age": 42,
  "patient_gender": "Female",
  "primary_label": "diagnosis",
  "labels": {
    "primary": "diagnosis",
    "category": "Subcategory name",
    "confidence": 0.972
  },
  "consensus_score": 0.972,
  "inter_annotator_agreement": 0.941,
  "macro_f1": 0.963,
  "model_scores": {
    "llama3.3": 0.975,
    "mistral": 0.968,
    "qwen2.5": 0.972
  },
  "passes_quality_gate": true,
  "generation_method": "Trinity_Ensemble_v2",
  "phi_present": false,
  "hipaa_compliant": true
}
```

---

## Upgrade to Production Scale

This 1K sample is your **proof-of-concept dataset**. When you're ready to train production models:

| Tier | Records | Price | Per-Record | Best For | Buy |
|------|---------|-------|------------|----------|-----|
| **Starter** | 10,000 | **$1,999** | $0.20 | Pilot deployment, MVP | [Buy Now](https://witness-data-factory.onrender.com/pay/rare_disease-10k) |
| **Production** | 50,000 | **$7,999** | $0.16 | Model training, Series C+ | [Buy Now](https://witness-data-factory.onrender.com/pay/rare_disease-50k) |
| **Enterprise** | 250,000 | **$29,999** | $0.12 | FDA-track, clinical AI | [Buy Now](https://witness-data-factory.onrender.com/pay/rare_disease-250k) |
| **Strategic** | 1,000,000 | **$99,999** | $0.10 | Multi-year partnerships | [Contact Sales](mailto:WitnessDataFactory@gmail.com) |

### Multi-Domain Bundles

| Bundle | Contents | Price | Discount |
|--------|----------|-------|---------|
| **3-Domain Bundle** | 50K x 3 domains of choice | **$19,999** | 17% off |
| **Complete Collection** | 50K x all 9 specialties | **$49,999** | 22% off |

[View All Bundles](https://witness-data-factory.onrender.com/pay/complete-collection-9x50k)

> **Delivery:** Instant checkout -> Full dataset delivered within 24 hours.

---

## Why WITNESS DATA FACTORY?

### Speed

Your research timeline shouldn't wait 3-6 months for custom data generation. Production datasets delivered in **under 24 hours** from purchase.

### Quality

- **98.0% Trinity consensus** vs. 85-92% industry standard
- 3-LLM ensemble eliminates single-model hallucination bias
- Every record validated through Trinity quality gates before delivery
- Documented, reproducible QA certificate included with every order

### Scale

- Proven on **100M+ record PostgreSQL infrastructure**
- Billion-record architecture ready for enterprise contracts
- 9 medical domains, 4 volume tiers, instant zero-touch fulfillment

### Compliance

- **Zero PHI** -- 100% synthetic, no de-identification liability
- HIPAA-compliant by architecture (no real patient data ever ingested)
- No IRB required -- fully synthetic generation pipeline
- Commercial use permitted under CC BY 4.0 (sample tier)

---

## Citation

```bibtex
@dataset{witness_data_factory_rare_disease_2026,
  title = {Rare Disease Synthetic Medical Dataset},
  author = {WITNESS DATA FACTORY},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/WitnessDataFactory/rare_disease-1k}
}
```

---

## Contact

| Channel | Address |
|---------|---------|
| Sales and Licensing | [WitnessDataFactory@gmail.com](mailto:WitnessDataFactory@gmail.com) |
| Technical Support | [WitnessDataFactory@gmail.com](mailto:WitnessDataFactory@gmail.com) |
| All Datasets | [huggingface.co/WitnessDataFactory](https://huggingface.co/WitnessDataFactory) |
| Store | [witness-data-factory.onrender.com](https://witness-data-factory.onrender.com) |

---

*Powered by **WITNESS DATA FACTORY** -- Enterprise Synthetic Medical Data at Scale*

*Trinity Ensemble Pipeline v3.2.1 | Zero PHI | Zero-Touch Fulfillment*

Provided by:

WitnessDataFactory

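Summary

Dataset introduction

Construction method

In rare disease research, the scarcity of high-quality clinical data has long constrained the development of medical AI. This dataset is built with the Trinity ensemble generation method, in which three large language models jointly construct synthetic medical records so as to avoid the hallucination bias a single model may produce. The generation process follows clinical structure conventions, and every record is validated with a consensus score. The data therefore simulates real-world rare disease cases while containing no protected health information (PHI), which aligns it with HIPAA at the architectural level.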
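The card does not document how the three per-model scores are combined into a consensus score. As a rough illustration only, the sketch below assumes the consensus is the arithmetic mean of the per-model scores and that a record passes the gate when that mean meets a threshold. The `consensus` and `passes_gate` helpers and the 0.97 default are assumptions, not the publisher's pipeline; the sample scores are the ones shown in the card's schema example.

```python
from statistics import mean

def consensus(model_scores: dict[str, float]) -> float:
    """Assumed rule: average the per-model scores (not confirmed by the card)."""
    return mean(model_scores.values())

def passes_gate(model_scores: dict[str, float], threshold: float = 0.97) -> bool:
    """Assumed gate: the averaged score must meet or exceed the threshold."""
    return consensus(model_scores) >= threshold

# Per-model scores copied from the schema example in the dataset card.
scores = {"llama3.3": 0.975, "mistral": 0.968, "qwen2.5": 0.972}

score = consensus(scores)
print(round(score, 3))
print(passes_gate(scores))
```

Under this assumed rule the rounded result is 0.972, which agrees with the `consensus_score` in the schema example, though that agreement may be coincidental.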

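Usage

Researchers can load the dataset with the Hugging Face `datasets` library and access all of its contents with a single line of code. The dataset supports filtering on quality metrics such as the consensus score, so users can extract a high-confidence subset that suits their research needs. The data converts directly to a pandas DataFrame for downstream analysis, visualization, or model training. The free sample is intended as a proof of concept and as a starting point for scaling up to production datasets of tens of thousands to one million records.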
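Because each record carries its own `consensus_score`, choosing a threshold trades quality against sample size. The following is a minimal sketch of that trade-off on plain dictionaries shaped like the card's schema. The three records and their scores are invented for illustration, and `count_passing` is a hypothetical helper; with the real dataset, the same comparison could be passed to `ds.filter` as in the card's Quick Start.

```python
# Invented records shaped like the card's schema (only the fields used here).
records = [
    {"record_id": "a", "consensus_score": 0.981},
    {"record_id": "b", "consensus_score": 0.972},
    {"record_id": "c", "consensus_score": 0.955},
]

def count_passing(rows, threshold):
    """Count rows whose consensus_score meets or exceeds the threshold."""
    return sum(1 for row in rows if row["consensus_score"] >= threshold)

# Sweep a few candidate thresholds to see how many records each one keeps.
sweep = {t: count_passing(records, t) for t in (0.95, 0.97, 0.98)}

for threshold, kept in sweep.items():
    print(f"threshold={threshold:.2f} kept={kept}/{len(records)}")
```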

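Background and challenges

Background overview

At the intersection of clinical medicine and artificial intelligence, rare disease research has long suffered from a shortage of high-quality annotated data, which has held back progress in diagnostic models and natural language processing. The Rare Disease Medical Dataset (rare_disease-1k), built by WITNESS DATA FACTORY in 2026, was created in response. It uses synthetic data techniques to generate HIPAA-compliant clinical records, supporting medical research with zero PHI risk. The dataset targets text classification, entity recognition, and question answering in the rare disease domain. Its three-model ensemble generation and quality assessment system provides a reliable, scalable data foundation for training medical AI models and advances the use of synthetic data in medical settings with strict compliance requirements.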

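Current challenges

Building a rare disease dataset involves two challenges. At the domain level, rare disease cases are few and their clinical presentation is highly heterogeneous, so traditional data collection struggles to obtain enough diverse annotated samples, which limits the generalization and robustness of supervised models. During construction, synthetic data generation must balance clinical realism against privacy: it must avoid introducing protected health information (PHI) while using a multi-model consensus mechanism to control hallucination bias and keep labels consistent and accurate. This places high demands on the reliability of the generation pipeline and of the quality assessment system.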

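Common use cases

Typical use cases

In rare disease clinical research, the scarcity of high-quality annotated data has long constrained the development and validation of natural language processing models. By providing structured synthetic clinical records, this dataset serves as a benchmark for tasks such as text classification, entity recognition, and question answering. Researchers can use its labels and confidence scores to train models that identify rare disease diagnoses and symptoms, and to evaluate how well algorithms generalize and how robust they are on medical text.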
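For the classification benchmark use described above, macro F1 (a metric the dataset card itself reports) is a natural evaluation measure because it weights rare classes equally with common ones. The sketch below computes it in plain Python; the label names and predictions are invented for illustration and are not drawn from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty class.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Invented gold labels and predictions, for illustration only.
y_true = ["diagnosis", "diagnosis", "symptom", "treatment"]
y_pred = ["diagnosis", "symptom", "symptom", "treatment"]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

Here the per-class F1 values are 2/3, 2/3, and 1, so the macro average is 7/9.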

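Academic problems addressed

The dataset eases the data shortage that patient privacy restrictions impose on rare disease research. Because it is synthetic, it avoids the risk of leaking real patient information, so researchers can explore key problems such as disease pattern recognition and clinical decision support without the associated ethical burden. By providing highly consistent annotations, it supports reproducibility research on medical NLP models and offers a reliable basis for algorithmic fairness and bias analysis.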

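Practical applications

In deployed medical AI settings, the dataset can be used directly to develop tools that automate clinical documentation processing. Healthcare institutions can train models on it to screen and classify rare disease cases, helping physicians diagnose more efficiently. Pharmaceutical companies can also mine the data for disease-associated features to accelerate targeted drug development, while keeping the whole process compliant with medical data regulations such as HIPAA.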

Recent research on the dataset