WitnessDataFactory/oncology-1k
Source: Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/WitnessDataFactory/oncology-1k
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
- question-answering
language:
- en
tags:
- medical
- healthcare
- synthetic-data
- nlp
- oncology
- clinical-ai
- hipaa-compliant
- medical-nlp
- healthcare-ai
- electronic-health-records
- labeled-data
pretty_name: Oncology Medical Dataset (1K Free Sample)
size_categories:
- 1K<n<10K
---
# Oncology Medical Dataset — 1,000 Record Free Sample
> **Enterprise-grade synthetic medical data. Zero PHI. 100% HIPAA-compliant.**
---
## Quality Metrics
| Metric | Score | Industry Benchmark |
|--------|-------|---------------------|
| **Trinity Consensus Score (TAS)** | 98.0% | 85-92% typical |
| **Inter-Annotator Agreement** | 0.97 | 0.75-0.85 typical |
| **Macro F1** | 0.97 | 0.80-0.90 typical |
| **PHI Present** | None | -- |
| **Generation Method** | 3-LLM Trinity Ensemble | Single model typical |
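
The card does not say how Macro F1 or Inter-Annotator Agreement are computed. As a rough illustration of what numbers like these measure, the sketch below computes macro-averaged F1 and Cohen's kappa on a small made-up label set. Cohen's kappa is assumed here as one common agreement statistic; the source does not confirm which statistic is used, and the helper names `macro_f1` and `cohen_kappa` are invented for this example.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for illustration only; not drawn from the dataset.
annotator_1 = ["diagnosis", "treatment", "diagnosis", "followup", "treatment", "diagnosis"]
annotator_2 = ["diagnosis", "treatment", "diagnosis", "followup", "diagnosis", "diagnosis"]

print(f"macro F1: {macro_f1(annotator_1, annotator_2):.3f}")
print(f"Cohen's kappa: {cohen_kappa(annotator_1, annotator_2):.3f}")
```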
---
## What's Included (Free)
- **1,000 clinically-structured synthetic oncology records**
- Full label taxonomy with confidence scores per record
- Trinity consensus scores per record (filter by your own threshold)
- Structured Parquet format (load with Hugging Face `datasets` in one line)
- Zero PHI -- usable for research and commercial work under CC BY 4.0 (attribution required)
---
## Quick Start
```python
from datasets import load_dataset
# Load free 1K sample
ds = load_dataset("WitnessDataFactory/oncology-1k", split="train")
print(ds[0])
# Filter by quality gate
high_quality = ds.filter(lambda x: x["consensus_score"] >= 0.97)
print(f"Records passing 97% gate: {len(high_quality)}")
# Export to pandas
df = ds.to_pandas()
df.to_csv("oncology_sample.csv", index=False)
```
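
Because every record carries a `consensus_score`, choosing a quality gate is a trade-off between strictness and how many records survive. The sketch below counts survivors at several thresholds. It uses a handful of hypothetical records in plain Python so it runs without downloading anything; `count_passing` is a helper name invented here, and it should work equally on rows iterated from the loaded `ds`.

```python
def count_passing(records, thresholds):
    """Map each threshold to the number of records whose consensus_score meets it."""
    return {
        t: sum(1 for r in records if r["consensus_score"] >= t)
        for t in thresholds
    }

# Hypothetical records; only the consensus_score field from the schema is used.
sample_records = [
    {"record_id": "a", "consensus_score": 0.991},
    {"record_id": "b", "consensus_score": 0.972},
    {"record_id": "c", "consensus_score": 0.958},
    {"record_id": "d", "consensus_score": 0.940},
]

for threshold, kept in count_passing(sample_records, [0.95, 0.97, 0.99]).items():
    print(f"gate >= {threshold:.2f}: {kept} of {len(sample_records)} records")
```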
---
## Dataset Schema
```json
{
  "record_id": "uuid-v4",
  "domain": "oncology",
  "category": "Specific clinical subcategory",
  "note_type": "Clinical note type",
  "patient_age": 42,
  "patient_gender": "Female",
  "primary_label": "diagnosis",
  "labels": {
    "primary": "diagnosis",
    "category": "Subcategory name",
    "confidence": 0.972
  },
  "consensus_score": 0.972,
  "inter_annotator_agreement": 0.941,
  "macro_f1": 0.963,
  "model_scores": {
    "llama3.3": 0.975,
    "mistral": 0.968,
    "qwen2.5": 0.972
  },
  "passes_quality_gate": true,
  "generation_method": "Trinity_Ensemble_v2",
  "phi_present": false,
  "hipaa_compliant": true
}
```
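
A quick sanity check on loaded records can catch schema drift early, and flattening the nested `labels` and `model_scores` objects makes CSV export cleaner. The sketch below does both for a record shaped like the example above. It assumes the Parquet rows really arrive as nested dictionaries; the helper names `validate_record` and `flatten_record` and the choice of required fields are illustrative, not part of the dataset.

```python
REQUIRED_FIELDS = {
    "record_id", "domain", "category", "note_type", "primary_label",
    "labels", "consensus_score", "model_scores", "passes_quality_gate",
}
SCORE_FIELDS = ("consensus_score", "inter_annotator_agreement", "macro_f1")

def validate_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(record))]
    for name in SCORE_FIELDS:
        value = record.get(name)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{name} out of range: {value}")
    for model, score in record.get("model_scores", {}).items():
        if not 0.0 <= score <= 1.0:
            problems.append(f"model_scores.{model} out of range: {score}")
    return problems

def flatten_record(record, sep="."):
    """Flatten one level of nested dicts (e.g. labels.confidence) for tabular export."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

# Example record copied from the schema above (placeholder strings included).
example = {
    "record_id": "uuid-v4",
    "domain": "oncology",
    "category": "Specific clinical subcategory",
    "note_type": "Clinical note type",
    "patient_age": 42,
    "patient_gender": "Female",
    "primary_label": "diagnosis",
    "labels": {"primary": "diagnosis", "category": "Subcategory name", "confidence": 0.972},
    "consensus_score": 0.972,
    "inter_annotator_agreement": 0.941,
    "macro_f1": 0.963,
    "model_scores": {"llama3.3": 0.975, "mistral": 0.968, "qwen2.5": 0.972},
    "passes_quality_gate": True,
    "generation_method": "Trinity_Ensemble_v2",
    "phi_present": False,
    "hipaa_compliant": True,
}

print(validate_record(example))
print(sorted(flatten_record(example)))
```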
---
## Upgrade to Production Scale
This 1K sample is your **proof-of-concept dataset**. When you're ready to train production models:
| Tier | Records | Price | Per-Record | Best For | Buy |
|------|---------|-------|------------|----------|-----|
| **Starter** | 10,000 | **$1,999** | $0.20 | Pilot deployment, MVP | [Buy Now](https://witness-data-factory.onrender.com/pay/oncology-10k) |
| **Production** | 50,000 | **$7,999** | $0.16 | Model training, Series C+ | [Buy Now](https://witness-data-factory.onrender.com/pay/oncology-50k) |
| **Enterprise** | 250,000 | **$29,999** | $0.12 | FDA-track, clinical AI | [Buy Now](https://witness-data-factory.onrender.com/pay/oncology-250k) |
| **Strategic** | 1,000,000 | **$99,999** | $0.10 | Multi-year partnerships | [Contact Sales](mailto:WitnessDataFactory@gmail.com) |
### Multi-Domain Bundles
| Bundle | Contents | Price | Discount |
|--------|----------|-------|---------|
| **3-Domain Bundle** | 50K x 3 domains of choice | **$19,999** | 17% off |
| **Complete Collection** | 50K x all 9 specialties | **$49,999** | 22% off |
[View All Bundles](https://witness-data-factory.onrender.com/pay/complete-collection-9x50k)
> **Delivery:** Instant checkout -> Full dataset delivered within 24 hours.
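
The per-record and discount columns above can be reproduced with simple arithmetic. The sketch below does so under one assumption that the card does not state: bundle discounts are measured against the 50K Production price ($7,999) per domain. Under that assumed baseline the 3-domain figure comes out close to the listed 17%, while the 9-specialty figure comes out higher than the listed 22%, so the table's discount column may use a different reference price. The function names are invented for this example.

```python
def per_record_price(price_usd, records):
    """Price per record in US dollars."""
    return price_usd / records

def bundle_discount(bundle_price_usd, unit_price_usd, units):
    """Fractional saving of a bundle versus buying `units` items at `unit_price_usd`."""
    list_price = unit_price_usd * units
    return 1 - bundle_price_usd / list_price

# Tier prices and record counts copied from the table above.
tiers = {
    "Starter": (1_999, 10_000),
    "Production": (7_999, 50_000),
    "Enterprise": (29_999, 250_000),
    "Strategic": (99_999, 1_000_000),
}
for name, (price, records) in tiers.items():
    print(f"{name}: ${per_record_price(price, records):.2f} per record")

# Assumption: discounts are relative to the 50K Production price per domain.
PRODUCTION_PRICE = 7_999
print(f"3-Domain Bundle: {bundle_discount(19_999, PRODUCTION_PRICE, 3):.1%} off")
print(f"Complete Collection: {bundle_discount(49_999, PRODUCTION_PRICE, 9):.1%} off")
```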
---
## Why WITNESS DATA FACTORY?
### Speed
Your research timeline shouldn't wait 3-6 months for custom data generation.
Production datasets delivered in **under 24 hours** from purchase.
### Quality
- **98.0% Trinity consensus** vs. 85-92% industry standard
- 3-LLM ensemble designed to reduce single-model hallucination bias
- Every record validated through Trinity quality gates before delivery
- Documented, reproducible QA certificate included with every order
### Scale
- Proven on **100M+ record PostgreSQL infrastructure**
- Billion-record architecture ready for enterprise contracts
- 9 medical domains, 4 volume tiers, instant zero-touch fulfillment
### Compliance
- **Zero PHI** -- 100% synthetic, no de-identification liability
- HIPAA-compliant by architecture (no real patient data ever ingested)
- No IRB required -- fully synthetic generation pipeline
- Commercial use permitted under CC BY 4.0 (sample tier)
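
Even with fully synthetic data, some teams run a lightweight pattern scan before ingesting text into their own pipelines. The sketch below shows one such check, assuming free-text fields are available as plain strings. The patterns (email, US-style phone, SSN-like numbers), the sample notes, and the helper name `scan_for_phi_patterns` are illustrative choices made here, not part of the vendor's pipeline.

```python
import re

# Illustrative patterns only; real identifier detection needs far broader coverage.
PHI_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_phi_patterns(text):
    """Return {pattern_name: [matches]} for every pattern that matches the text."""
    hits = {}
    for name, pattern in PHI_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits

# Made-up notes for illustration; neither comes from the dataset.
clean_note = "62-year-old female with stage II breast carcinoma; started adjuvant therapy."
suspicious_note = "Contact patient at 555-123-4567 or jane.doe@example.com."

print(scan_for_phi_patterns(clean_note))
print(scan_for_phi_patterns(suspicious_note))
```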
---
## Citation
```bibtex
@dataset{witness_data_factory_oncology_2026,
title = {Oncology Synthetic Medical Dataset},
author = {WITNESS DATA FACTORY},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/WitnessDataFactory/oncology-1k}
}
```
---
## Contact
| Channel | Address |
|---------|---------|
| Sales and Licensing | [WitnessDataFactory@gmail.com](mailto:WitnessDataFactory@gmail.com) |
| Technical Support | [WitnessDataFactory@gmail.com](mailto:WitnessDataFactory@gmail.com) |
| All Datasets | [huggingface.co/WitnessDataFactory](https://huggingface.co/WitnessDataFactory) |
| Store | [witness-data-factory.onrender.com](https://witness-data-factory.onrender.com) |
---
*Powered by **WITNESS DATA FACTORY** -- Enterprise Synthetic Medical Data at Scale*
*Trinity Ensemble Pipeline v3.2.1 | Zero PHI | Zero-Touch Fulfillment*
Provider:
WitnessDataFactory
Curated Summary
Dataset Introduction

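Construction
In clinical AI research for oncology, data scarcity and privacy compliance are significant obstacles. This dataset is built with a three-model ensemble that combines the generative capabilities of Llama, Mistral, and Qwen to synthesize clinically structured oncology records. Each record is validated through a consensus scoring mechanism intended to ensure clinical plausibility and consistency, so that synthetic data can be produced at scale without using any real patient health information.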
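Features
The dataset's defining features are its quality assurance and compliance design. Its reported Trinity consensus score is 98.0%, above the 85-92% range the card cites as typical, and its reported inter-annotator agreement is 0.97. The data is entirely synthetic and contains no real protected health information (PHI), which the card presents as the basis for HIPAA compliance and for research and commercial use. Each record also carries metadata, including confidence scores, per-model scores, and a quality-gate flag, which supports fine-grained filtering and analysis.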
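Usage
The dataset is intended for developing and validating medical NLP models. It can be loaded in one line with the Hugging Face `datasets` library, and the built-in `consensus_score` field makes it easy to filter a high-quality subset, for example keeping records that score at least 0.97 for critical tasks. The data ships as structured Parquet and converts directly to a pandas DataFrame for analysis or export to CSV and other common formats. As a proof-of-concept sample, it also offers a convenient way to judge whether the larger production-scale datasets are a fit.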
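Background and Challenges
Background Overview
At the intersection of clinical AI and natural language processing, high-quality, compliant medical data underpins progress in precision medicine and diagnostic-support models. The Oncology-1k dataset was built and released by WITNESS DATA FACTORY in 2026. It focuses on oncology and aims to provide enterprise-grade synthetic data for medical text classification, entity recognition, and question answering. Generated with a three-model ensemble, it contains no protected health information and is described as HIPAA-compliant. The core problem it addresses is that real clinical data is hard to obtain and share under privacy regulations; the dataset offers a safe, scalable data foundation for training and validating medical AI models, with the aim of accelerating the development of clinical decision-support systems.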
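Current Challenges
The dataset targets difficult problems in oncology clinical text processing: accurate extraction of medical entities and relations, fine-grained classification of diagnosis and treatment information across many categories, and question answering over electronic health records. During construction, the main challenge is generating synthetic records that are realistic yet free of sensitive personal information, which requires the generating models to keep clinical terminology accurate and the narrative logically coherent while avoiding any privacy leakage. Further difficulties are ensuring that the synthetic data can stand in for real data in both statistical distribution and clinical validity, and establishing a rigorous quality-assessment framework, such as the Trinity consensus score, to safeguard its usefulness for downstream AI tasks.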
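Common Scenarios
Typical Use Cases
In oncology clinical AI, the scarcity of high-quality annotated data has long constrained the development and validation of NLP models. With its synthetic clinical records and rich annotations, Oncology-1k supports a typical use case: building and evaluating classification, entity recognition, and question-answering systems for oncology text. Its structured format and confidence scores let machine learning models be trained and tested in a setting that mimics real clinical documents, speeding up iteration on medical NLP techniques.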
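Practical Applications
In practice, Oncology-1k can support the development of tools that assist clinical workflows. For example, models trained on the data could automatically parse oncology clinical notes and extract key diagnostic information, treatment responses, or adverse-event entities, and then be integrated into electronic health record systems to help clinicians review charts and manage patients more efficiently. Because the data is synthetic and compliant, healthcare organizations and technology companies can prototype and deploy AI applications that meet regulatory requirements without worrying about data leakage.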
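Derived and Related Work
Work around high-quality synthetic medical datasets such as Oncology-1k centers on three themes: improving synthetic data generation methods, transfer learning for cross-domain medical NLP models, and using synthetic data to evaluate model fairness and robustness. For example, researchers use such data as a benchmark to compare the fidelity and bias of different generative AI systems when creating clinical text; other work explores adapting models pre-trained on this dataset to other medical specialties, to test whether synthetic data improves generalization.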
The content above was collected and summarized by 遇见数据集.



