intelmedica/general-medical-sentences-1
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/intelmedica/general-medical-sentences-1
Resource description:
---
license: cc-by-nc-4.0
language:
- en
task_categories:
- text-generation
- automatic-speech-recognition
tags:
- medical
- clinical
- general-medical
- synthetic
- asr-training
- sentence-generation
- clinical-documentation
pretty_name: "IntelMedica General Medical Sentences v1"
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: category
    dtype: string
  - name: source_api
    dtype: string
  - name: term
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_examples: 219412
  - name: validation
    num_examples: 47017
  - name: test
    num_examples: 47018
---
# IntelMedica General Medical Sentences v1
Synthetic English sentences built around general medical terminology for broad clinical use, intended for training medical Automatic Speech Recognition (ASR) models. Part of the [IntelMedica](https://intelmedica.ai) open-source medical AI initiative.
## Overview
| Stat | Value |
|------|-------|
| **Total rows** | 313,447 |
| **Train** | 219,412 |
| **Validation** | 47,017 |
| **Test** | 47,018 |
| **Split ratio** | 70 / 15 / 15 (stratified by category) |
| **Language** | English |
| **Audience** | General |
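
The split sizes in the table can be checked against the stated 70 / 15 / 15 ratio with a few lines of arithmetic. This is only a sanity check on the published numbers, not part of the dataset tooling.

```python
# Split sizes as reported in the Overview table.
splits = {"train": 219_412, "validation": 47_017, "test": 47_018}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(total)  # 313447
for name, frac in fractions.items():
    # Each fraction should sit very close to 0.70, 0.15, and 0.15.
    print(f"{name}: {frac:.4f}")
```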
## Category Distribution
| Category | Train | Val | Test | Total |
|----------|------:|----:|-----:|------:|
| condition | 79,923 | — | — | ~114,176 |
| drug | 70,329 | — | — | ~100,470 |
| procedure | 24,581 | — | — | ~35,116 |
| substance | 9,745 | — | — | ~13,921 |
| finding | 7,593 | — | — | ~10,847 |
| side_effect | 6,180 | — | — | ~8,829 |
| procedure_doc | 4,661 | — | — | ~6,659 |
| anatomy | 3,173 | — | — | ~4,533 |
| lab_result | 2,903 | — | — | ~4,147 |
| lab_value | 2,353 | — | — | ~3,361 |
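*Top 10 of 23 categories shown. Per-category counts are given for the train split only; validation and test follow the same distribution because the split is stratified by category. Totals are approximate.*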
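
Since the table lists per-category counts for the train split only, the counts for any split can be recomputed from the `category` column. The helper below is a minimal sketch; `category_counts` and the toy rows are illustrative names and made-up data, not part of the dataset or its tooling.

```python
from collections import Counter
from typing import Iterable, Mapping


def category_counts(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rows per value of the `category` column."""
    return Counter(row["category"] for row in rows)


# Toy rows standing in for a real split (placeholder text, not dataset content).
sample_rows = [
    {"text": "...", "category": "drug"},
    {"text": "...", "category": "condition"},
    {"text": "...", "category": "drug"},
    {"text": "...", "category": "procedure"},
]

counts = category_counts(sample_rows)
for category, n in counts.most_common():
    print(category, n)
```

With the dataset loaded through `load_dataset`, passing `ds["validation"]` or `ds["test"]` to the same helper should yield the counts that the table leaves blank, since iterating a split produces one dict per row.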
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `text` | string | The generated clinical sentence |
| `category` | string | Category of the sentence (e.g., condition, drug, procedure; 23 values in total) |
| `source_api` | string | Origin API of the medical term used in generation |
| `term` | string | The medical term the sentence was built around |
| `audience` | string | Target audience: `general` |
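
A small validator can confirm that a row matches the schema above before it enters a training pipeline. This is a sketch under the assumption that every column is a plain string and that `audience` is always `general`, as the table states; `validate_row` and the example row are hypothetical.

```python
from typing import Any, Mapping

# Column names taken from the Schema table.
EXPECTED_COLUMNS = ("text", "category", "source_api", "term", "audience")


def validate_row(row: Mapping[str, Any]) -> list[str]:
    """Return a list of problems found in one row (empty if it looks fine)."""
    problems = []
    for column in EXPECTED_COLUMNS:
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], str):
            problems.append(f"{column} is not a string")
    # A missing audience is already reported above, so only flag wrong values.
    if row.get("audience") not in (None, "general"):
        problems.append("audience is not 'general'")
    return problems


# Made-up example row for illustration only.
example_row = {
    "text": "The patient was started on metformin.",
    "category": "drug",
    "source_api": "rxnorm",
    "term": "metformin",
    "audience": "general",
}
print(validate_row(example_row))       # []
print(validate_row({"text": "only"}))  # four missing-column messages
```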
## Data Sources
Medical terms were collected from 11+ authoritative APIs and databases:
| Source | Terms | Notes |
|--------|------:|-------|
| nci_thesaurus | 146,860 | NCI Thesaurus cancer/biomedical terms |
| rxnorm | 88,536 | NLM RxNorm drug names |
| snomed_ct | 29,677 | SNOMED CT clinical terms |
| hcpcs | 11,694 | CMS HCPCS procedure codes |
| cross_source | 10,805 | Multi-API combined terms |
| fda | 9,648 | FDA drug/device data |
| mesh | 8,036 | NLM MeSH medical subject headings |
| dailymed | 5,286 | FDA DailyMed drug labels |
| loinc | 2,453 | LOINC lab test codes |
| abbreviations | 295 | Medical abbreviations (drawn from a 104K-entry source list) |
| nursing_curated | 78 | Hand-curated nursing terms |
| cms | 40 | CMS healthcare data |
| nursing_physician | 39 | Cross-audience terms |
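
Summing the per-source counts in the table gives 313,447, the same as the total row count in the Overview. That suggests the Terms column counts rows per `source_api` value, although the card does not say so explicitly. The check below simply repeats that arithmetic.

```python
# Per-source counts copied from the Data Sources table.
source_counts = {
    "nci_thesaurus": 146_860,
    "rxnorm": 88_536,
    "snomed_ct": 29_677,
    "hcpcs": 11_694,
    "cross_source": 10_805,
    "fda": 9_648,
    "mesh": 8_036,
    "dailymed": 5_286,
    "loinc": 2_453,
    "abbreviations": 295,
    "nursing_curated": 78,
    "cms": 40,
    "nursing_physician": 39,
}

total = sum(source_counts.values())
print(total)  # 313447

# Share of the largest source, to show how skewed the mix is.
largest = max(source_counts, key=source_counts.get)
share = source_counts[largest] / total
print(f"{largest}: {share:.1%}")
```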
## Generation Pipeline
1. **Term collection** from 11 medical terminology APIs (RxNorm, SNOMED CT, NCI Thesaurus, MeSH, LOINC, DailyMed, HCPCS, FDA, CMS, plus curated nursing terms and 104K medical abbreviations)
2. **Quality cleaning** with 12 rules (including deduplication, length filtering, encoding fixes, and garbage removal), which removed ~10% of entries as low quality
3. **Template-based sentence generation** using Qwen 3.5 2B with audience-specific templates (general clinical scenarios)
4. **Stratified splitting** into 70/15/15 train/validation/test by category
Full pipeline code: [intelmedica/med-speech-data-prep](https://github.com/intelmedica/med-speech-data-prep)
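
The linked repository is the authoritative implementation. As a rough illustration of steps 2 and 4 only, the sketch below shows two of the named cleaning rules (deduplication and length filtering) and a per-category 70/15/15 split. The function names, length thresholds, seed, and rounding choices are all assumptions made for this example and are not taken from the pipeline.

```python
import random
from collections import defaultdict


def clean_terms(terms, min_len=2, max_len=80):
    """Deduplicate case-insensitively and drop terms outside a length range.

    The thresholds are illustrative; the real pipeline applies 12 rules.
    """
    seen = set()
    cleaned = []
    for term in terms:
        term = " ".join(term.split())  # trim and collapse stray whitespace
        key = term.lower()
        if min_len <= len(term) <= max_len and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return cleaned


def stratified_split(rows, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split rows into train/validation/test, one category at a time."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for category in sorted(by_category):
        group = by_category[category]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])  # remainder goes to test
    return splits


print(clean_terms(["Aspirin", "aspirin ", "x", "  Atrial   fibrillation "]))

# Toy rows: 100 "drug" and 40 "condition" entries.
toy_rows = [{"category": "drug", "term": f"d{i}"} for i in range(100)]
toy_rows += [{"category": "condition", "term": f"c{i}"} for i in range(40)]
parts = stratified_split(toy_rows)
print({name: len(part) for name, part in parts.items()})
```

Splitting within each category keeps the category proportions the same across the three splits, which is what "stratified by category" refers to.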
## Audio Versions
Audio versions (TTS-synthesized at 16 kHz, multi-speaker) coming soon:
- `intelmedica/medical-tts-nursing-16khz`
- `intelmedica/medical-tts-physician-16khz`
- `intelmedica/medical-tts-general-16khz`
## Usage
```python
from datasets import load_dataset
ds = load_dataset("intelmedica/general-medical-sentences-1")
print(ds)
# DatasetDict({
# train: Dataset({features: ['text', 'category', 'source_api', 'term', 'audience'], num_rows: 219412})
# validation: Dataset({features: [...], num_rows: 47017})
# test: Dataset({features: [...], num_rows: 47018})
# })
print(ds["train"][0])
```
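
A common next step is to select a subset of rows by `category` or `source_api`. The helper below is a minimal sketch that works on any iterable of row dicts; `select_rows` and the toy rows are hypothetical and only illustrate the idea.

```python
def select_rows(rows, category=None, source_api=None):
    """Yield rows matching the given category and/or source_api."""
    for row in rows:
        if category is not None and row["category"] != category:
            continue
        if source_api is not None and row["source_api"] != source_api:
            continue
        yield row


# Made-up rows standing in for a loaded split.
toy_rows = [
    {"text": "...", "category": "drug", "source_api": "rxnorm"},
    {"text": "...", "category": "drug", "source_api": "fda"},
    {"text": "...", "category": "condition", "source_api": "snomed_ct"},
]

drug_rows = list(select_rows(toy_rows, category="drug"))
rxnorm_drugs = list(select_rows(toy_rows, category="drug", source_api="rxnorm"))
print(len(drug_rows), len(rxnorm_drugs))  # 2 1
```

On the loaded dataset, the `datasets` library offers the same operation directly, for example `ds["train"].filter(lambda row: row["category"] == "drug")`.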
## Related Datasets
- [jfmdai/medical-speech-data-collections](https://huggingface.co/datasets/jfmdai/medical-speech-data-collections) -- Field directory of all medical speech datasets
- [jfmdai/nursing-sentences](https://huggingface.co/datasets/jfmdai/nursing-sentences) -- Original source (nursing)
- [jfmdai/physician-sentences](https://huggingface.co/datasets/jfmdai/physician-sentences) -- Original source (physician)
- [jfmdai/general-medical-sentences](https://huggingface.co/datasets/jfmdai/general-medical-sentences) -- Original source (general)
## Why `-1`?
This is **version 1**. Future versions will incorporate:
- Additional APIs (PubMed, RadLex, ClinicalTrials.gov)
- Accent diversity via voice cloning
- LLM-generated contextual clinical scenarios
- Real-world correction-based improvements from deployed ASR systems
## License
[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
## Citation
```bibtex
@dataset{general_medical_sentences_1,
author = {Farooq, Junaid},
title = {IntelMedica General Medical Sentences v1},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/intelmedica/general-medical-sentences-1},
note = {Synthetic medical sentences for ASR training}
}
```
## Author
**Junaid Farooq, MD** / [IntelMedica LLC](https://intelmedica.ai) / Physician-Led Open-Source Medical AI
## Disclaimer
This dataset is for **research purposes only**. It is not a medical device, not Software as a Medical Device (SaMD), and not intended for clinical decision support. All data is **synthetic**, generated from publicly available medical terminology databases; no Protected Health Information (PHI) is present.
Provider:
intelmedica



