intelmedica/general-medical-sentences-1

Hugging Face | Updated 2026-04-08 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/intelmedica/general-medical-sentences-1
Resource description:
---
license: cc-by-nc-4.0
language:
- en
task_categories:
- text-generation
- automatic-speech-recognition
tags:
- medical
- clinical
- general-medical
- synthetic
- asr-training
- sentence-generation
- clinical-documentation
pretty_name: "IntelMedica General Medical Sentences v1"
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: category
    dtype: string
  - name: source_api
    dtype: string
  - name: term
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_examples: 219412
  - name: validation
    num_examples: 47017
  - name: test
    num_examples: 47018
---

# IntelMedica General Medical Sentences v1

Synthetic sentences built around general medical terminology for broad clinical use, intended for training medical Automatic Speech Recognition (ASR) models. Part of the [IntelMedica](https://intelmedica.ai) open-source medical AI initiative.

## Overview

| Stat | Value |
|------|-------|
| **Total rows** | 313,447 |
| **Train** | 219,412 |
| **Validation** | 47,017 |
| **Test** | 47,018 |
| **Split ratio** | 70 / 15 / 15 (stratified by category) |
| **Language** | English |
| **Audience** | General |

## Category Distribution

| Category | Train | Val | Test | Total |
|----------|------:|----:|-----:|------:|
| condition | 79,923 | n/a | n/a | ~114,176 |
| drug | 70,329 | n/a | n/a | ~100,470 |
| procedure | 24,581 | n/a | n/a | ~35,116 |
| substance | 9,745 | n/a | n/a | ~13,921 |
| finding | 7,593 | n/a | n/a | ~10,847 |
| side_effect | 6,180 | n/a | n/a | ~8,829 |
| procedure_doc | 4,661 | n/a | n/a | ~6,659 |
| anatomy | 3,173 | n/a | n/a | ~4,533 |
| lab_result | 2,903 | n/a | n/a | ~4,147 |
| lab_value | 2,353 | n/a | n/a | ~3,361 |

*23 categories in total; 10 are listed here. Per-category counts are given for the train split only; validation and test follow the same distribution.*

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | The generated clinical sentence |
| `category` | string | Category of the sentence (e.g., condition, drug, procedure) |
| `source_api` | string | Origin API of the medical term used in generation |
| `term` | string | The medical term the sentence was built around |
| `audience` | string | Target audience: `general` |

## Data Sources

Medical terms were collected from 11+ authoritative APIs and databases:

| Source | Terms | Notes |
|--------|------:|-------|
| nci_thesaurus | 146,860 | NCI Thesaurus cancer/biomedical terms |
| rxnorm | 88,536 | NLM RxNorm drug names |
| snomed_ct | 29,677 | SNOMED CT clinical terms |
| hcpcs | 11,694 | CMS HCPCS procedure codes |
| cross_source | 10,805 | Multi-API combined terms |
| fda | 9,648 | FDA drug/device data |
| mesh | 8,036 | NLM MeSH medical subject headings |
| dailymed | 5,286 | FDA DailyMed drug labels |
| loinc | 2,453 | LOINC lab test codes |
| abbreviations | 295 | Medical abbreviations (104K source) |
| nursing_curated | 78 | Hand-curated nursing terms |
| cms | 40 | CMS healthcare data |
| nursing_physician | 39 | Cross-audience terms |

## Generation Pipeline

1. **Term collection** from 11 medical terminology APIs (RxNorm, SNOMED CT, NCI Thesaurus, MeSH, LOINC, DailyMed, HCPCS, FDA, CMS, plus curated nursing terms and 104K medical abbreviations)
2. **Quality cleaning** with 12 rules (deduplication, length filtering, encoding fixes, garbage removal) -- removed ~10% low-quality entries
3. **Template-based sentence generation** using Qwen 3.5 2B with audience-specific templates (general clinical scenarios)
4. **Stratified splitting** into 70/15/15 train/validation/test by category

Full pipeline code: [intelmedica/med-speech-data-prep](https://github.com/intelmedica/med-speech-data-prep)

## Audio Versions

Audio versions (TTS-synthesized at 16kHz, multi-speaker) coming soon:

- `intelmedica/medical-tts-nursing-16khz`
- `intelmedica/medical-tts-physician-16khz`
- `intelmedica/medical-tts-general-16khz`

## Usage

```python
from datasets import load_dataset

ds = load_dataset("intelmedica/general-medical-sentences-1")
print(ds)
# DatasetDict({
#     train: Dataset({features: ['text', 'category', 'source_api', 'term', 'audience'], num_rows: 219412})
#     validation: Dataset({features: [...], num_rows: 47017})
#     test: Dataset({features: [...], num_rows: 47018})
# })

print(ds["train"][0])
```

## Related Datasets

- [jfmdai/medical-speech-data-collections](https://huggingface.co/datasets/jfmdai/medical-speech-data-collections) -- Field directory of all medical speech datasets
- [jfmdai/nursing-sentences](https://huggingface.co/datasets/jfmdai/nursing-sentences) -- Original source (nursing)
- [jfmdai/physician-sentences](https://huggingface.co/datasets/jfmdai/physician-sentences) -- Original source (physician)
- [jfmdai/general-medical-sentences](https://huggingface.co/datasets/jfmdai/general-medical-sentences) -- Original source (general)

## Why `-1`?

This is **version 1**. Future versions will incorporate:

- Additional APIs (PubMed, RadLex, ClinicalTrials.gov)
- Accent diversity via voice cloning
- LLM-generated contextual clinical scenarios
- Real-world correction-based improvements from deployed ASR systems

## License

[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

## Citation

```bibtex
@dataset{general_medical_sentences_1,
  author    = {Farooq, Junaid},
  title     = {IntelMedica General Medical Sentences v1},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/intelmedica/general-medical-sentences-1},
  note      = {Synthetic medical sentences for ASR training}
}
```

## Author

**Junaid Farooq, MD** / [IntelMedica LLC](https://intelmedica.ai) / Physician-Led Open-Source Medical AI

## Disclaimer

This dataset is for **research purposes only**. It is not a medical device, not Software as a Medical Device (SaMD), and not intended for clinical decision support. All data is **synthetic** -- no Protected Health Information (PHI) is present. It was generated from publicly available medical terminology databases.
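The card describes a 70/15/15 split stratified by category but does not show how it was done. The snippet below is a minimal pure-Python sketch of one way such a split could work, followed by a check that the published split counts match the stated ratio. It is not the IntelMedica pipeline: the function name `stratified_split`, the seed, the rounding rule, and the toy rows are all assumptions made for illustration. Only the column names and the split counts come from the card.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="category", ratios=(0.70, 0.15, 0.15), seed=0):
    """Split rows into train/validation/test so each category keeps roughly the same share.

    Hypothetical helper for illustration; not taken from the dataset's pipeline.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[key]].append(row)

    splits = {"train": [], "validation": [], "test": []}
    for category in sorted(by_category):
        group = by_category[category]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        # Whatever remains after train and validation goes to test.
        splits["test"].extend(group[n_train + n_val:])
    return splits


# Toy rows that reuse the card's column names; the values are made up.
rows = [
    {"text": f"sentence {i}", "category": cat, "source_api": "rxnorm",
     "term": f"term {i}", "audience": "general"}
    for cat, n in (("condition", 100), ("drug", 60), ("procedure", 40))
    for i in range(n)
]
splits = stratified_split(rows)
sizes = {name: len(part) for name, part in splits.items()}
print(sizes)

# Check the split counts published in the card against the stated 70/15/15 ratio.
CARD_SPLITS = {"train": 219_412, "validation": 47_017, "test": 47_018}
total = sum(CARD_SPLITS.values())
fractions = {name: round(count / total, 4) for name, count in CARD_SPLITS.items()}
print(total, fractions)
```

Splitting each category separately, rather than shuffling the whole dataset once, is what keeps small categories represented in all three splits.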
Provided by:
intelmedica