
SilencioNetwork/medical-speech-dataset

Hugging Face · updated 2026-04-07 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/medical-speech-dataset
Description:
---
license: cc-by-nc-4.0
language:
- en
- es
- de
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- medical-speech
- healthcare-ai
- clinical-documentation
- voice-ai
- speech-data
- asr
pretty_name: "Medical Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: english_medical
  data_files:
  - split: medical
    path: english_medical/medical/**
- config_name: global_medical
  data_files:
  - split: medical
    path: global_medical/medical/**
size_categories:
- n<1K
---

# Medical Speech Dataset

**A specialized speech dataset for healthcare AI applications featuring real medical terminology, clinical conversations, and domain-specific vocabulary.**

This dataset is curated from the [complete-voiceai-speech-dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) and focuses specifically on medical domain speech data collected from real healthcare contexts.

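
The `dataset_info` metadata above declares 24 columns with `string`, `int64`, and `float64` dtypes. As a rough sketch of how rows from a `metadata.csv` file could be coerced to those types in plain Python, the example below assumes the CSV headers match the declared feature names. Only a subset of columns is shown, the helper names are hypothetical, and the sample rows are invented for illustration rather than taken from the dataset.

```python
import csv
import io

# Dtypes copied from the card's dataset_info block (subset of the 24 columns).
FEATURE_DTYPES = {
    "file_name": "string",
    "id": "int64",
    "duration": "float64",
    "language": "string",
    "script_id": "int64",
    "transcript": "string",
    "speaker_id": "string",
}

CASTERS = {"string": str, "int64": int, "float64": float}


def parse_metadata(csv_text):
    """Parse metadata CSV text into dicts, casting known columns to their declared dtype."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for name, value in raw.items():
            dtype = FEATURE_DTYPES.get(name, "string")
            # Empty or missing cells become None instead of failing on int("") / float("").
            row[name] = None if value in ("", None) else CASTERS[dtype](value)
        rows.append(row)
    return rows


def total_duration_by_language(rows):
    """Sum the `duration` column per `language`, skipping rows without a duration."""
    totals = {}
    for row in rows:
        if row.get("duration") is not None:
            totals[row["language"]] = totals.get(row["language"], 0.0) + row["duration"]
    return totals


# Hypothetical two-row sample; the values are made up for illustration.
SAMPLE = """file_name,id,duration,language,script_id,transcript,speaker_id
data/a.wav,1,61.5,en,10,The patient has hypertension.,spk_01
data/b.wav,2,,es,11,,spk_02
"""

rows = parse_metadata(SAMPLE)
print(total_duration_by_language(rows))
```

The card does not state the unit of `duration`, so the totals are reported in whatever unit the column uses.
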
## Dataset Overview

- **Total audio files**: 33 recordings
- **Total duration**: ~42 minutes
- **Languages**: English (native) + Global Medical (multilingual)
- **Domain**: Medical terminology, clinical documentation, patient-provider conversations
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)

## Target Applications

This dataset is designed for:

- **Medical ASR systems** (ambient clinical documentation, medical dictation)
- **Healthcare AI assistants** (Abridge, Suki, Nabla, Ambience Healthcare)
- **Medical voice note transcription**
- **Clinical conversation analysis**
- **Medical terminology recognition models**
- **Healthcare dialogue systems**

## Dataset Structure

```
medical-speech-dataset/
├── english_medical/
│   └── medical/
│       ├── data/          # 8 audio files
│       └── metadata.csv   # Speaker metadata
└── global_medical/
    └── medical/
        ├── data/          # 25 audio files
        └── metadata.csv   # Speaker metadata
```

## Data Splits

### English Medical (Native Speakers)

- **Files**: 8 recordings
- **Context**: Native English speakers discussing medical topics
- **Use case**: High-accuracy medical ASR training, US/UK clinical documentation

### Global Medical (Multilingual)

- **Files**: 25 recordings
- **Context**: Medical speech from diverse linguistic backgrounds
- **Use case**: Accent-robust medical ASR, global telehealth applications

## Key Features

- ✅ **Real medical terminology** - Conditions, medications, procedures, anatomical terms
- ✅ **Natural speech patterns** - Disfluencies, hesitations, clinical conversation flow
- ✅ **Diverse accents** - Global medical professionals and patients
- ✅ **Domain-specific vocabulary** - Not available in general speech datasets
- ✅ **Ethical data collection** - Consent-based, privacy-preserving

## Use Cases

### 1. Ambient Clinical Documentation

Train models to transcribe doctor-patient conversations in real time (similar to Abridge, Suki, Nabla).

### 2. Medical Dictation Systems

Improve accuracy for physicians dictating clinical notes, discharge summaries, and prescriptions.

### 3. Telehealth Transcription

Build ASR systems for virtual healthcare consultations across diverse accents and languages.

### 4. Medical Voice Assistants

Develop voice-enabled healthcare tools for symptom checking, medication reminders, and patient education.

### 5. Clinical Research

Analyze speech patterns in medical contexts and study communication dynamics between providers and patients.

## Loading the Dataset

```python
from datasets import concatenate_datasets, load_dataset

REPO_ID = "SilencioNetwork/medical-speech-dataset"

# The card metadata defines two configs, each with a single "medical" split,
# so a config name is passed explicitly.
english_medical = load_dataset(REPO_ID, "english_medical", split="medical")
global_medical = load_dataset(REPO_ID, "global_medical", split="medical")

# Load the full dataset by combining both configs (assumes identical columns)
dataset = concatenate_datasets([english_medical, global_medical])
```

## Sample Metadata

Each recording includes:

- `file_name`: Audio file identifier
- `birth_place`: Speaker's country/region of origin
- `language`: Primary language spoken
- `context`: Medical (clinical terminology, healthcare conversations)

## Medical Speech Characteristics

This dataset captures real-world medical speech features:

- **Medical jargon**: "hypertension", "myocardial infarction", "differential diagnosis"
- **Clinical abbreviations**: Spoken medical shorthand (BP, HR, PRN, etc.)
- **Provider-patient dynamics**: Turn-taking, clarification requests, empathy markers
- **Multilingual medical contexts**: Healthcare delivery across linguistic boundaries

## Ethical Considerations

All data was collected with explicit informed consent. No protected health information (PHI) is included: all recordings contain general medical terminology only, not patient-specific data.

## Need More Medical Speech Data?

This is a sample dataset from Silencio's larger Off-the-Shelf (OTS) medical speech inventory:

📊 **Available in full inventory:**

- 300+ hours of medical domain speech
- 15+ languages
- Specialized domains: cardiology, radiology, surgery, pharmacy, etc.
- Provider + patient perspectives

**Contact us for access**: [alex@silencioai.com](mailto:alex@silencioai.com)

## Citation

If you use this dataset in your research or in a licensed commercial product, please cite:

```bibtex
@dataset{silencio_medical_speech_2026,
  title={Medical Speech Dataset},
  author={Silencio Network},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/SilencioNetwork/medical-speech-dataset}
}
```

## Related Datasets

- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
- [European Languages Speech](https://huggingface.co/datasets/SilencioNetwork/european-languages-speech) - 5 European languages
- [Global English Accents Speech](https://huggingface.co/datasets/SilencioNetwork/global-english-accents-speech) - 20 English accent variants

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

- ✅ Free for research and non-commercial use
- ❌ Commercial use requires licensing (contact us)

## About Silencio

Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and healthcare AI developers.

🌐 [silencioai.com](https://silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)

---

**Tags**: medical speech, healthcare AI, clinical documentation, medical ASR, medical dictation, ambient scribe, domain-specific speech, medical terminology, healthcare NLP, voice health
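
For the ASR use cases this card lists, a model's output is usually scored against a reference transcript (here, the `transcript` column) with word error rate (WER). The following is a minimal pure-Python sketch, assuming lowercase whitespace tokenization with ASCII punctuation stripped; a real evaluation would likely use a dedicated library and a normalization scheme suited to medical text. The function names and example strings are illustrative and not part of the dataset.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one dropped word ("a") and one misrecognized term.
reference = "Patient reports a history of myocardial infarction."
hypothesis = "patient reports history of myocardial infraction"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

With only 33 recordings, any WER computed on this sample is a rough sanity check rather than a stable benchmark figure.
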

Provider:
SilencioNetwork