SilencioNetwork/medical-speech-dataset
Source: Hugging Face | Updated: 2026-04-07 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/SilencioNetwork/medical-speech-dataset
Resource description:
---
license: cc-by-nc-4.0
language:
- en
- es
- de
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- medical-speech
- healthcare-ai
- clinical-documentation
- voice-ai
- speech-data
- asr
pretty_name: "Medical Speech Dataset"
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: id
    dtype: int64
  - name: gender
    dtype: string
  - name: ethnicity
    dtype: string
  - name: occupation
    dtype: string
  - name: birth_place
    dtype: string
  - name: mother_tongue
    dtype: string
  - name: dialect
    dtype: string
  - name: year_of_birth
    dtype: int64
  - name: years_at_birth_place
    dtype: int64
  - name: languages_data
    dtype: string
  - name: os
    dtype: string
  - name: device
    dtype: string
  - name: browser
    dtype: string
  - name: duration
    dtype: float64
  - name: emotions
    dtype: string
  - name: language
    dtype: string
  - name: location
    dtype: string
  - name: noise_sources
    dtype: string
  - name: script_id
    dtype: int64
  - name: type_of_script
    dtype: string
  - name: script
    dtype: string
  - name: transcript
    dtype: string
  - name: speaker_id
    dtype: string
configs:
- config_name: english_medical
  data_files:
  - split: medical
    path: english_medical/medical/**
- config_name: global_medical
  data_files:
  - split: medical
    path: global_medical/medical/**
size_categories:
- n<1K
---
# Medical Speech Dataset
**A specialized speech dataset for healthcare AI applications featuring real medical terminology, clinical conversations, and domain-specific vocabulary.**
This dataset is curated from the [complete-voiceai-speech-dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) and focuses specifically on medical domain speech data collected from real healthcare contexts.
## Dataset Overview
- **Total audio files**: 33 recordings
- **Total duration**: ~42 minutes
- **Languages**: English (native speakers) plus a multilingual Global Medical set (the card metadata lists `en`, `es`, and `de`)
- **Domain**: Medical terminology, clinical documentation, patient-provider conversations
- **Audio format**: WAV files
- **Sample rate**: 48 kHz
- **License**: CC BY-NC 4.0 (free for research, non-commercial use)
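
The format figures above can be spot-checked on a downloaded file. The following is a minimal sketch using Python's standard `wave` module; it assumes the recordings are plain PCM WAV files that `wave` can parse, and the helper names `wav_info` and `check_sample_rate` are illustrative, not part of the dataset tooling.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) read from a WAV header."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        # Duration is the number of frames divided by frames per second.
        duration = wf.getnframes() / rate
    return rate, channels, duration


def check_sample_rate(path, expected_hz=48_000):
    """True if the header reports the expected sample rate (48 kHz per the overview)."""
    return wav_info(path)[0] == expected_hz
```

Summing the third value of `wav_info` over every file in a local copy gives a total that can be compared with the stated ~42 minutes.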
## Target Applications
This dataset is designed for:
- **Medical ASR systems** (ambient clinical documentation, medical dictation)
- **Healthcare AI assistants** (Abridge, Suki, Nabla, Ambience Healthcare)
- **Medical voice note transcription**
- **Clinical conversation analysis**
- **Medical terminology recognition models**
- **Healthcare dialogue systems**
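
For the ASR applications listed above, a common way to score a model is word error rate (WER). The sketch below is a plain-Python implementation based on word-level edit distance. It assumes the `transcript` field serves as the reference text and that lowercasing plus whitespace splitting is adequate normalization; `word_error_rate` is an illustrative helper, not a function shipped with the dataset.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

With only 33 recordings, a score computed this way is better read as a domain spot check than as a benchmark result.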
## Dataset Structure
```
medical-speech-dataset/
├── english_medical/
│   └── medical/
│       ├── data/          # 8 audio files
│       └── metadata.csv   # Speaker metadata
└── global_medical/
    └── medical/
        ├── data/          # 25 audio files
        └── metadata.csv   # Speaker metadata
```
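
Given the layout above, a local copy can be walked without the `datasets` library. This is a minimal sketch, assuming each `metadata.csv` has a `file_name` column whose values are paths relative to the folder containing the CSV (for example `data/<name>.wav`); that convention is an assumption, and `iter_recordings` is an illustrative helper.

```python
import csv
from pathlib import Path


def iter_recordings(config_dir):
    """Yield (audio_path, metadata_row) pairs for one folder such as english_medical/medical."""
    config_dir = Path(config_dir)
    with open(config_dir / "metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Assumption: file_name is relative to the folder that holds metadata.csv.
            yield config_dir / row["file_name"], row
```

Each yielded row is a plain dict keyed by the CSV header, so fields such as `transcript` or `language` can be read directly.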
## Data Splits
### English Medical (Native Speakers)
- **Files**: 8 recordings
- **Context**: Native English speakers discussing medical topics
- **Use case**: High-accuracy medical ASR training, US/UK clinical documentation
### Global Medical (Multilingual)
- **Files**: 25 recordings
- **Context**: Medical speech from diverse linguistic backgrounds
- **Use case**: Accent-robust medical ASR, global telehealth applications
## Key Features
- ✅ **Real medical terminology** - Conditions, medications, procedures, anatomical terms
- ✅ **Natural speech patterns** - Disfluencies, hesitations, clinical conversation flow
- ✅ **Diverse accents** - Global medical professionals and patients
- ✅ **Domain-specific vocabulary** - Not available in general speech datasets
- ✅ **Ethical data collection** - Consent-based, privacy-preserving
## Use Cases
### 1. Ambient Clinical Documentation
Train models to transcribe doctor-patient conversations in real time (similar to Abridge, Suki, Nabla).
### 2. Medical Dictation Systems
Improve accuracy for physicians dictating clinical notes, discharge summaries, and prescriptions.
### 3. Telehealth Transcription
Build ASR systems for virtual healthcare consultations across diverse accents and languages.
### 4. Medical Voice Assistants
Develop voice-enabled healthcare tools for symptom checking, medication reminders, and patient education.
### 5. Clinical Research
Analyze speech patterns in medical contexts and study communication dynamics between providers and patients.
## Loading the Dataset
```python
from datasets import load_dataset

REPO_ID = "SilencioNetwork/medical-speech-dataset"

# The card metadata defines two configs, each with a single split named "medical".
# Pick a config by name; the card metadata declares no default config.
english_medical = load_dataset(REPO_ID, "english_medical", split="medical")
global_medical = load_dataset(REPO_ID, "global_medical", split="medical")
```
## Sample Metadata
Each recording includes:
- `file_name`: Audio file identifier
- `birth_place`: Speaker's country/region of origin
- `language`: Primary language spoken
- `context`: Medical (clinical terminology, healthcare conversations)
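
Metadata fields like these make it easy to see how the recordings are distributed before training or evaluation. The sketch below counts rows per value of one field. It assumes each row is a dict-like mapping (for instance, rows read from `metadata.csv` with `csv.DictReader`); `count_by` is an illustrative helper, and missing or empty values are grouped under `"unknown"` by choice.

```python
from collections import Counter


def count_by(rows, field):
    """Count rows per value of `field`; missing or empty values are counted as 'unknown'."""
    return Counter(row.get(field) or "unknown" for row in rows)
```

For example, `count_by(rows, "language")` or `count_by(rows, "birth_place")` shows how many recordings back each language or region, which matters for a set this small.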
## Medical Speech Characteristics
This dataset captures real-world medical speech features:
- **Medical jargon**: "hypertension", "myocardial infarction", "differential diagnosis"
- **Clinical abbreviations**: Spoken medical shorthand (BP, HR, PRN, etc.)
- **Provider-patient dynamics**: Turn-taking, clarification requests, empathy markers
- **Multilingual medical contexts**: Healthcare delivery across linguistic boundaries
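
Because terminology is the point of this dataset, it can be useful to check whether an ASR output preserves the medical terms in the reference, separately from overall accuracy. The following is a minimal sketch: the term list is supplied by the caller (the examples above would be one choice), matching is case-insensitive on whole words, and `term_recall` is an illustrative helper rather than an established metric implementation.

```python
import re


def _contains(text, term):
    """Case-insensitive whole-word (or whole-phrase) match of `term` in `text`."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def term_recall(reference, hypothesis, terms):
    """Fraction of the terms present in `reference` that also appear in `hypothesis`.

    Returns None when the reference contains none of the listed terms.
    """
    expected = [t for t in terms if _contains(reference, t)]
    if not expected:
        return None
    found = [t for t in expected if _contains(hypothesis, t)]
    return len(found) / len(expected)
```

A hypothesis that renders "myocardial infarction" as unrelated words lowers this score even if most other words in the transcript are correct.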
## Ethical Considerations
All data was collected with explicit informed consent. No protected health information (PHI) is included: all recordings contain general medical terminology only, not patient-specific data.
## Need More Medical Speech Data?
This is a sample dataset from Silencio's larger Off-the-Shelf (OTS) medical speech inventory:
📊 **Available in full inventory:**
- 300+ hours of medical domain speech
- 15+ languages
- Specialized domains: cardiology, radiology, surgery, pharmacy, etc.
- Provider + patient perspectives
**Contact us for access**: [alex@silencioai.com](mailto:alex@silencioai.com)
## Citation
If you use this dataset in your research, or in a commercial product under a separate license, please cite:
```bibtex
@dataset{silencio_medical_speech_2026,
  title={Medical Speech Dataset},
  author={Silencio Network},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/SilencioNetwork/medical-speech-dataset}
}
```
## Related Datasets
- [Complete Voice AI Speech Dataset](https://huggingface.co/datasets/SilencioNetwork/complete-voiceai-speech-dataset) - 39 language/accent variants
- [Indian Languages Speech](https://huggingface.co/datasets/SilencioNetwork/indian-languages-speech) - 9 Indian languages
- [European Languages Speech](https://huggingface.co/datasets/SilencioNetwork/european-languages-speech) - 5 European languages
- [Global English Accents Speech](https://huggingface.co/datasets/SilencioNetwork/global-english-accents-speech) - 20 English accent variants
## License
**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)
- ✅ Free for research and non-commercial use
- ❌ Commercial use requires licensing (contact us)
## About Silencio
Silencio is a voice AI data sourcing company with 2M+ contributors across 180+ countries. We provide scaled sourcing of real-world audio and speech data for AI labs, robotics companies, and healthcare AI developers.
🌐 [silencioai.com](https://silencioai.com)
📧 [sofia@silencioai.com](mailto:sofia@silencioai.com)
---
**Tags**: medical speech, healthcare AI, clinical documentation, medical ASR, medical dictation, ambient scribe, domain-specific speech, medical terminology, healthcare NLP, voice health
Provider:
SilencioNetwork



