jacekduszenko/rare-medical-terms
Hugging Face · updated 2026-04-04 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/jacekduszenko/rare-medical-terms
---
license: cc0-1.0
task_categories:
- audio-classification
- text-classification
language:
- en
- multilingual
tags:
- medical
- drug-names
- rare-words
- keyword-spotting
- audio-text-embedding
size_categories:
- 100K<n<1M
---
# Rare Medical Terms Dataset
A curated collection of **626,251 speakable medical/pharmaceutical/chemical terms** for training audio-text embedding models (e.g., CLAP-style keyword spotting).
## Sources
| Source | Terms | License |
|---|---|---|
| [DrugBank Vocabulary](https://go.drugbank.com/releases/latest#open-data) | ~33K | CC0 |
| [ChEBI](https://www.ebi.ac.uk/chebi/) | ~180K | Free/Open |
| [MeSH Descriptors](https://www.nlm.nih.gov/mesh/) | ~150K | Public Domain |
| [MeSH Supplementary Records](https://www.nlm.nih.gov/mesh/) | ~260K | Public Domain |
## Filtering
Raw terms were filtered to keep only "speakable" entries:
- No chemical formulas (brackets, equals signs, etc.)
- No numeric characters
- Length between 3 and 50 characters
- At least 90% alphabetic characters
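
The four rules above can be sketched as a single predicate. This is an illustration, not the script used to build the dataset: the exact set of formula characters, the inclusive length bounds, and counting spaces and hyphens toward the 90% denominator are all assumptions, and `is_speakable` and `FORMULA_CHARS` are hypothetical names.

```python
# Assumed formula markers: the card only names "brackets, equals signs, etc.".
FORMULA_CHARS = set("[](){}=")


def is_speakable(term: str) -> bool:
    """Return True if `term` passes the four filters described above."""
    term = term.strip()
    # Length between 3 and 50 characters (bounds assumed inclusive).
    if not 3 <= len(term) <= 50:
        return False
    # No chemical-formula punctuation.
    if any(ch in FORMULA_CHARS for ch in term):
        return False
    # No numeric characters.
    if any(ch.isdigit() for ch in term):
        return False
    # At least 90% alphabetic characters; integer arithmetic avoids float rounding.
    alpha = sum(ch.isalpha() for ch in term)
    return alpha * 10 >= len(term) * 9
```

Under these assumptions, a plain name such as `ibuprofen` passes, while `2-propanol` (digit) and `poly(ethylene)` (brackets) are rejected.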
## Format
JSONL with fields:
- `id`: integer index
- `term`: the medical, pharmaceutical, or chemical term
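
A minimal sketch of reading this format with the standard library. The helper name `iter_terms` and the filename `terms.jsonl` are hypothetical, since the card does not name the data file.

```python
import json
from typing import Iterable, Iterator


def iter_terms(lines: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Yield (id, term) pairs from JSONL lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield int(record["id"]), record["term"]


# Usage with a local copy of the data ("terms.jsonl" is a hypothetical filename):
# with open("terms.jsonl", encoding="utf-8") as fh:
#     vocab = dict(iter_terms(fh))
```

Iterating line by line keeps memory use flat, which matters little at this size but makes the same helper usable on a streamed file.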
## Use Case
Designed for fine-tuning audio-text contrastive models (a Whisper encoder and a text encoder trained with a SigLIP loss) to detect rare medical terms in long audio via embedding similarity.
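
For readers unfamiliar with the SigLIP objective, here is a dependency-free sketch of the pairwise sigmoid loss on a batch of embeddings. It assumes the embeddings are already L2-normalized and that `audio[i]` matches `text[i]`. The defaults `temperature=10.0` and `bias=-10.0` follow the initial values reported in the SigLIP paper, where both are learnable. The function names are illustrative, and nothing here is taken from the author's training code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def siglip_loss(audio, text, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of embedding pairs.

    audio[i] and text[i] form a positive pair; every other (i, j)
    combination is treated as a negative.
    """
    n = len(audio)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Dot product equals cosine similarity for unit-length vectors.
            sim = sum(a * t for a, t in zip(audio[i], text[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * sim + bias))
    return -total / n
```

Unlike a softmax contrastive loss, each audio-text pair is scored independently, so the loss does not require normalizing over the whole batch.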
Provider: jacekduszenko



