temitopeolagoke/yorumed

Name: temitopeolagoke/yorumed
Creator: temitopeolagoke
Published: 2026-04-21 12:59:17
License: no description provided

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/temitopeolagoke/yorumed

Resource description:

---
language:
- yo
- en
license: cc-by-4.0
task_categories:
- translation
- text-classification
pretty_name: YoruMed
size_categories:
- 1K<n<10K
tags:
- yoruba
- medical
- low-resource
- african-languages
- nlp
- terminology
- biomedical
---

# YoruMed: A Yoruba Medical Terminology Dataset

## Dataset Description

**YoruMed** is a structured Yoruba medical terminology dataset comprising 1,000 English medical terms, their plain-language English definitions, and corresponding Yoruba translations.

Yoruba is spoken by over 50 million people across Nigeria, Benin, and Togo, yet remains severely underrepresented in biomedical NLP. YoruMed addresses this gap by providing a curated, linguistically annotated dataset that supports medical machine translation, terminology extraction, cross-lingual retrieval, and evaluation of multilingual language models in Yoruba.

- **Curated by:** Temitope Olagoke, Department of Linguistics, Obafemi Awolowo University, Ile-Ife, Nigeria
- **Language(s):** Yoruba (yo), English (en)
- **License:** CC-BY 4.0
- **Originally created:** 2019 (undergraduate linguistics thesis, OAU)
- **Released as NLP dataset:** 2026

---

## Dataset Structure

### Fields

| Field | Type | Description |
|---|---|---|
| `english_term` | string | The English medical term |
| `english_definition` | string | Plain-language English definition of the term |
| `yoruba_translation` | string | Yoruba translation with full tonal diacritics |
| `translation_strategy` | categorical | Broad NLP category: `MEANING-BASED` or `FORM-BASED` |
| `linguistic_strategy` | categorical | Linguistic detail: `MORPHOLOGICAL`, `EXISTING`, or `LOANWORD` |
| `medical_domain` | categorical | Medical domain (see categories below) |

### Translation Strategy

Translations were produced using a theoretically grounded framework based on Larson (1984) and Nida & Taber (1982):

**Broad categories (NLP-friendly):**
- `MEANING-BASED` (901 terms, 90.1%): Yoruba-native translation, prioritising meaning equivalence over formal correspondence
- `FORM-BASED` (99 terms, 9.9%): Phonological adaptation of the English/Latin source term into Yoruba

**Linguistic detail (3-way):**
- `MORPHOLOGICAL` (809 terms, 80.9%): New Yoruba terms coined using native morphological resources (compounding, derivation, descriptive circumlocution)
- `EXISTING` (92 terms, 9.2%): Established Yoruba words applied to medical concepts
- `LOANWORD` (99 terms, 9.9%): English/Latin terms adapted into Yoruba phonology and orthography

### Medical Domains

| Domain | Count |
|---|---|
| General Medicine | 209 |
| Anatomy | 178 |
| Haematology | 162 |
| Infectious Disease | 128 |
| Immunology | 116 |
| Pharmacology | 62 |
| Obstetrics & Gynaecology | 50 |
| Neurology | 49 |
| Cardiology | 36 |
| Surgery | 10 |

---

## Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("temitopeolagoke/yorumed")
print(dataset)
```

Or load directly with pandas:

```python
import pandas as pd

df = pd.read_csv("https://huggingface.co/datasets/temitopeolagoke/yorumed/resolve/main/YoruMed_final_v2.csv")
print(df.head())
print(f"Total terms: {len(df)}")
```

### Example filters

```python
# `df` is the DataFrame loaded in the pandas example above.

# Filter by domain
infectious = df[df['medical_domain'] == 'Infectious Disease']

# Filter by translation strategy
morphological = df[df['linguistic_strategy'] == 'MORPHOLOGICAL']

# Filter by broad strategy
meaning_based = df[df['translation_strategy'] == 'MEANING-BASED']
```

---

## Intended Uses

YoruMed supports the following NLP tasks:

### 1. Medical Machine Translation (English → Yoruba)

Evaluate whether multilingual models (NLLB-200, mBERT, Aya) correctly translate English medical terminology into Yoruba. Use as fine-tuning data to improve medical translation quality.

### 2. Biomedical Named Entity Recognition

Use as a reference lexicon for developing Yoruba biomedical NER systems to identify medical entities in Yoruba-language clinical text.

### 3. Cross-Lingual Information Retrieval

The parallel structure of English definitions and Yoruba translations supports retrieval tasks that assess whether multilingual embedding models encode Yoruba medical concepts correctly.

### 4. LLM Evaluation in Yoruba

Benchmark large language models (GPT-4, Claude, Llama, Aya) on Yoruba medical terminology knowledge to quantify performance gaps relative to English.

### 5. Linguistic Analysis

The `linguistic_strategy` and `medical_domain` annotations support research into Yoruba lexical adaptation patterns, morphological productivity, and domain-specific vocabulary development.

---

## Limitations

1. **Single translator:** All translations were produced by one linguist and validated by a faculty supervisor at OAU. Community-scale validation has not been conducted.
2. **Dialect variation:** Standard Yoruba orthography is used. Yoruba has significant dialect variation; some translations may not be equally natural across all dialect communities.
3. **Domain concentration:** Some concentration in infectious disease/HIV-AIDS terminology, reflecting the primary reference source (Yusuff et al., 2017).
4. **Scale:** At 1,000 entries, YoruMed is a terminology dataset, not a large-scale corpus. It is a benchmark and reference resource, not a language modelling training corpus.
5. **Date of compilation:** Originally compiled in 2019. Some drug names and clinical terminology may have evolved since.

---

## Ethical Considerations

YoruMed is derived from publicly available English medical terminology and original Yoruba translations. No patient data or clinical records were used.

**Important:** YoruMed is an NLP research resource. It should **not** be used as a substitute for qualified medical interpreters or practitioners in clinical settings.

This dataset was created in the spirit of language equity: Yoruba speakers deserve access to health information and AI technology in their own language. Researchers using YoruMed are encouraged to engage with Yoruba-speaking medical communities for further validation.

---

## Citation

If you use YoruMed in your research, please cite:

```bibtex
@dataset{olagoke2026yorumed,
  author    = {Olagoke, Temitope},
  title     = {YoruMed: A Yoruba Medical Terminology Dataset for Low-Resource African Language NLP},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/temitopeolagoke/yorumed},
  note      = {Originally developed as undergraduate linguistics thesis, Obafemi Awolowo University, 2019}
}
```

---

## Related Work

- Yusuff, L. A., Adetunji, A. & Odoje, C. (2017). *English-Yoruba Glossary of HIV, AIDS and Ebola-Related Terms*. University Press Plc Ibadan.
- Adelani, D. et al. (2022). MAFAND-MT: A Benchmark for Low-Resource African Language Machine Translation. *ACL 2022*.
- Nekoto, W. et al. (2020). Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. *EMNLP Findings 2020*.
- Nida, E. A. & Taber, C. R. (1982). *The Theory and Practice of Translation*. E.J. Brill.
- Larson, M. L. (1984). *Meaning-Based Translation*. University Press of America.

---

## Contact

- Temitope Olagoke
- hello.temitopeolagoke@gmail.com
- HuggingFace: [@temitopeolagoke](https://huggingface.co/temitopeolagoke)
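
## Appendix: Comparing Yoruba Strings with Diacritics

Because `yoruba_translation` carries full tonal diacritics, a model output can look identical to a reference while using a different Unicode encoding (precomposed versus combining characters). The following is a minimal sketch of how an exact-match check for the evaluation uses described above might guard against this, assuming NFC normalization is acceptable for the comparison. The helper names `normalize_term` and `exact_match_accuracy` are hypothetical and not part of YoruMed or any library, and the sample strings are illustrative, not drawn from the dataset.

```python
import unicodedata


def normalize_term(text: str) -> str:
    """Return a canonical form of a term for comparison.

    NFC normalization maps canonically equivalent encodings of the same
    diacritics to one code point sequence; strip() and casefold() remove
    incidental whitespace and case differences.
    """
    return unicodedata.normalize("NFC", text).strip().casefold()


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize_term(p) == normalize_term(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Illustrative strings only (not taken from YoruMed): three encodings of the
# same Yoruba word, written with escapes so the code points are unambiguous.
composed = "\u1eb9\u0300j\u1eb9\u0300"      # precomposed e-with-dot-below + combining grave
decomposed = "e\u0323\u0300je\u0323\u0300"  # base letter + combining dot below + combining grave
mixed = "\u00e8\u0323j\u00e8\u0323"         # precomposed e-with-grave + combining dot below

# A raw comparison treats the encodings as different strings.
print(composed == decomposed)
# After normalization all three encodings compare equal.
print(normalize_term(composed) == normalize_term(decomposed) == normalize_term(mixed))
# One of the two predictions matches its reference once normalized.
print(exact_match_accuracy([decomposed, "unrelated"], [composed, composed]))
```

Exact match is a strict metric; for longer descriptive translations a softer measure may be more informative, but the same normalization step would still apply before scoring.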

