Kurdish Medical Corpus (KMC)

Name: Kurdish Medical Corpus (KMC)
Creator: Mendeley Data
Published: 2026-03-31 17:46:51
License: No description available

Source: DataCite Commons. Updated: 2026-03-31. Indexed: 2026-05-04.

Download link:

https://data.mendeley.com/datasets/cn5grbx63n/1

Description:

The Kurdish Medical Corpus (KMC) is a structured dataset of 121,918 clinical entries in Kurdish (Sorani). It is designed to support research on language-based medical data systems and low-resource language technologies, particularly in the healthcare domain.

The dataset covers a wide range of biomedical and clinical knowledge, including pharmacology, internal medicine, mental health, diagnostics, symptoms, anatomy, and general healthcare concepts, so the corpus reflects diverse real-world medical knowledge and terminology. Each entry is stored in a standardized JSON format, which gives structured medical information a consistent representation and simplifies computational processing and downstream analysis. KMC contains over 6.1 million words (approximately 7.9 million tokens), making it a large-scale resource for Kurdish medical text. The dataset is organized to preserve semantic clarity and domain consistency, so its content is suitable for automated processing and knowledge modeling.

Intended applications include named entity recognition, medical text classification, information extraction, question answering, semantic search, and large language model (LLM) fine-tuning. The structured format makes the corpus particularly useful for building domain-specific intelligent systems in healthcare. Overall, KMC is a significant resource for computational healthcare research in Kurdish: it helps close the gap in medical language resources for low-resource languages and supports further development of cross-lingual and AI-driven healthcare solutions.
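Because each entry is stored as JSON, basic corpus statistics can be computed with a few lines of Python. The following is a minimal sketch only: the listing does not document the entry schema, so the field names `text` and `category`, the placeholder records, and the assumption that the data is either one JSON array or JSON Lines (one object per line) are all hypothetical and should be checked against the downloaded files.

```python
import json
from collections import Counter


def iter_entries(json_text):
    """Yield entry dicts from a JSON array or from JSON Lines text."""
    stripped = json_text.strip()
    if stripped.startswith("["):
        # The whole document is a single JSON array of entries.
        yield from json.loads(stripped)
    else:
        # Otherwise treat each non-empty line as one JSON object.
        for line in stripped.splitlines():
            if line.strip():
                yield json.loads(line)


def count_words(entries, text_field="text"):
    """Count whitespace-separated words in one field across all entries."""
    return sum(len(str(e.get(text_field, "")).split()) for e in entries)


def count_by_field(entries, field="category"):
    """Count entries per value of a field; missing values become 'unknown'."""
    return Counter(e.get(field, "unknown") for e in entries)


# Placeholder records, NOT real KMC data; field names are hypothetical.
sample = (
    '[{"id": 1, "category": "pharmacology", "text": "placeholder one two"},'
    ' {"id": 2, "category": "anatomy", "text": "placeholder three"}]'
)

entries = list(iter_entries(sample))
print(count_words(entries))
print(count_by_field(entries))
```

Whitespace splitting is only a rough word count; the token figure quoted in the description depends on a tokenizer that the listing does not name, so this sketch would not be expected to reproduce it.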

Provider:

Mendeley Data

Created:

2026-03-31
