Kurdish Medical Corpus (KMC)
Source: DataCite Commons. Updated 2026-03-31. Indexed 2026-05-04.
Download link:
https://data.mendeley.com/datasets/cn5grbx63n/1
Description:
The Kurdish Medical Corpus (KMC) is a structured dataset of 121,918 clinical entries in Kurdish (Sorani). It is designed to support research on medical language data systems and low-resource language technologies, particularly in the healthcare domain.
The dataset covers a wide spectrum of biomedical and clinical knowledge, including pharmacology, internal medicine, mental health, diagnostics, symptoms, anatomy, and general healthcare concepts. This broad coverage ensures that the corpus reflects diverse real-world medical knowledge and terminology. Each entry is stored in a standardized JSON format, enabling consistent representation of structured medical information and facilitating computational processing and downstream analytical tasks.
KMC contains over 6.1 million words (~7.9 million tokens), making it a large-scale resource for Kurdish medical text data. The dataset is organized to preserve semantic clarity and domain consistency, providing high-quality structured content suitable for automated processing and knowledge modeling.
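Because the entries are stored as JSON, a first sanity check after downloading is to load them and count entries and words. The listing does not give the file layout or field names, so the following is only a minimal sketch under stated assumptions: the data is either a single JSON array or one JSON object per line, and the entry text lives in a field here called `text` (a hypothetical name; substitute whatever the released files actually use).

```python
import json
from pathlib import Path
from typing import Any


def parse_entries(raw: str) -> list[dict[str, Any]]:
    """Parse a JSON array, a single JSON object, or JSON Lines text."""
    raw = raw.strip()
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed fallback: one JSON object per non-empty line (JSON Lines).
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    # A lone object is wrapped so callers always receive a list.
    return data if isinstance(data, list) else [data]


def load_entries(path: str) -> list[dict[str, Any]]:
    """Read a UTF-8 file and parse its entries."""
    return parse_entries(Path(path).read_text(encoding="utf-8"))


def count_words(entries: list[dict[str, Any]], text_field: str = "text") -> int:
    """Count whitespace-separated words in a (hypothetical) text field."""
    total = 0
    for entry in entries:
        value = entry.get(text_field, "")
        if isinstance(value, str):
            total += len(value.split())
    return total
```

Calling `load_entries` on a downloaded file and passing the result to `count_words` gives a rough comparison against the reported figures. A whitespace count will not necessarily match them exactly, since the listing does not say how words and tokens were counted.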
The corpus is intended for a variety of applications, including named entity recognition, medical text classification, information extraction, question answering systems, semantic search, and large language model (LLM) fine-tuning. Its structured nature makes it particularly useful for building domain-specific intelligent systems in healthcare.
Overall, KMC is a significant resource for computational healthcare research in Kurdish. It helps close the gap in medical language resources for low-resource languages and supports further development of cross-lingual and AI-driven healthcare solutions.
Provider:
Mendeley Data
Created:
2026-03-31



