
Kurdish Medical Corpus (KMC)

DataCite Commons | Updated 2026-03-31 | Indexed 2026-05-04
Download link:
https://data.mendeley.com/datasets/cn5grbx63n/1
Description:
The Kurdish Medical Corpus (KMC) is a structured dataset of 121,918 clinical entries in Kurdish (Sorani), containing over 6.1 million words (~7.9 million tokens). It is designed to support research on medical language data systems and low-resource language technologies, particularly in healthcare.

The corpus covers a broad range of biomedical and clinical knowledge: pharmacology, internal medicine, mental health, diagnostics, symptoms, anatomy, and general healthcare concepts. Each entry is stored in a standardized JSON format, which gives the medical information a consistent representation and simplifies computational processing and downstream analysis. The dataset is organized to preserve semantic clarity and domain consistency.

Intended applications include named entity recognition, medical text classification, information extraction, question answering, semantic search, and large language model (LLM) fine-tuning. The structured format makes the corpus particularly useful for building domain-specific healthcare systems. By adding a large medical resource for Kurdish, KMC helps close the gap in low-resource medical language resources and supports further work on cross-lingual and AI-driven healthcare solutions.
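Because the entries are distributed as JSON, a first inspection can be done with the Python standard library alone. The sketch below is illustrative only: the description does not document the file layout or field names, so the code accepts either a single JSON array or one JSON object per line, and the field name `text` is a hypothetical placeholder to be replaced after looking at the real files.

```python
import json
from pathlib import Path


def load_entries(path):
    """Load entries from a JSON array file or a JSON Lines file.

    The actual KMC file layout is an assumption here; both common
    layouts are handled so the function can be adapted after inspection.
    """
    # Sorani Kurdish is written in Arabic script, so read explicitly as UTF-8.
    text = Path(path).read_text(encoding="utf-8")
    stripped = text.lstrip()
    if stripped.startswith("["):
        # Whole file is one JSON array of entry objects.
        return json.loads(stripped)
    # Otherwise treat every non-empty line as one JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_words(entries, text_field="text"):
    """Count whitespace-separated words in one field across all entries.

    `text_field` is a hypothetical field name, not taken from the dataset.
    Entries that lack the field contribute zero words.
    """
    return sum(len(str(entry.get(text_field, "")).split()) for entry in entries)
```

Calling `load_entries` on a downloaded file and passing the result to `count_words` gives a rough word count to compare against the reported 6.1 million words. A plain whitespace split is only an approximation and will not necessarily reproduce that figure or the token count, since the counting method used for the corpus is not stated.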
Provider:
Mendeley Data
Created:
2026-03-31