Korean Bio-Medical Corpus (KBMC)
Source: arXiv. Updated: 2024-03-24. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2403.16158v1
Resource overview:
KBMC is the first open-source medical named entity recognition (NER) dataset for Korean, created by Seoul National University and KAIST. It contains 6,150 sentences totaling 153,971 tokens and covers 4,162 disease names, 841 body parts, and 396 treatments. ChatGPT was used to help generate sentences containing medical terminology, and entities are annotated in the BIO tagging format. KBMC is primarily intended to improve NER accuracy in medical natural language processing (NLP) for Korean text. When combined with the Naver NER dataset, it is reported to significantly improve medical entity recognition performance.
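As a rough illustration of the BIO tagging format mentioned above, the sketch below groups tagged tokens into entity spans. The label names (`Disease`, `Body`), the English example sentence, and the helper `bio_to_spans` are assumptions made for illustration only; they are not taken from the KBMC release, whose actual label strings and file layout may differ.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    B-X opens a new entity of type X, I-X continues an open entity of the
    same type, and anything else (including O or a stray I-X with no
    matching open entity) closes the current entity.
    """
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Emit the entity collected so far, if any.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hypothetical example sentence and labels (not from the dataset).
tokens = ["Chronic", "gastritis", "affects", "the", "stomach", "."]
tags = ["B-Disease", "I-Disease", "O", "O", "B-Body", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Joining tokens with a space is a simplification; Korean text tokenized into morphemes or subwords would need a tokenizer-specific way to rebuild the surface string.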
Provided by:
Seoul National University
Created:
2024-03-24



