AgaCKNER: First Kurdish Sorani Named Entity Recognition Dataset
Source: DataCite Commons | Updated: 2025-04-10 | Indexed: 2025-04-16
Download link:
https://data.mendeley.com/datasets/b3wvj6jgx8
Description:
AgaCKNER is the first publicly accessible Named Entity Recognition (NER) dataset for the Kurdish Sorani language, developed to advance research in low-resource language processing. Curated from more than 160 Rudaw Media Network articles, it covers five domains: Kurdistan news, Middle East news, world news, economic news, and sports news. The dataset contains 2,534 sentences and 64,563 tokens, pre-processed and formatted in CoNLL for NER tasks. Entities are labelled in BIO format under five categories: PERSON, LOCATION, ORGANIZATION, DATE, and Miscellaneous. AgaCKNER is intended as a core resource for Kurdish Sorani natural language processing, and its structure makes it straightforward to generate training, validation, and test splits.
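The description above mentions CoNLL formatting, BIO labels, and train/validation/test splits. The following is a minimal sketch of how such a file might be read and split, not the authors' own tooling. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences; the exact column layout and tag spellings (for example `B-PERSON`) are assumptions to check against the downloaded file. The sample text is invented English placeholder data, and the function names and the 80/10/10 ratio are illustrative.

```python
import random

# Invented placeholder data in an assumed "token tag" layout; not taken from AgaCKNER.
SAMPLE = """\
Alice B-PERSON
Smith I-PERSON
visited O
Erbil B-LOCATION
. O

Rudaw B-ORGANIZATION
reported O
today B-DATE
. O
"""


def read_conll(text):
    """Parse CoNLL-style text into sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the BIO tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def bio_spans(sentence):
    """Return (entity_type, start, end) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, (_, tag) in enumerate(sentence):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        # Anything else ends the open span, if there is one.
        if etype is not None:
            spans.append((etype, start, i))
        start, etype = None, None
        # A B- tag (or a stray I- tag, handled leniently) opens a new span.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
    if etype is not None:
        spans.append((etype, start, len(sentence)))
    return spans


def split_sentences(sentences, train=0.8, dev=0.1, seed=13):
    """Shuffle with a fixed seed and cut into train/validation/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


if __name__ == "__main__":
    sentences = read_conll(SAMPLE)
    for sentence in sentences:
        print(bio_spans(sentence))
```

Splitting at the sentence level keeps each sentence's tag sequence intact. Because the articles come from five news domains, a split by article or by domain may be preferable if the file preserves that information, which this sketch does not assume.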
Provider:
Mendeley Data
Created:
2025-03-19



