KINNEWS and KIRNEWS
Source: arXiv | Updated: 2020-10-23 | Indexed: 2024-06-21
Download link:
https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus
Overview:
KINNEWS and KIRNEWS are two multi-class news classification datasets for Kinyarwanda and Kirundi, respectively, created by the Computational Intelligence Laboratory at the University of Electronic Science and Technology of China. Both consist of news articles collected from local news websites and newspapers: KINNEWS contains 21,268 articles and KIRNEWS contains 4,612. During curation, related categories were merged through manual annotation, leaving 14 classes in KINNEWS and 12 in KIRNEWS. The datasets are accompanied by preprocessing guidelines and baseline models. They are intended to ease the data scarcity that low-resource languages face in natural language processing (NLP) and to support research on these languages in areas such as representation learning and cross-lingual learning.
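Provided by:
Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China
Created:
2020-10-23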



