
South African News Dataset

Source: arXiv | Updated: 2020-02-18 | Indexed: 2024-06-21
Download link:
https://doi.org/10.5281/zenodo.3668489
Description:
The South African News Dataset, created by the University of Pretoria and collaborating institutions, contains 710 news headlines in Setswana and Sepedi, two low-resource languages. The headlines were collected by crawling the Facebook pages of the South African Broadcasting Corporation and were then annotated with topic categories such as law, sports, and politics. The creators also applied data augmentation to improve the performance of classification models. The dataset is intended to support natural language processing research on low-resource languages and to help address the shortage of data resources for them.
Provider:
University of Pretoria
Created:
2020-02-18