Setswana and Sepedi News Headlines
Source: arXiv. Updated: 2020-03-31. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2004.13842v1
Description:
This study builds a dataset of news headlines for two low-resource South African languages, Setswana and Sepedi. The headlines are drawn from Thobela FM and Motsweding FM, radio stations of the South African Broadcasting Corporation (SABC), and comprise 219 Setswana headlines and 491 Sepedi headlines. The research team collected the data from social media and broadcast content, then annotated each headline with a category such as law, sports, or politics. The dataset is intended as a language processing resource for these low-resource languages, supporting the development of automation and language technologies.
Provided by:
University of Pretoria, South Africa
Created:
2020-03-31



