Hausa Language Dataset
Source: arXiv. Updated: 2021-02-17. Indexed: 2024-06-21.
Download link:
https://github.com/ijdutse/hausa-corpus
Description:
The Hausa language dataset was created by Isa Inuwa-Dutse of the School of Computer Science, University of St Andrews, to provide Hausa language resources for natural language processing (NLP) tasks. It contains formal and informal Hausa text collected from credible websites and online social media networks. The collection is large and diverse, and it is described as the first to offer a large volume of Hausa social media data, which captures peculiarities of the language. The work covers data collection and preprocessing, and describes how to obtain the data. Intended applications include machine translation, automatic speech recognition, and detection of fake online content, helping to address the scarcity of Hausa resources in NLP.
Provider:
School of Computer Science, University of St Andrews
Created:
2021-02-14



