Dagbani Wiki Text
Mendeley Data. Updated 2024-05-10; indexed 2024-06-28.
Download link:
https://zenodo.org/records/8186835
Description:
The Dagbani Sentences Dataset is a collection of sentences in Dagbani, a Gur language spoken by the Dagomba people of Northern Ghana. The dataset was obtained by scraping sentences from Wikipedia articles written in Dagbani.
Content: The dataset is distributed as a zip file containing a single text file. Each line of the text file is one sentence from an article on the Dagbani Wikipedia. The file is UTF-8 encoded and covers a wide range of topics, including folklore, legends, education, politics, and health.
Source: The sentences were extracted from Dagbani Wikipedia pages using web scraping techniques. The data were collected with respect for the copyright and usage policies of the Wikimedia Foundation.
Use Cases: The dataset can be valuable for researchers, linguists, and natural language processing (NLP) practitioners who study the Dagbani language. It is particularly suited to language modelling with models such as GPT and BERT.
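The file layout described under Content (a zip archive holding one UTF-8 text file with one sentence per line) can be read with the Python standard library alone. The following is a minimal sketch, not part of the dataset's own tooling. The function name `load_sentences` and the local filename `dagbani_sentences.zip` are assumptions made for illustration; the listing does not state the name of the archive or of the text file inside it, so the sketch simply takes the first `.txt` member it finds.

```python
import io
import zipfile
from pathlib import Path


def load_sentences(zip_path):
    """Return the non-empty lines of the first .txt member of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # The member name is not documented, so take the first .txt file found.
        txt_members = [n for n in archive.namelist() if n.lower().endswith(".txt")]
        if not txt_members:
            raise ValueError("no .txt file found in archive")
        with archive.open(txt_members[0]) as raw:
            # Decode as UTF-8, the encoding stated in the dataset description.
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return [line.strip() for line in text if line.strip()]


if __name__ == "__main__":
    # Hypothetical local filename; substitute the name of the downloaded archive.
    path = Path("dagbani_sentences.zip")
    if path.exists():
        sentences = load_sentences(path)
        print(len(sentences), "sentences loaded")
```

Blank lines are skipped so that each list element corresponds to one sentence; whether the published file contains blank lines at all is not stated in the listing.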
Created:
2023-08-02



