ajesujoba/twi_text_c3
Dataset Overview
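Dataset Description
Dataset Summary
The Twi Text C3 dataset was collected from several web sources (the Bible, JW300, Wikipedia, and others) in order to compare pre-trained word embeddings (fastText) with embeddings trained on curated Twi text. It contains both clean text (such as the Bible) and noisy text with incorrect spellings and mixed dialects.
Supported Tasks and Leaderboards
The dataset is intended mainly for training word embeddings and language models on Twi text.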
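To give a feel for the subword idea behind fastText-style embeddings, here is a minimal sketch of character n-gram extraction with `<` and `>` as word-boundary markers. The function name `char_ngrams` is hypothetical, the default lengths 3 to 6 mirror fastText's default `minn`/`maxn` settings, and nothing here is claimed to be the exact pipeline used to build or evaluate this dataset.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return character n-grams of `word`, fastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    get n-grams distinct from word-internal ones. This sketch omits
    the extra whole-word token that fastText also keeps.
    """
    marked = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the marked word.
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


# Trigrams only, for a word from the sample sentence in this card.
print(char_ngrams("asaase", min_n=3, max_n=3))
```

Sharing n-grams across word forms is one reason subword models can be attractive for noisy, low-resource text, although whether it helps on this corpus is an empirical question.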
Languages
The supported language is Twi.
Dataset Structure
Data Instances
Each data point is a single sentence, for example:
{ "text": "mfitiaseɛ no onyankopɔn bɔɔ ɔsoro ne asaase" }
Data Fields
text: a string feature; each row holds one sentence.
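As a minimal sketch of working with this schema, the snippet below parses records and splits each sentence on whitespace. It assumes the records are available as JSON Lines (one object per line), which is an assumption: the card itself only specifies the `text` field. The helper name `iter_sentences` is hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the whitespace-separated tokens of each record's `text` field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield record["text"].split()


# Sample record copied from the data instance in this card.
sample_lines = ['{"text": "mfitiaseɛ no onyankopɔn bɔɔ ɔsoro ne asaase"}']
sentences = list(iter_sentences(sample_lines))
print(sentences[0])
```

Whitespace splitting is only a starting point; the noisy portions of the corpus may need additional normalization before training.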
Data Splits
Only a training split is provided.
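Because only a training split is available, any held-out data for evaluating a language model has to be carved out by the user. The following is a minimal sketch using a seeded shuffle; the function name `train_validation_split`, the 10% fraction, and the seed are illustrative choices, not part of the dataset.

```python
import random
from typing import List, Sequence, Tuple


def train_validation_split(
    sentences: Sequence[str],
    validation_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(sentences)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy stand-in for the corpus; real sentences would come from the dataset.
toy_corpus = [f"sentence {i}" for i in range(20)]
train, validation = train_validation_split(toy_corpus)
print(len(train), len(validation))
```

Since the clean and noisy sources differ in quality, a split stratified by source may be preferable if source labels are available.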
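Dataset Creation
Curation Rationale
The dataset was created to help introduce new language resources for Twi.
Source Data
Initial Data Collection and Normalization
The dataset comes from several web sources: the Bible, JW300, and Wikipedia. For a detailed summary and statistics of the data, see Table 1 in the paper.
Who are the source language producers?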
- Jehovah Witness (JW300)
- Twi Bible
- Twi Wikipedia
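Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]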
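Discussion of Biases
Because it includes JW300 and the Bible, the dataset is biased towards the religious (Christian) domain.
Other Known Limitations
[More Information Needed]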
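Additional Information
Dataset Curators
The dataset was curated by Kwabena Amponsah-Kaakyire, Jesujoba Alabi, and David Adelani, students at Saarland University, Germany.
Licensing Information
The dataset is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
Citation Information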
@inproceedings{alabi-etal-2020-massive,
    title = "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of {Y}or{\`u}b{\'a} and {T}wi",
    author = "Alabi, Jesujoba and Amponsah-Kaakyire, Kwabena and Adelani, David and Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.335",
    pages = "2754--2762",
    abstract = "The success of several architectures to learn semantic representations from unannotated text and the availability of these kind of texts in online multilingual resources such as Wikipedia has facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and test sets to evaluate on. For low-resourced languages, the evaluation is more difficult and normally ignored, with the hope that the impressive capability of deep learning architectures to learn (multilingual) representations in the high-resourced setting holds in the low-resourced setting too. In this paper we focus on two African languages, Yor{\`u}b{\'a} and Twi, and compare the word embeddings obtained in this way, with word embeddings obtained from curated corpora and a language-dependent processing. We analyse the noise in the publicly available corpora, collect high quality and noisy data for the two languages and quantify the improvements that depend not only on the amount of data but on the quality too. We also use different architectures that learn word representations both from surface forms and characters to further exploit all the available information which showed to be important for these languages. For the evaluation, we manually translate the wordsim-353 word pairs dataset from English into Yor{\`u}b{\'a} and Twi. We extend the analysis to contextual word embeddings and evaluate multilingual BERT on a named entity recognition task. For this, we annotate with named entities the Global Voices corpus for Yor{\`u}b{\'a}. As output of the work, we provide corpora, embeddings and the test suits for both languages.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
Contributions
Thanks to @dadelani for adding this dataset.



