ajesujoba/twi_text_c3

Name: ajesujoba/twi_text_c3
Creator: ajesujoba
Published: 2024-01-18 11:17:37
License: No description available

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/ajesujoba/twi_text_c3


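Resource overview:

The Twi Text C3 dataset was built to compare pre-trained word embeddings (such as fastText) with embeddings trained on Twi text. It contains text collected from several web sources (the Bible, JW300, Wikipedia, and others), ranging from clean text (such as the Bible) to noisy text with misspellings and mixed dialects. Its main use is training word embeddings and language models for Twi. The language is Twi, and only a training split is provided. The dataset was created to introduce a new language resource for Twi. Its sources include Jehovah Witness, the Twi Bible, and Yorùbá Wikipedia. The annotation process is not documented. Because it includes JW300 and the Bible, the dataset is biased toward the religious domain. Its creators are students at Saarland University, and it is released under the Creative Commons Attribution-NonCommercial 4.0 license.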

Provided by:

ajesujoba

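Original Information Summary

Dataset Overview

Dataset Description

Dataset Summary

The Twi Text C3 dataset was collected from several web sources (the Bible, JW300, Wikipedia, and others) to compare pre-trained word embeddings (fastText) with embeddings trained on curated Twi text. It contains clean text (such as the Bible) as well as noisy text with incorrect spellings and mixed dialects.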

Supported Tasks and Leaderboards

The dataset is mainly intended for training word embeddings and language models on Twi text.
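As a rough illustration of a first preprocessing step before embedding training, the sketch below counts token frequencies and drops rare tokens. It assumes lowercasing and whitespace tokenization, and the `build_vocab` helper and its `min_count` parameter are hypothetical names chosen here. It is not the authors' pipeline.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocab(sentences: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Count whitespace-separated tokens and keep those seen at least min_count times.

    Assumption: lowercasing plus whitespace splitting is a stand-in tokenizer,
    not the preprocessing used to build the dataset.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return {token: n for token, n in counts.items() if n >= min_count}


# Toy corpus: the first sentence is the example from the dataset card,
# the second is made up from the same words for illustration.
corpus = [
    "mfitiaseɛ no onyankopɔn bɔɔ ɔsoro ne asaase",
    "onyankopɔn bɔɔ asaase",
]
vocab = build_vocab(corpus, min_count=2)
```

With `min_count=2`, only tokens that occur in both toy sentences remain in `vocab`. A real run would pass the full list of `text` values from the training split.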

Languages

The supported language is Twi.

Dataset Structure

Data Instances

Each data point is a single sentence, for example:

{ "text": "mfitiaseɛ no onyankopɔn bɔɔ ɔsoro ne asaase" }

Data Fields

text: a string feature; each line holds one sentence.

Data Splits

Only a training split is provided.
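The record layout described above can be read with a few lines of Python. This is a minimal sketch that assumes the records are available as JSON Lines, one object per line, which the card does not state. The `iter_sentences` helper is a hypothetical name, not part of any library.

```python
import json
from typing import Iterable, Iterator


def iter_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the `text` field of each JSON record, skipping blank lines.

    Assumption: one JSON object per line, each with a string `text` field.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            raise ValueError(f"line {line_number}: expected a string 'text' field")
        yield record["text"]


# The sample record is the example instance from the dataset card.
sample_lines = [
    '{"text": "mfitiaseɛ no onyankopɔn bɔɔ ɔsoro ne asaase"}',
    "",
]
sentences = list(iter_sentences(sample_lines))
```

Validating the field type up front surfaces malformed rows with a line number instead of failing later during tokenization.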

Dataset Creation

Curation Rationale

The dataset was created to help introduce a new language resource for Twi.

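Source Data

Initial Data Collection and Normalization

The data comes from several web sources: the Bible, JW300, and Wikipedia. For a summary and statistics of the data, see Table 1 of the paper.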

Who Are the Source Language Producers?

Jehovah Witness (JW300)
Twi Bible
Yorùbá Wikipedia

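Annotations

Annotation Process

[More Information Needed]

Who Are the Annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]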

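Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

Because it includes JW300 and the Bible, the dataset is biased toward the religious domain (Christianity).

Other Known Limitations

[More Information Needed]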

Additional Information

Dataset Curators

The dataset was curated by Kwabena Amponsah-Kaakyire, Jesujoba Alabi, and David Adelani, who are students at Saarland University in Germany.

Licensing Information

The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 license.

Citation Information

@inproceedings{alabi-etal-2020-massive,
    title = "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of {Y}or{\`u}b{\'a} and {T}wi",
    author = "Alabi, Jesujoba and Amponsah-Kaakyire, Kwabena and Adelani, David and Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.335",
    pages = "2754--2762",
    abstract = "The success of several architectures to learn semantic representations from unannotated text and the availability of these kind of texts in online multilingual resources such as Wikipedia has facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and test sets to evaluate on. For low-resourced languages, the evaluation is more difficult and normally ignored, with the hope that the impressive capability of deep learning architectures to learn (multilingual) representations in the high-resourced setting holds in the low-resourced setting too. In this paper we focus on two African languages, Yor{\`u}b{\'a} and Twi, and compare the word embeddings obtained in this way, with word embeddings obtained from curated corpora and a language-dependent processing. We analyse the noise in the publicly available corpora, collect high quality and noisy data for the two languages and quantify the improvements that depend not only on the amount of data but on the quality too. We also use different architectures that learn word representations both from surface forms and characters to further exploit all the available information which showed to be important for these languages. For the evaluation, we manually translate the wordsim-353 word pairs dataset from English into Yor{\`u}b{\'a} and Twi. We extend the analysis to contextual word embeddings and evaluate multilingual BERT on a named entity recognition task. For this, we annotate with named entities the Global Voices corpus for Yor{\`u}b{\'a}. As output of the work, we provide corpora, embeddings and the test suits for both languages.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Contributions

Thanks to @dadelani for adding this dataset.
