mbazaNLP/NMT_Education_parallel_data_en_kin

Name: mbazaNLP/NMT_Education_parallel_data_en_kin
Creator: mbazaNLP
Published: 2023-09-11 13:23:44
License: not specified

Source: Hugging Face. Updated 2023-09-11. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/mbazaNLP/NMT_Education_parallel_data_en_kin

Overview:

This dataset supports the development of a machine translation model for bidirectional translation between Kinyarwanda and English, specifically for the Atingi learning platform. The data is stored in TSV format, and the associated model is mbazaNLP/Nllb_finetuned_education_en_kin on Hugging Face. Each instance is a translation pair; the fields include id, source_id, source, and phrase, among others. The dataset is split into training, validation, and test sets, and preprocessing consists of that splitting together with construction of the test set. English source sentences were scraped from several websites, including Coursera, Atingi, and Wikipedia. Human translators were then hired to translate the collected sentences, and each translation received a validation score used to ensure quality.

Provider:

mbazaNLP

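Summary of original information

Dataset description

This dataset was built to develop a machine translation model for bidirectional translation of education-domain sentences between English and Kinyarwanda, designed specifically for the Atingi learning platform.

Data format: TSV
Model: mbazaNLP/Nllb_finetuned_education_en_kin (Hugging Face)

Dataset overview

Data instance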

118347 103384 And their ideas was that the teachers just didnt care and had no time for them. Kandi igitekerezo cyabo nuko abarimu batabitayeho gusa kandi ntibabone umwanya. 2023-06-25 09:40:28 223 1 3 education coursera 72-93

Data fields

id
source_id
source
phrase
timestamp
user_id
validation_state
validation_score
domain
source_files
str_ranges
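
To make the field list concrete, here is a minimal sketch of reading such a TSV file into dictionaries. It assumes the file is tab-separated, has no header row, and stores its columns in the order listed above; none of this is confirmed by the page, and `read_rows` is a hypothetical helper name. The sample line is the data instance shown earlier, with the tabs restored by hand.

```python
import csv
import io

# Column order assumed to match the field list above; the file is assumed
# to be tab-separated with no header row.
FIELDS = [
    "id", "source_id", "source", "phrase", "timestamp", "user_id",
    "validation_state", "validation_score", "domain", "source_files",
    "str_ranges",
]

def read_rows(handle):
    """Yield one dict per TSV line, keyed by FIELDS."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for values in reader:
        if len(values) != len(FIELDS):
            continue  # skip malformed lines rather than guess at alignment
        yield dict(zip(FIELDS, values))

# The sample instance from this page, with tabs restored by hand.
sample = "\t".join([
    "118347", "103384",
    "And their ideas was that the teachers just didnt care and had no time for them.",
    "Kandi igitekerezo cyabo nuko abarimu batabitayeho gusa kandi ntibabone umwanya.",
    "2023-06-25 09:40:28", "223", "1", "3", "education", "coursera", "72-93",
]) + "\n"

rows = list(read_rows(io.StringIO(sample)))
print(rows[0]["source_files"], rows[0]["validation_score"])
```

For a real file, the same function can be called on `open(path, encoding="utf-8", newline="")`. If the file turns out to carry a header row, `csv.DictReader` would be the simpler choice.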

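Data splits

Training data: 58251
Validation data: 2456
Test data: 1060

Data preprocessing

Data splitting: To create the test set, every data source contributes an equal number of sentences to the test dataset. The sentence-length distribution of the test set is similar to that of the whole dataset. After the test set is selected, the remaining data is split into training and validation sets using sklearn's train_test_split.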
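
The splitting procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' script: `split_dataset`, `per_source`, and `val_fraction` are hypothetical names and values, the source of each row is assumed to be its `source_files` field, and a plain shuffle stands in for sklearn's `train_test_split` so that the sketch has no dependencies. Matching the sentence-length distribution is not attempted here.

```python
import random
from collections import defaultdict

def split_dataset(rows, per_source, val_fraction, seed=0):
    """Return (train, val, test) lists.

    rows: list of dicts with a "source_files" key.
    per_source: number of test rows drawn from every source (hypothetical).
    val_fraction: share of the non-test rows used for validation.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["source_files"]].append(row)

    test, rest = [], []
    for source in sorted(by_source):
        group = by_source[source][:]
        rng.shuffle(group)
        # Every source contributes the same number of test sentences.
        test.extend(group[:per_source])
        rest.extend(group[per_source:])

    # Stand-in for sklearn's train_test_split on the remaining rows.
    rng.shuffle(rest)
    n_val = round(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test

# Toy data: three sources with ten rows each.
toy = [{"id": i, "source_files": s}
       for i, s in enumerate(["coursera", "atingi", "wikipedia"] * 10)]
train, val, test = split_dataset(toy, per_source=2, val_fraction=0.25)
print(len(train), len(val), len(test))
```

With scikit-learn installed, the last step could instead call `train_test_split(rest, test_size=val_fraction, random_state=seed)`, which returns the training and validation lists in that order.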

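Data collection

Collection process: Monolingual source sentences were obtained by web-scraping several websites that contain English sentences.
Data sources: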
- Coursera
- Atingi
- Wikipedia

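Dataset creation

After the monolingual dataset was collected, human translators were hired to translate the collected sentences. To ensure quality, each sentence was translated multiple times, and every resulting translation was assigned a validation_score that was used to select the best translation. The test dataset was further revised to remove or correct sentences with translation errors.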
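
Selecting the best translation by validation_score might look like the sketch below. It assumes that rows sharing a `source_id` are competing translations of the same source sentence, that validation_score is an integer, and that a higher score is better; the page confirms none of these. `best_translations` is a hypothetical helper, and the candidate rows are made-up placeholders rather than real data.

```python
def best_translations(rows):
    """Keep, for each source_id, the row with the highest validation_score.

    On a tie, the row seen first is kept.
    """
    best = {}
    for row in rows:
        key = row["source_id"]
        score = int(row["validation_score"])
        if key not in best or score > int(best[key]["validation_score"]):
            best[key] = row
    return best

# Placeholder candidates, not taken from the dataset.
candidates = [
    {"source_id": "s1", "phrase": "translation A", "validation_score": "1"},
    {"source_id": "s1", "phrase": "translation B", "validation_score": "3"},
    {"source_id": "s2", "phrase": "translation C", "validation_score": "2"},
]

best = best_translations(candidates)
for source_id, row in best.items():
    print(source_id, row["phrase"], row["validation_score"])
```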
