bot-yaya/undl_zh2en_aligned

Name: bot-yaya/undl_zh2en_aligned
Creator: bot-yaya
Published: 2024-07-09 10:11:44
License: No description provided

Hugging Face. Updated 2024-07-09. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/bot-yaya/undl_zh2en_aligned

Resource description:

This dataset is a paragraph-level Chinese-English aligned parallel corpus drawn from the United Nations Digital Library, suitable for training machine translation models. The source material is human-written. Each record pairs a source-language text with its target-language text and carries the associated rating fields. The corpus has been evaluated with BLEU scores using argostranslate; the reported results cover the number of paragraphs per language, the average token count, and BLEU scores at several levels. The data ships as a single train split of 15,331,650 samples (8,884,444,751 bytes).

Provider:

bot-yaya

Original information summary

Dataset overview

Configuration

Config name: default
Data files:
- Split: train
- Path: data/train-*
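
As a rough sketch of how this configuration might be consumed, the snippet below streams the train split with the Hugging Face `datasets` library rather than fetching every file first. The helper names `use_mirror` and `stream_train` are illustrative and not part of the dataset. Pointing `HF_ENDPOINT` at hf-mirror.com is an assumption based on the download link on this page, and it is optional.

```python
import os

REPO_ID = "bot-yaya/undl_zh2en_aligned"
CONFIG_NAME = "default"
SPLIT = "train"
MIRROR_ENDPOINT = "https://hf-mirror.com"  # assumption: host of the download link


def use_mirror(endpoint: str = MIRROR_ENDPOINT) -> None:
    """Point the Hugging Face client at a mirror (hypothetical helper).

    HF_ENDPOINT is read when huggingface_hub is first imported, so call
    this before importing `datasets`.
    """
    os.environ["HF_ENDPOINT"] = endpoint


def stream_train():
    """Return an iterable over the train split without a full download."""
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(REPO_ID, CONFIG_NAME, split=SPLIT, streaming=True)


if __name__ == "__main__":
    use_mirror()
    # Peek at the first three rows only; the full split is very large.
    for i, row in enumerate(stream_train()):
        print(row["src"], row["dst"], row["src_text"][:80])
        if i >= 2:
            break
```

Streaming is chosen here because the split holds more than 15 million samples; a plain `load_dataset` call without `streaming=True` would download and cache the whole corpus.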

Dataset information

Features:
- Name: record
  - Data type: string
- Name: clean_para_index_set_pair
  - Data type: string
- Name: src
  - Data type: string
- Name: dst
  - Data type: string
- Name: src_text
  - Data type: string
- Name: dst_text
  - Data type: string
- Name: src_rate
  - Data type: float64
- Name: dst_rate
  - Data type: float64
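
The feature list above can be turned into a small row check. The sketch below is a minimal illustration, not the author's tooling: `schema_problems` and `keep_row` are hypothetical helpers, and the sample row is made up. This page does not say how `src_rate` and `dst_rate` are scaled, so the filter assumes that higher values are better and uses an arbitrary threshold.

```python
from typing import Any, Dict, List

# Column names and Python types implied by the feature list.
EXPECTED_FIELDS: Dict[str, type] = {
    "record": str,
    "clean_para_index_set_pair": str,
    "src": str,
    "dst": str,
    "src_text": str,
    "dst_text": str,
    "src_rate": float,
    "dst_rate": float,
}


def schema_problems(row: Dict[str, Any]) -> List[str]:
    """List missing or mistyped fields in one row (empty list means OK)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            got = type(row[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    return problems


def keep_row(row: Dict[str, Any], min_rate: float = 0.5) -> bool:
    """Keep a pair only if both rates reach min_rate.

    ASSUMPTION: higher rate means a better paragraph; the threshold is
    arbitrary and should be tuned against real data.
    """
    return row["src_rate"] >= min_rate and row["dst_rate"] >= min_rate


if __name__ == "__main__":
    # Made-up placeholder values, for illustration only.
    sample = {
        "record": "example-record",
        "clean_para_index_set_pair": "0|0",
        "src": "zh",
        "dst": "en",
        "src_text": "示例段落。",
        "dst_text": "Example paragraph.",
        "src_rate": 0.9,
        "dst_rate": 0.8,
    }
    print(schema_problems(sample), keep_row(sample))
```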

Splits

Name: train
- Bytes: 8884444751
- Samples: 15331650

Size

Download size: 2443622169
Dataset size: 8884444751
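
To put the raw byte counts in perspective, the following sketch derives a few summary figures from them. The constants are copied from the listing; the function name `size_summary` is illustrative, and treating the sizes as plain byte counts is an assumption.

```python
# Figures copied from the split and size listing (assumed to be bytes).
TRAIN_BYTES = 8_884_444_751
TRAIN_SAMPLES = 15_331_650
DOWNLOAD_BYTES = 2_443_622_169


def size_summary() -> dict:
    """Derive per-sample and on-disk figures from the listed sizes."""
    gib = 2**30
    return {
        "avg_bytes_per_sample": TRAIN_BYTES / TRAIN_SAMPLES,
        "download_to_dataset_ratio": DOWNLOAD_BYTES / TRAIN_BYTES,
        "download_gib": DOWNLOAD_BYTES / gib,
        "dataset_gib": TRAIN_BYTES / gib,
    }


if __name__ == "__main__":
    for key, value in size_summary().items():
        print(f"{key}: {value:.3f}")
```

Under these figures a sample averages roughly 580 bytes, and the download is a little over a quarter of the dataset size, about 2.3 GiB against about 8.3 GiB.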
