Helsinki-NLP/un_ga

Name: Helsinki-NLP/un_ga
Creator: Helsinki-NLP
Published: 2024-04-02 13:20:41
License: No description available

Source: Hugging Face. Updated: 2024-04-02. Indexed: 2024-04-20.

Download link:

https://hf-mirror.com/datasets/Helsinki-NLP/un_ga

Resource summary:

The UnGa dataset is a multilingual translation dataset covering multiple language pairs, such as Arabic to English and Arabic to Spanish. It consists of translated United Nations documents and was originally compiled into a translation memory by Alexandre Rafalovitch and Robert Dale. The dataset is deprecated; the official United Nations Parallel Corpus is recommended instead.

Provider:

Helsinki-NLP

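Original information summary

Dataset overview

Basic information

Dataset name: UnGa
Languages: Arabic (ar), English (en), Spanish (es), French (fr), Russian (ru), Chinese (zh)
License: unknown
Dataset size: 10K<n<100K
Task type: translation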

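Dataset structure

Configuration names:
- ar_to_en, ar_to_es, ar_to_fr, ar_to_ru, ar_to_zh
- en_to_es, en_to_fr, en_to_ru, en_to_zh
- es_to_fr, es_to_ru, es_to_zh
- fr_to_ru, fr_to_zh
- ru_to_zh
Features:
- id: string
- translation: the paired source-language and target-language text
Splits:
- train: every configuration has 74067 examples; the total size varies by configuration, from 37847734 bytes (en_to_zh) to 72657625 bytes (ar_to_ru)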
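The configuration names follow a regular pattern: one `<source>_to_<target>` name for each unordered pair of the six language codes, taken in alphabetical order. The sketch below rebuilds that list and shows what a record might look like. The `load_pair` helper and the `example_record` layout are illustrative assumptions rather than part of the listing: the helper relies on the Hugging Face `datasets` library, and because the dataset is deprecated it may no longer load.

```python
from itertools import combinations

# The six language codes listed for the dataset.
LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]


def config_names(languages=LANGUAGES):
    """Return one '<src>_to_<tgt>' name per unordered language pair."""
    return [f"{src}_to_{tgt}" for src, tgt in combinations(sorted(languages), 2)]


def load_pair(config, repo="Helsinki-NLP/un_ga"):
    """Hypothetical helper: fetch the train split of one configuration.

    Assumes the `datasets` package is installed and that the deprecated
    repository is still reachable; neither is guaranteed.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset(repo, config, split="train")


# Assumed record layout, based on the listed features: a string id plus a
# translation mapping keyed by language code. The text values are placeholders.
example_record = {"id": "0", "translation": {"ar": "...", "en": "..."}}

if __name__ == "__main__":
    names = config_names()
    print(len(names), "configurations")
    print(names)
```

Generating the names from the language list avoids typing all fifteen by hand and makes it easy to loop over every pair when comparing configurations.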

Dataset details

Training set size:
- ar_to_en: 74067 examples, 53122776 bytes
- ar_to_es: 74067 examples, 55728615 bytes
- ar_to_fr: 74067 examples, 55930802 bytes
- ar_to_ru: 74067 examples, 72657625 bytes
- ar_to_zh: 74067 examples, 48217579 bytes
- en_to_es: 74067 examples, 45358770 bytes
- en_to_fr: 74067 examples, 45560957 bytes
- en_to_ru: 74067 examples, 62287780 bytes
- en_to_zh: 74067 examples, 37847734 bytes
- es_to_fr: 74067 examples, 48166796 bytes
- es_to_ru: 74067 examples, 64893619 bytes
- es_to_zh: 74067 examples, 40453573 bytes
- fr_to_ru: 74067 examples, 65095806 bytes
- fr_to_zh: 74067 examples, 40655760 bytes
- ru_to_zh: 74067 examples, 57382583 bytes
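As a quick check on the per-configuration figures, the following sketch finds the smallest and largest training sets by byte count. The byte values are copied from the list above; the variable names are illustrative only.

```python
# Training-set sizes in bytes, copied from the per-configuration list.
TRAIN_BYTES = {
    "ar_to_en": 53122776,
    "ar_to_es": 55728615,
    "ar_to_fr": 55930802,
    "ar_to_ru": 72657625,
    "ar_to_zh": 48217579,
    "en_to_es": 45358770,
    "en_to_fr": 45560957,
    "en_to_ru": 62287780,
    "en_to_zh": 37847734,
    "es_to_fr": 48166796,
    "es_to_ru": 64893619,
    "es_to_zh": 40453573,
    "fr_to_ru": 65095806,
    "fr_to_zh": 40655760,
    "ru_to_zh": 57382583,
}

# Keys of the configurations with the fewest and the most bytes.
smallest = min(TRAIN_BYTES, key=TRAIN_BYTES.get)
largest = max(TRAIN_BYTES, key=TRAIN_BYTES.get)

if __name__ == "__main__":
    print("smallest:", smallest, TRAIN_BYTES[smallest])
    print("largest:", largest, TRAIN_BYTES[largest])
```

The smallest configuration is en_to_zh and the largest is ar_to_ru, so the byte sizes span 37847734 to 72657625 even though every configuration has the same number of examples.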

Notes

This dataset is deprecated; use the official United Nations Parallel Corpus instead.
