bot-yaya/rework_undl_text
Source: Hugging Face | Updated: 2024-07-09 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/bot-yaya/rework_undl_text
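Resource overview:
This dataset is a parallel corpus crawled from the United Nations Digital Library ODS, spanning 2000 to 2023. The source data links may not be updated consistently. Preprocessing includes denoising and text conversion. The older version of the data lost part of its content because of improperly handled parameters during processing, so the newer version is recommended.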
The dataset contains a parallel corpus crawled from the United Nations Digital Library ODS, covering the period from 2000 to 2023. Each record holds text in Arabic (ar), Chinese (zh), English (en), French (fr), Russian (ru), Spanish (es), and German (de), along with a record field (record). There is a single split, train, with 165,840 samples totaling 48,622,457,871 bytes; the download size is 3,906,189,450 bytes. The texts were converted to plain text with the pandoc tool and may need further noise reduction, such as removing separators and table elements, before use in translation and alignment tasks.
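The noise-reduction step mentioned above is not specified in the card, so the following is only a minimal sketch of one possible approach. It assumes that "separators" are lines made up solely of rule or border characters and that "table elements" are pipe-delimited rows, as pandoc's plain-text writer can produce. The name `denoise` and both regular expressions are illustrative, not the dataset author's method.

```python
import re

# Assumption: a separator line contains only rule/border characters
# (dashes, equals signs, underscores, asterisks, plus signs, pipes, colons).
_SEPARATOR = re.compile(r"^[\s\-=_*+|:]+$")
# Assumption: a table row starts and ends with a pipe character.
_TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")


def denoise(text: str) -> str:
    """Drop separator lines and pipe-delimited table rows, collapse blank runs."""
    kept = []
    for line in text.splitlines():
        if not line.strip():
            # Keep at most one blank line as a paragraph break.
            if kept and kept[-1] != "":
                kept.append("")
            continue
        if _SEPARATOR.match(line) or _TABLE_ROW.match(line):
            continue
        kept.append(line.strip())
    return "\n".join(kept).strip()


sample = "Title\n-----\n\n+---+---+\n| a | b |\n+---+---+\n\n\nBody text."
print(denoise(sample))
```

Dropping table rows outright discards their cell text; if that text matters for alignment, extracting the cell contents instead would be the safer choice.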
Provider:
bot-yaya
Summary of original information
Dataset overview
Configuration
- Config name: default
- Data files:
  - Split: train
  - Path: data/train-*
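As a rough illustration of how the single train split might be consumed, the sketch below streams records and yields text pairs for two languages. The helper names `iter_pairs` and `stream_train` are hypothetical. It also assumes that a missing translation appears as an empty or null string, which the card does not confirm. `stream_train` relies on the third-party `datasets` package and network access, so it is defined but not called here.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Column names as listed in the dataset card.
LANGS = ("ar", "zh", "en", "fr", "ru", "es", "de")


def iter_pairs(
    records: Iterable[Dict[str, Optional[str]]], src: str, tgt: str
) -> Iterator[Tuple[str, str, str]]:
    """Yield (record, source text, target text) where both sides are non-empty."""
    if src not in LANGS or tgt not in LANGS:
        raise ValueError(f"unknown language code: {src!r} or {tgt!r}")
    for row in records:
        # Assumption: an absent translation is an empty string or None.
        src_text = (row.get(src) or "").strip()
        tgt_text = (row.get(tgt) or "").strip()
        if src_text and tgt_text:
            yield row.get("record") or "", src_text, tgt_text


def stream_train():
    """Stream the train split without downloading it all up front."""
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("bot-yaya/rework_undl_text", split="train", streaming=True)


# Usage with in-memory stand-in records (not real dataset content):
demo = [
    {"record": "1", "en": "Hello", "fr": "Bonjour"},
    {"record": "2", "en": "", "fr": "Sans anglais"},
]
print(list(iter_pairs(demo, "en", "fr")))
```

With the real data, `iter_pairs(stream_train(), "en", "fr")` would follow the same pattern; streaming avoids materializing the full dataset on disk.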
Dataset information
- Features:
  - Name: ar; data type: string
  - Name: zh; data type: string
  - Name: en; data type: string
  - Name: fr; data type: string
  - Name: ru; data type: string
  - Name: es; data type: string
  - Name: de; data type: string
  - Name: record; data type: string
- Splits:
  - Name: train
  - Size in bytes: 48622457871
  - Number of samples: 165840
- Download size: 3906189450 bytes
- Dataset size: 48622457871 bytes
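
To put the size figures in perspective, the short calculation below derives the average record size and the ratio between dataset size and download size. Reading that ratio as a compression factor is an assumption: the card does not state the storage format of the data files.

```python
# Figures copied from the dataset card.
DATASET_SIZE_BYTES = 48_622_457_871
DOWNLOAD_SIZE_BYTES = 3_906_189_450
NUM_SAMPLES = 165_840

# Average size of one record, covering all seven language columns.
avg_bytes_per_sample = DATASET_SIZE_BYTES / NUM_SAMPLES

# Ratio of in-memory size to download size (assumed to reflect compression).
size_ratio = DATASET_SIZE_BYTES / DOWNLOAD_SIZE_BYTES

print(f"average record size: {avg_bytes_per_sample / 1e3:.0f} kB")
print(f"dataset size / download size: {size_ratio:.2f}")
```

Each record therefore averages roughly 293 kB of text across its language columns, and the full dataset is about 12 times larger than the download.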



