liaad/machine_translation_dataset_detokenized
Source: Hugging Face. Updated 2024-04-08; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/liaad/machine_translation_dataset_detokenized
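As a rough sketch of how the data might be fetched programmatically, the snippet below uses the Hugging Face `datasets` library. It assumes that `datasets` is installed and that the repository is reachable on the Hub under the ID shown in the title; `load_config` is an illustrative helper name, not part of any library. Streaming is enabled by default so that the larger configurations are not downloaded in full.

```python
REPO_ID = "liaad/machine_translation_dataset_detokenized"

# Configuration names as listed in the dataset card.
CONFIGS = ("journalistic", "legal", "literature", "politics", "social_media", "web")


def load_config(config_name, streaming=True):
    """Load the train split of one configuration (hypothetical helper)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIGS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


# Example usage (requires network access), left commented out on purpose:
# for i, row in enumerate(load_config("literature")):
#     print(row["label"], row["text"][:80])
#     if i == 2:
#         break
```

Each record is expected to carry the two fields described in the card, `text` and `label`.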
Resource description:
---
dataset_info:
- config_name: journalistic
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 1283261148
    num_examples: 1845205
  download_size: 864052343
  dataset_size: 1283261148
- config_name: legal
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': PT-PT
          '1': PT-BR
  splits:
  - name: train
    num_bytes: 148927683
    num_examples: 477903
  download_size: 91110976
  dataset_size: 148927683
- config_name: literature
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 55646572
    num_examples: 225
  download_size: 19697267
  dataset_size: 55646572
- config_name: politics
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 367487667
    num_examples: 14328
  download_size: 200081078
  dataset_size: 367487667
- config_name: social_media
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': pt-PT
          '1': pt-BR
  splits:
  - name: train
    num_bytes: 371972738
    num_examples: 3074774
  download_size: 266674007
  dataset_size: 371972738
- config_name: web
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': PT-PT
          '1': PT-BR
  splits:
  - name: train
    num_bytes: 1372865174
    num_examples: 279555
  download_size: 705408533
  dataset_size: 1372865174
configs:
- config_name: journalistic
  data_files:
  - split: train
    path: journalistic/train-*
- config_name: legal
  data_files:
  - split: train
    path: legal/train-*
- config_name: literature
  data_files:
  - split: train
    path: literature/train-*
- config_name: politics
  data_files:
  - split: train
    path: politics/train-*
- config_name: social_media
  data_files:
  - split: train
    path: social_media/train-*
- config_name: web
  data_files:
  - split: train
    path: web/train-*
---
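The size metadata above lends itself to a few quick derived figures. The sketch below copies `num_examples`, `dataset_size`, and `download_size` from the card into a plain dictionary and computes the total number of examples and the average bytes per example for each configuration. The names `CARD_STATS` and `avg_bytes_per_example` are illustrative and not part of the dataset.

```python
# (num_examples, dataset_size in bytes, download_size in bytes), copied from the card.
CARD_STATS = {
    "journalistic": (1845205, 1283261148, 864052343),
    "legal": (477903, 148927683, 91110976),
    "literature": (225, 55646572, 19697267),
    "politics": (14328, 367487667, 200081078),
    "social_media": (3074774, 371972738, 266674007),
    "web": (279555, 1372865174, 705408533),
}


def avg_bytes_per_example(config_name):
    """Average uncompressed size of one example, in bytes."""
    num_examples, dataset_size, _ = CARD_STATS[config_name]
    return dataset_size / num_examples


total_examples = sum(n for n, _, _ in CARD_STATS.values())
total_bytes = sum(size for _, size, _ in CARD_STATS.values())

# Print configurations from longest to shortest average example.
for name in sorted(CARD_STATS, key=avg_bytes_per_example, reverse=True):
    print(f"{name:<13} {avg_bytes_per_example(name):>12,.0f} bytes/example")
print(f"total: {total_examples:,} examples, {total_bytes:,} bytes")
```

By this measure the literature configuration, with only 225 examples, holds by far the longest texts, while social_media holds the shortest.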
Provider:
liaad
Summary of original information
Dataset overview
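1. Configuration names and features
1.1 News (journalistic)
- Features:
  - text: string
  - label: class label with two language-variety codes (0: pt-PT, 1: pt-BR)
- Training split:
  - Size in bytes: 1283261148
  - Number of examples: 1845205
  - Download size (bytes): 864052343
  - Dataset size (bytes): 1283261148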
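1.2 Legal (legal)
- Features:
  - text: string
  - label: class label with two language-variety codes (0: PT-PT, 1: PT-BR)
- Training split:
  - Size in bytes: 148927683
  - Number of examples: 477903
  - Download size (bytes): 91110976
  - Dataset size (bytes): 148927683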
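1.3 Literature (literature)
- Features:
  - text: string
  - label: class label with two language-variety codes (0: pt-PT, 1: pt-BR)
- Training split:
  - Size in bytes: 55646572
  - Number of examples: 225
  - Download size (bytes): 19697267
  - Dataset size (bytes): 55646572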
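1.4 Politics (politics)
- Features:
  - text: string
  - label: class label with two language-variety codes (0: pt-PT, 1: pt-BR)
- Training split:
  - Size in bytes: 367487667
  - Number of examples: 14328
  - Download size (bytes): 200081078
  - Dataset size (bytes): 367487667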
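1.5 Social media (social_media)
- Features:
  - text: string
  - label: class label with two language-variety codes (0: pt-PT, 1: pt-BR)
- Training split:
  - Size in bytes: 371972738
  - Number of examples: 3074774
  - Download size (bytes): 266674007
  - Dataset size (bytes): 371972738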
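1.6 Web (web)
- Features:
  - text: string
  - label: class label with two language-variety codes (0: PT-PT, 1: PT-BR)
- Training split:
  - Size in bytes: 1372865174
  - Number of examples: 279555
  - Download size (bytes): 705408533
  - Dataset size (bytes): 1372865174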
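One detail in the listing above is easy to miss: the legal and web configurations spell the label names in upper case (PT-PT, PT-BR), whereas the other four use pt-PT and pt-BR. When combining configurations it may be convenient to normalize them. The following is a minimal sketch, assuming the integer IDs denote the same variety in every configuration, as the card indicates. `LABEL_NAMES`, `normalize_label`, and `id_to_variety` are hypothetical names introduced here.

```python
# Label names per configuration, indexed by integer label ID, as given in the card.
LABEL_NAMES = {
    "journalistic": ["pt-PT", "pt-BR"],
    "legal": ["PT-PT", "PT-BR"],
    "literature": ["pt-PT", "pt-BR"],
    "politics": ["pt-PT", "pt-BR"],
    "social_media": ["pt-PT", "pt-BR"],
    "web": ["PT-PT", "PT-BR"],
}


def normalize_label(name):
    """Return a variety code as lowercase language plus uppercase region."""
    language, region = name.split("-")
    return f"{language.lower()}-{region.upper()}"


def id_to_variety(config_name, label_id):
    """Map an integer label from a given configuration to a normalized code."""
    return normalize_label(LABEL_NAMES[config_name][label_id])


for config_name in LABEL_NAMES:
    print(config_name, [id_to_variety(config_name, i) for i in (0, 1)])
```

With this mapping, label 0 resolves to pt-PT and label 1 to pt-BR regardless of which configuration a record came from.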
2. Data file paths
- News (journalistic): journalistic/train-*
- Legal (legal): legal/train-*
- Literature (literature): literature/train-*
- Politics (politics): politics/train-*
- Social media (social_media): social_media/train-*
- Web (web): web/train-*
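Each path above is a glob pattern rather than a single file name. As an illustration of how such patterns select files, the sketch below matches a few made-up shard names against them with Python's `fnmatch` module. The file names in `example_files` are hypothetical and only stand in for whatever shards the repository actually contains, and `fnmatch` semantics are used as an approximation of how the Hub resolves these patterns.

```python
from fnmatch import fnmatchcase

# Glob pattern for the train split of each configuration, as given in the card.
DATA_FILE_PATTERNS = {
    "journalistic": "journalistic/train-*",
    "legal": "legal/train-*",
    "literature": "literature/train-*",
    "politics": "politics/train-*",
    "social_media": "social_media/train-*",
    "web": "web/train-*",
}


def config_for_path(path):
    """Return the configuration whose pattern matches `path`, or None."""
    for config_name, pattern in DATA_FILE_PATTERNS.items():
        if fnmatchcase(path, pattern):
            return config_name
    return None


# Hypothetical file names, for illustration only.
example_files = [
    "web/train-00000-of-00003.parquet",
    "social_media/train-00001-of-00002.parquet",
    "README.md",
]
for path in example_files:
    print(path, "->", config_for_path(path))
```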



