sawradip/bn-translation-mega-raw-noisy
Source: Hugging Face | Updated: 2024-03-28 | Indexed: 2024-06-11
Download link:
https://hf-mirror.com/datasets/sawradip/bn-translation-mega-raw-noisy
Resource description:
---
dataset_info:
  features:
  - name: bn_raw
    dtype: string
  - name: en_raw
    dtype: string
  splits:
  - name: alt
    num_bytes: 10242909
    num_examples: 20106
  - name: amadercat
    num_bytes: 331783
    num_examples: 1781
  - name: anuvaad
    num_bytes: 318844740
    num_examples: 1001740
  - name: banglanmt
    num_bytes: 680601914
    num_examples: 2659723
  - name: ptb
    num_bytes: 659453
    num_examples: 1313
  - name: ilmpc
    num_bytes: 39492292
    num_examples: 324366
  - name: xlent
    num_bytes: 111488860
    num_examples: 1616537
  - name: nllb
    num_bytes: 11409482018
    num_examples: 62006746
  - name: google
    num_bytes: 37571933
    num_examples: 191426
  - name: bpcc_daily
    num_bytes: 1462271
    num_examples: 8458
  - name: bpcc_icli
    num_bytes: 43821553
    num_examples: 123766
  - name: bpcc_massive
    num_bytes: 2233668
    num_examples: 16492
  - name: bpcc_nllb
    num_bytes: 2572847096
    num_examples: 13580532
  - name: bpcc_samantar_v1
    num_bytes: 889529481
    num_examples: 2946291
  - name: bpcc_samantar_v2
    num_bytes: 2549472317
    num_examples: 16055075
  - name: bpcc_wiki
    num_bytes: 18726914
    num_examples: 47994
  download_size: 9530806631
  dataset_size: 18686809202
configs:
- config_name: default
  data_files:
  - split: alt
    path: data/alt-*
  - split: amadercat
    path: data/amadercat-*
  - split: anuvaad
    path: data/anuvaad-*
  - split: banglanmt
    path: data/banglanmt-*
  - split: ptb
    path: data/ptb-*
  - split: ilmpc
    path: data/ilmpc-*
  - split: xlent
    path: data/xlent-*
  - split: nllb
    path: data/nllb-*
  - split: google
    path: data/google-*
  - split: bpcc_daily
    path: data/bpcc_daily-*
  - split: bpcc_icli
    path: data/bpcc_icli-*
  - split: bpcc_massive
    path: data/bpcc_massive-*
  - split: bpcc_nllb
    path: data/bpcc_nllb-*
  - split: bpcc_samantar_v1
    path: data/bpcc_samantar_v1-*
  - split: bpcc_samantar_v2
    path: data/bpcc_samantar_v2-*
  - split: bpcc_wiki
    path: data/bpcc_wiki-*
---
Provider:
sawradip
Summary of original information
Dataset overview
Dataset features
- bn_raw: string
- en_raw: string
Dataset splits
- alt: 20,106 examples, 10,242,909 bytes
- amadercat: 1,781 examples, 331,783 bytes
- anuvaad: 1,001,740 examples, 318,844,740 bytes
- banglanmt: 2,659,723 examples, 680,601,914 bytes
- ptb: 1,313 examples, 659,453 bytes
- ilmpc: 324,366 examples, 39,492,292 bytes
- xlent: 1,616,537 examples, 111,488,860 bytes
- nllb: 62,006,746 examples, 11,409,482,018 bytes
- google: 191,426 examples, 37,571,933 bytes
- bpcc_daily: 8,458 examples, 1,462,271 bytes
- bpcc_icli: 123,766 examples, 43,821,553 bytes
- bpcc_massive: 16,492 examples, 2,233,668 bytes
- bpcc_nllb: 13,580,532 examples, 2,572,847,096 bytes
- bpcc_samantar_v1: 2,946,291 examples, 889,529,481 bytes
- bpcc_samantar_v2: 16,055,075 examples, 2,549,472,317 bytes
- bpcc_wiki: 47,994 examples, 18,726,914 bytes
Dataset size
- Download size: 9,530,806,631 bytes
- Total dataset size: 18,686,809,202 bytes
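The total dataset size can be cross-checked against the per-split figures: summing the split byte counts should reproduce it. The following is a minimal Python sketch of that check. The names `SPLITS` and `totals` are illustrative and do not come from the dataset card; the numbers are copied from the split list.

```python
# (num_examples, num_bytes) per split, copied from the dataset card.
SPLITS = {
    "alt": (20_106, 10_242_909),
    "amadercat": (1_781, 331_783),
    "anuvaad": (1_001_740, 318_844_740),
    "banglanmt": (2_659_723, 680_601_914),
    "ptb": (1_313, 659_453),
    "ilmpc": (324_366, 39_492_292),
    "xlent": (1_616_537, 111_488_860),
    "nllb": (62_006_746, 11_409_482_018),
    "google": (191_426, 37_571_933),
    "bpcc_daily": (8_458, 1_462_271),
    "bpcc_icli": (123_766, 43_821_553),
    "bpcc_massive": (16_492, 2_233_668),
    "bpcc_nllb": (13_580_532, 2_572_847_096),
    "bpcc_samantar_v1": (2_946_291, 889_529_481),
    "bpcc_samantar_v2": (16_055_075, 2_549_472_317),
    "bpcc_wiki": (47_994, 18_726_914),
}

# Total dataset size as stated on the card.
DATASET_SIZE = 18_686_809_202


def totals(splits):
    """Return (total_examples, total_bytes) summed over all splits."""
    total_examples = sum(n for n, _ in splits.values())
    total_bytes = sum(b for _, b in splits.values())
    return total_examples, total_bytes


total_examples, total_bytes = totals(SPLITS)
# Split with the most examples.
largest_split = max(SPLITS, key=lambda name: SPLITS[name][0])

print(f"examples: {total_examples:,}")
print(f"bytes:    {total_bytes:,} (matches card: {total_bytes == DATASET_SIZE})")
print(f"largest split by examples: {largest_split}")
```

The download size (9,530,806,631 bytes) is smaller than the total because it measures the stored files rather than the in-memory dataset.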
Configuration
- config_name: default
- data_files: file paths for each split, in the form data/<split>-*
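Because the default configuration maps each split name to its own set of files, a single split can be requested by name. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository is reachable on the Hub; `load_split` and `data_file_pattern` are hypothetical helpers written for illustration, not part of the dataset. Streaming is used so that nothing close to the full 9.5 GB download is fetched up front.

```python
REPO_ID = "sawradip/bn-translation-mega-raw-noisy"

# Split names as listed in the dataset card's default config.
SPLIT_NAMES = (
    "alt", "amadercat", "anuvaad", "banglanmt", "ptb", "ilmpc", "xlent",
    "nllb", "google", "bpcc_daily", "bpcc_icli", "bpcc_massive",
    "bpcc_nllb", "bpcc_samantar_v1", "bpcc_samantar_v2", "bpcc_wiki",
)


def data_file_pattern(split):
    """Return the glob the default config uses for a split, e.g. data/alt-*."""
    if split not in SPLIT_NAMES:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLIT_NAMES}")
    return f"data/{split}-*"


def load_split(split, streaming=True):
    """Load one split by name (hypothetical helper).

    The import is deferred so the module can be used for name validation
    even where the `datasets` library is not installed.
    """
    data_file_pattern(split)  # validates the split name before any network call
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

With network access, `next(iter(load_split("amadercat")))` would be expected to yield a record with the two string fields `bn_raw` and `en_raw` described in the features list.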



