SPRINGLab/BPCC_cleaned
Source: Hugging Face. Updated: 2024-11-12. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/SPRINGLab/BPCC_cleaned
Description:
---
dataset_info:
  features:
  - name: src_lang
    dtype:
      class_label:
        names:
          '0': eng_Latn
          '1': ben_Beng
          '2': guj_Gujr
          '3': hin_Deva
          '4': kan_Knda
          '5': mal_Mlym
          '6': mar_Deva
          '7': tam_Taml
          '8': tel_Telu
  - name: tgt_lang
    dtype:
      class_label:
        names:
          '0': eng_Latn
          '1': ben_Beng
          '2': guj_Gujr
          '3': hin_Deva
          '4': kan_Knda
          '5': mal_Mlym
          '6': mar_Deva
          '7': tam_Taml
          '8': tel_Telu
  - name: src_text
    dtype: string
  - name: tgt_text
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_bytes: 1698080949
    num_examples: 3990964
  download_size: 817154884
  dataset_size: 1698080949
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- translation
language:
- bn
- hi
- ta
- te
- mr
- ml
- kn
- gu
size_categories:
- 1M<n<10M
---
A curated subset of [Bharat Parallel Corpus Collection](https://bpcc.ai4bharat.org/) (BPCC) for 8 Indian languages.
Translation pairs are filtered by LaBSE score (> 0.9) and further preprocessed.
Useful for training high-quality translation models.
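
The score-based filtering described above can be illustrated with a minimal sketch. It assumes the score is the cosine similarity between the sentence embeddings of the source and target text, which is a common way to compute a LaBSE score; the card does not state how the score was actually produced. The toy vectors, the `src_vec`/`tgt_vec` keys, and the helper names are hypothetical and stand in for real embeddings.

```python
import math

THRESHOLD = 0.9  # cutoff stated in the dataset description


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=THRESHOLD):
    """Keep pairs whose embedding similarity is strictly above the threshold."""
    kept = []
    for pair in pairs:
        score = cosine_similarity(pair["src_vec"], pair["tgt_vec"])
        if score > threshold:
            kept.append(
                {
                    "src_text": pair["src_text"],
                    "tgt_text": pair["tgt_text"],
                    "score": score,
                }
            )
    return kept


# Hypothetical 3-dimensional vectors standing in for real sentence embeddings.
toy_pairs = [
    {"src_text": "Hello", "tgt_text": "नमस्ते",
     "src_vec": [1.0, 0.0, 0.0], "tgt_vec": [0.9, 0.1, 0.0]},
    {"src_text": "Hello", "tgt_text": "धन्यवाद",
     "src_vec": [1.0, 0.0, 0.0], "tgt_vec": [0.2, 0.9, 0.1]},
]

kept = filter_pairs(toy_pairs)
for row in kept:
    print(row["src_text"], "->", row["tgt_text"], round(row["score"], 3))
```

Only the first toy pair passes the cutoff, since its vectors point in nearly the same direction. In a real pipeline the vectors would come from an embedding model rather than being written by hand.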
Provider:
SPRINGLab
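Summary of original information
Dataset overview
Dataset features
- src_lang: source language code, a class label with the following values:
  - 0: eng_Latn
  - 1: ben_Beng
  - 2: guj_Gujr
  - 3: hin_Deva
  - 4: kan_Knda
  - 5: mal_Mlym
  - 6: mar_Deva
  - 7: tam_Taml
  - 8: tel_Telu
- tgt_lang: target language code, with the same class labels as src_lang.
- src_text: source text, of type string.
- tgt_text: target text, of type string.
- score: score, of type float64.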
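
Because `src_lang` and `tgt_lang` are stored as integer class labels, a row usually needs decoding before the language codes are readable. The following is a minimal sketch in plain Python. The label order is copied from the feature list, while `decode_row` and the sample row are hypothetical and for illustration only.

```python
# Label order copied from the dataset's class_label definition (index = label id).
LANG_NAMES = [
    "eng_Latn",
    "ben_Beng",
    "guj_Gujr",
    "hin_Deva",
    "kan_Knda",
    "mal_Mlym",
    "mar_Deva",
    "tam_Taml",
    "tel_Telu",
]


def decode_row(row):
    """Return a copy of the row with integer language labels replaced by codes."""
    decoded = dict(row)
    decoded["src_lang"] = LANG_NAMES[row["src_lang"]]
    decoded["tgt_lang"] = LANG_NAMES[row["tgt_lang"]]
    return decoded


# Hypothetical row for illustration; not taken from the dataset.
sample_row = {
    "src_lang": 0,
    "tgt_lang": 3,
    "src_text": "Good morning",
    "tgt_text": "सुप्रभात",
    "score": 0.95,
}

decoded = decode_row(sample_row)
print(decoded["src_lang"], "->", decoded["tgt_lang"])
```

When the data is loaded with the Hugging Face `datasets` library, the `ClassLabel` feature's `int2str` method should give the same mapping without a hand-written list.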
Dataset splits
- train: training set
  - Size: 1698080949 bytes
  - Number of examples: 3990964
Dataset size
- Download size: 817154884 bytes
- Dataset size: 1698080949 bytes
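
The raw byte counts are easier to plan around once converted. The short calculation below uses only the figures reported above; the variable names are illustrative, and the results are rough estimates of storage needs rather than measured values.

```python
NUM_BYTES = 1698080949       # reported dataset size of the train split
NUM_EXAMPLES = 3990964       # reported number of training examples
DOWNLOAD_BYTES = 817154884   # reported download size

MIB = 1024 ** 2

dataset_mib = NUM_BYTES / MIB
download_mib = DOWNLOAD_BYTES / MIB
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"dataset size:     {dataset_mib:.1f} MiB")
print(f"download size:    {download_mib:.1f} MiB")
print(f"avg per example:  {avg_bytes_per_example:.1f} bytes")
print(f"download/dataset: {download_ratio:.2f}")
```

This works out to roughly 1.58 GiB of data, a download of about 779 MiB (a little under half the dataset size), and an average of around 425 bytes per translation pair.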



