barunsaha/aya_dataset_ben_translated
Hugging Face | Updated 2024-03-03 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/barunsaha/aya_dataset_ben_translated
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: inputs
    dtype: string
  - name: targets
    dtype: string
  - name: language
    dtype: string
  - name: language_code
    dtype: string
  - name: annotation_type
    dtype: string
  - name: user_id
    dtype: string
  splits:
  - name: train
    num_bytes: 11918662
    num_examples: 6633
  - name: test
    num_bytes: 308222
    num_examples: 250
  download_size: 4492541
  dataset_size: 12226884
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- question-answering
language:
- bn
pretty_name: (Subset of) Aya dataset translated to Bengali
size_categories:
- 1K<n<10K
---
`aya_dataset_ben_translated` is a subset of the [aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), with some modifications. The data points that were originally in Bengali (as indicated by the `language` or `language_code` columns) are retained unchanged. The English and Hindi data points are translated into Bengali using the Google Cloud Translation API. All columns from the original dataset are retained.
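As a minimal sketch of how the dataset might be loaded and inspected, the snippet below uses the Hugging Face `datasets` library and tallies rows by `language_code`. The repository ID and column names come from the metadata above; the helper names `load_split` and `count_by_language` are hypothetical. The card does not say which `language_code` values the translated rows carry, so inspect the tally rather than assume particular codes.

```python
from collections import Counter
from typing import Iterable, Mapping

# Repository ID taken from the dataset page.
REPO_ID = "barunsaha/aya_dataset_ben_translated"


def load_split(split: str = "train"):
    """Load one split ("train" or "test") from the Hugging Face Hub.

    Needs `pip install datasets` and network access. The import is lazy so
    the counting helper below can be used without the library installed.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


def count_by_language(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rows per value of the `language_code` column."""
    return Counter(row["language_code"] for row in rows)


if __name__ == "__main__":
    train = load_split("train")
    print(train.column_names)
    print(count_by_language(train))
```

Running the script prints the column names and a per-code row count for the train split.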
Only a handful of the inaccuracies introduced by translation have been fixed so far, so the dataset may still be noisy. This is particularly true for coding-related questions and answers. Some non-Bengali characters also remain in the text, and potential duplicates from the original dataset have been retained.
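The two kinds of noise mentioned above (non-Bengali characters and duplicates) could be screened for with a sketch like the following. It is not part of the dataset's tooling: the helper names `bengali_ratio` and `find_duplicates` are hypothetical. Treating the Unicode Bengali block (U+0980 to U+09FF) as the test for "Bengali", and treating rows as duplicates when `inputs` and `targets` match after whitespace normalization, are both assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Mapping, Tuple

# Unicode "Bengali" block: U+0980 through U+09FF.
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def bengali_ratio(text: str) -> float:
    """Fraction of letters in `text` that fall in the Bengali block.

    Only characters for which str.isalpha() is true are counted, so digits,
    punctuation, whitespace, and combining vowel signs are ignored.
    Returns 1.0 when the text has no letters (nothing to flag).
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 1.0
    bengali = sum(1 for c in letters if BENGALI_START <= ord(c) <= BENGALI_END)
    return bengali / len(letters)


def find_duplicates(rows: Iterable[Mapping[str, str]]) -> List[Tuple[int, int]]:
    """Return (first_index, duplicate_index) pairs for repeated rows.

    Two rows count as duplicates when their `inputs` and `targets` are
    equal after collapsing runs of whitespace.
    """
    seen: Dict[Tuple[str, str], int] = {}
    dupes: List[Tuple[int, int]] = []
    for i, row in enumerate(rows):
        key = (" ".join(row["inputs"].split()), " ".join(row["targets"].split()))
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes
```

A low `bengali_ratio` on `inputs` or `targets` is only a hint: code snippets, names, and URLs legitimately contain Latin characters, so flagged rows are best reviewed by hand rather than dropped automatically.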
Provider:
barunsaha
Summary of original information
Dataset overview
Dataset information
- License: Apache-2.0
- Features:
  - inputs: string
  - targets: string
  - language: string
  - language_code: string
  - annotation_type: string
  - user_id: string
- Splits:
  - train: 11918662 bytes, 6633 examples
  - test: 308222 bytes, 250 examples
- Download size: 4492541 bytes
- Dataset size: 12226884 bytes
- Configs:
  - default:
    - train: path data/train-*
    - test: path data/test-*
- Task category: question answering
- Language: Bengali
- Name: (Subset of) Aya dataset translated to Bengali
- Size category: 1K<n<10K
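Dataset description
aya_dataset_ben_translated is a subset of aya_dataset that retains the original Bengali data points and translates the English and Hindi data points into Bengali. The dataset may contain errors introduced by translation, particularly in coding-related questions and answers. The text may also contain some non-Bengali characters, and potential duplicates from the original dataset are retained.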



