saillab/alpaca_bengali_taco
Source: Hugging Face. Updated 2024-09-20; indexed 2024-06-12.
Download link: https://hf-mirror.com/datasets/saillab/alpaca_bengali_taco
Description:
---
language:
- bn
pretty_name: Bengali alpaca-52k
size_categories:
- 100K<n<1M
---
This repository contains the dataset used for the TaCo paper.
Each record follows the format described in the TaCo paper:
```
{
"instruction": "instruction in xx",
"input": "input in xx",
"output": "Instruction in English: instruction in en ,
Response in English: response in en ,
Response in xx: response in xx "
}
```
Please refer to the paper for more details: [OpenReview](https://openreview.net/forum?id=02MLWBj8HP)
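The three-part `output` field shown above can be split back into its components. The following is a minimal sketch, assuming the literal markers `Instruction in English:`, `Response in English:`, and `Response in <language>:` appear in that order and are separated by commas, as in the template. The function name `split_taco_output` and the returned key names are hypothetical, and the actual records may differ in punctuation or spacing.

```python
import re
from typing import Dict, Optional

# Assumed marker layout, taken from the template in this card.
# The language name in the last marker (e.g. "Bengali") is captured rather
# than hard-coded, because the template only shows the placeholder "xx".
TACO_OUTPUT = re.compile(
    r"Instruction in English:\s*(?P<instruction_en>.*?)\s*,?\s*"
    r"Response in English:\s*(?P<response_en>.*?)\s*,?\s*"
    r"Response in (?P<language>[^:\n]+):\s*(?P<response_xx>.*)",
    re.DOTALL,
)


def split_taco_output(output: str) -> Optional[Dict[str, str]]:
    """Split a TaCo-style output string into its three parts.

    Returns None when the expected markers are not found, so callers can
    count or inspect records that do not follow the assumed layout.
    """
    match = TACO_OUTPUT.search(output)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}
```

Returning `None` instead of raising keeps the helper usable in a filter over the whole dataset, where some machine-translated records may not match the assumed layout.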
**Citation**
If you use our dataset, please cite it as follows:
```
@inproceedings{upadhayay2024taco,
title={TaCo: Enhancing Cross-Lingual Transfer for Low-Resource Languages in {LLM}s through Translation-Assisted Chain-of-Thought Processes},
author={Bibek Upadhayay and Vahid Behzadan},
booktitle={5th Workshop on practical ML for limited/low resource settings, ICLR},
year={2024},
url={https://openreview.net/forum?id=02MLWBj8HP}
}
```
The original dataset [(Alpaca-52K)](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) was translated using Google Translate.
**Copyright and Intended Use**
This dataset is released under CC BY-NC and is intended for academic and research purposes only. Please review the licenses and terms and conditions of Alpaca-52K, Dolly-15K, and Google Cloud Translation before using this dataset for any purpose other than research.
Provider:
saillab
Summary of original information
Dataset overview
Dataset features
- instruction: string.
- input: string.
- output: string.
- id: string.
- text: string.
Dataset splits
- Train: 49601 samples, 277420025.780249 bytes in total.
- Test: 12401 samples, 69359201.21975097 bytes in total.
Dataset size
- Download size: 143428503 bytes.
- Total dataset size: 346779227.0 bytes.
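The split and size figures above can be checked against each other with a little arithmetic. The sketch below uses only the numbers listed in this summary; the variable names are illustrative. The fractional byte counts are consistent with the total size being divided between the splits in proportion to their sample counts, although this card does not state how they were computed.

```python
import math

# Figures copied from the summary above.
TRAIN_SAMPLES = 49601
TEST_SAMPLES = 12401
TRAIN_BYTES = 277420025.780249
TEST_BYTES = 69359201.21975097
TOTAL_BYTES = 346779227.0

total_samples = TRAIN_SAMPLES + TEST_SAMPLES  # 62002
train_fraction = TRAIN_SAMPLES / total_samples  # roughly an 80/20 split

# The per-split sizes should add up to the reported total.
sizes_add_up = math.isclose(TRAIN_BYTES + TEST_BYTES, TOTAL_BYTES, abs_tol=1e-3)

# Assumption being tested: each split size equals the total size scaled by
# that split's share of the samples.
estimated_train_bytes = TOTAL_BYTES * TRAIN_SAMPLES / total_samples
estimated_test_bytes = TOTAL_BYTES * TEST_SAMPLES / total_samples

print(total_samples, round(train_fraction, 3), sizes_add_up)
print(estimated_train_bytes, estimated_test_bytes)
```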
Data file configuration
- Default configuration:
  - Train split path: data/train-*
  - Test split path: data/test-*
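With the default configuration listed above, the two splits can be loaded by name. This is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper `load_split` and the constants are hypothetical names introduced for illustration.

```python
REPO_ID = "saillab/alpaca_bengali_taco"
SPLITS = ("train", "test")  # split names from the default configuration


def load_split(split: str = "train"):
    """Load one split of the dataset from the Hugging Face Hub.

    The split name is validated first so that a typo fails fast, before any
    network access takes place.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


# Example usage (requires network access):
#   train = load_split("train")
#   print(train.column_names)
```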



