saillab/alpaca_bambara_taco

Name: saillab/alpaca_bambara_taco
Creator: saillab
Published: 2024-09-20 22:08:12
License: no description available

Source: Hugging Face. Updated 2024-09-20; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/saillab/alpaca_bambara_taco


Description:

---
language:
- bm
pretty_name: Bambara alpaca-52k
size_categories:
- 100K<n<1M
---

This repository contains the dataset used for the TaCo paper. The dataset follows the style outlined in the TaCo paper, as follows:

```
{
  "instruction": "instruction in xx",
  "input": "input in xx",
  "output": "Instruction in English: instruction in en , Response in English: response in en , Response in xx: response in xx "
}
```

Please refer to the paper for more details: [OpenReview](https://openreview.net/forum?id=02MLWBj8HP)

If you have used our dataset, please cite it as follows:

**Citation**

```
@inproceedings{upadhayay2024taco,
  title={TaCo: Enhancing Cross-Lingual Transfer for Low-Resource Languages in {LLM}s through Translation-Assisted Chain-of-Thought Processes},
  author={Bibek Upadhayay and Vahid Behzadan},
  booktitle={5th Workshop on practical ML for limited/low resource settings, ICLR},
  year={2024},
  url={https://openreview.net/forum?id=02MLWBj8HP}
}
```

The original dataset [(Alpaca-52K)](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) was translated using Google Translate.

**Copyright and Intended Use**

This dataset has been released under CC BY-NC, intended for academic and research purposes only. Please review the licenses and terms and conditions of Alpaca-52K, Dolly-15K, and Google Cloud Translation before using this dataset for any purpose other than research.
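The `output` field packs three parts into one string, so downstream code usually needs to split it again. The following is a minimal sketch of one way to do that, assuming the labels appear exactly as in the template above ("Instruction in English:", "Response in English:", "Response in xx:"). The names `TacoOutput` and `parse_taco_output` are hypothetical helpers, not part of the dataset or the TaCo code, and the actual label used for the target language in this dataset may differ from "xx".

```python
import re
from typing import NamedTuple, Optional


class TacoOutput(NamedTuple):
    """The three parts of a TaCo-style `output` string (hypothetical helper type)."""
    instruction_en: str
    response_en: str
    response_target: str


# Assumption: the labels match the template in the dataset card.
# The third label accepts any language name after "Response in".
_PATTERN = re.compile(
    r"Instruction in English:\s*(?P<instruction_en>.*?)\s*,?\s*"
    r"Response in English:\s*(?P<response_en>.*?)\s*,?\s*"
    r"Response in (?P<lang>[^:\n]+):\s*(?P<response_target>.*)",
    re.DOTALL,
)


def parse_taco_output(output: str) -> Optional[TacoOutput]:
    """Split an `output` string into its parts, or return None if the labels are missing."""
    match = _PATTERN.search(output)
    if match is None:
        return None
    return TacoOutput(
        instruction_en=match["instruction_en"].strip(),
        response_en=match["response_en"].strip(),
        response_target=match["response_target"].strip(),
    )
```

A regex keyed on the labels is fragile if a response itself contains one of the label phrases, so real records that fail to parse are worth inspecting by hand.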


Provider:

saillab

Summary of original information

Dataset overview

Dataset features

instruction: string.
input: string.
output: string.
id: string.
text: string.
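Since all five features are plain strings, a record can be sanity-checked with a few lines of code. This is a small sketch only; `FIELDS` and `non_string_fields` are hypothetical names introduced for illustration.

```python
from typing import Any, List, Mapping

# Field names as listed in the dataset features; all are strings.
FIELDS = ("instruction", "input", "output", "id", "text")


def non_string_fields(record: Mapping[str, Any]) -> List[str]:
    """Return the names of listed fields that are missing or not strings."""
    return [name for name in FIELDS if not isinstance(record.get(name), str)]
```

An empty list means the record matches the listed schema.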

Dataset splits

Train: 49601 examples, total size 185277525.9847908 bytes.
Test: 12401 examples, total size 46322183.01520918 bytes.

Dataset size

Download size: 111522127 bytes.
Total dataset size: 231599709.0 bytes.
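The split figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative. It shows that the two split sizes sum to the stated total, that the split is about 80/20, and that both splits have the same average size per example. That last point suggests, as an inference rather than a documented fact, that the fractional byte counts were estimated proportionally from the total.

```python
# Figures copied from the listing above.
TRAIN_EXAMPLES = 49601
TEST_EXAMPLES = 12401
TRAIN_BYTES = 185277525.9847908
TEST_BYTES = 46322183.01520918
TOTAL_BYTES = 231599709.0
DOWNLOAD_BYTES = 111522127

total_examples = TRAIN_EXAMPLES + TEST_EXAMPLES
split_bytes_sum = TRAIN_BYTES + TEST_BYTES

# Share of examples in the training split.
train_fraction = TRAIN_EXAMPLES / total_examples

# Average size of one example in each split.
train_bytes_per_example = TRAIN_BYTES / TRAIN_EXAMPLES
test_bytes_per_example = TEST_BYTES / TEST_EXAMPLES

# Ratio of the download size to the total dataset size.
download_ratio = DOWNLOAD_BYTES / TOTAL_BYTES

print(f"total examples: {total_examples}")
print(f"train fraction: {train_fraction:.4f}")
print(f"bytes per example (train/test): "
      f"{train_bytes_per_example:.1f} / {test_bytes_per_example:.1f}")
print(f"download / total size: {download_ratio:.3f}")
```

The total of 62002 examples is below the 100K lower bound given in `size_categories`, which is worth keeping in mind when filtering datasets by that tag.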

Data file configuration

Config name: default
Train data path: data/train-*
Test data path: data/test-*
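The data paths are glob patterns that assign files in the repository to splits. The sketch below illustrates that mapping with the standard-library `fnmatch` module, and includes a loader that uses the Hugging Face `datasets` library. `split_for` and `load_splits` are hypothetical helper names, the file name in the usage note is made up for illustration, and `load_splits` needs the third-party `datasets` package plus network access.

```python
from fnmatch import fnmatch
from typing import Optional

REPO_ID = "saillab/alpaca_bambara_taco"

# Split name -> glob pattern, as listed in the "default" config.
DATA_FILES = {"train": "data/train-*", "test": "data/test-*"}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if no pattern matches."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_splits():
    """Load the default config from the Hugging Face Hub.

    Assumes `pip install datasets` and network access; not called here.
    """
    from datasets import load_dataset  # third-party import, kept local on purpose

    return load_dataset(REPO_ID)
```

For example, `split_for("data/train-00000-of-00001.parquet")` returns `"train"` for that hypothetical file name, while a path outside `data/` matches neither pattern.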
