Thaweewat/alpaca-cleaned-52k-th

Name: Thaweewat/alpaca-cleaned-52k-th
Creator: Thaweewat
Published: 2023-05-09 16:18:02
License: No description available

Hugging Face · Updated 2023-05-09 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Thaweewat/alpaca-cleaned-52k-th

Resource description:

---
license: cc-by-sa-3.0
task_categories:
- question-answering
- summarization
tags:
- instruction-finetuning
language:
- th
size_categories:
- 10K<n<100K
---

# Summary

This is a Thai 🇹🇭 instruction dataset, translated with Google Cloud Translation from the cleaned version of the original Alpaca Dataset released by Stanford. It contains 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.

The following issues have been identified in the original release and fixed in this dataset:

1. **Hallucinations:** Many instructions in the original dataset referenced data on the internet, which just caused GPT-3 to hallucinate an answer.
2. **Merged Instructions:** There were many instructions that were merged together in the original dataset for some reason.
3. **Empty outputs:** Some entries in the original dataset had empty outputs.
4. **Empty code examples:** Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.
5. **Instructions to generate images:** Some descriptions in the original dataset included instructions to generate images, something obviously not possible.
6. **N/A outputs:** Some code snippets in the original dataset had N/A outputs.
7. **Inconsistent input field:** The original dataset had inconsistent usage of the input field when it was supposed to be empty.
8. **Wrong answers:** Some instructions/questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.
9. **Non-Sensical/Unclear instructions:** Many instructions are unclear; we try to clarify (or rewrite) those that are non-sensical. Instructions that are slightly unclear, but whose meaning can be deduced, are not altered.
10. **Extraneous escape and control characters:** The original dataset had several entries with extraneous escape and control characters.

### Original Alpaca Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.

The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:

- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly gave the requirement of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, the authors also found the 52K generated data to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).

Supported Tasks:

- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: Thai

Version: 1.0

---

Provider:

Thaweewat

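Summary of original information

Dataset overview

Basic information

License: cc-by-sa-3.0
Task categories:
- Question answering
- Summarization
Tags: instruction fine-tuning
Language: Thai
Dataset size: 10K<n<100K

Dataset content

Source: The cleaned version of the original Alpaca dataset released by Stanford University, translated into Thai with Google Cloud Translation.
Generation: 52,000 instructions and demonstrations generated with OpenAI's text-davinci-003 engine.
Use: Instruction fine-tuning of language models, so that they follow instructions better.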
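
As a rough illustration of the instruction-tuning use described above, the sketch below renders one record as training text. It assumes the Alpaca-style fields `instruction`, `input`, and `output`; the template wording and the helper names `build_prompt` and `build_training_text` are hypothetical, and the sample record is made up for the example.

```python
from typing import Dict

# Assumed record schema: "instruction", "input", "output" (Alpaca-style).
# The template wording is illustrative and is not taken from the dataset card.
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(record: Dict[str, str]) -> str:
    """Render the prompt part of one record, omitting the input block when it is empty."""
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    if context:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    return PROMPT_NO_INPUT.format(instruction=instruction)


def build_training_text(record: Dict[str, str]) -> str:
    """Concatenate the prompt with the expected output to form one training example."""
    return build_prompt(record) + record["output"].strip()


# Hypothetical record, for illustration only.
sample = {
    "instruction": "แปลคำต่อไปนี้เป็นภาษาอังกฤษ",
    "input": "สวัสดี",
    "output": "Hello",
}
print(build_training_text(sample))
```

Keeping the empty-input case on a separate template avoids emitting a blank `### Input:` block, which matters here because the card notes that the input field is often meant to be empty.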

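Dataset improvements

Issues fixed:
- Hallucination issues in the original dataset have been resolved.
- Merged instructions have been separated.
- Empty outputs have been filled in.
- Descriptions that were missing code examples have been completed.
- Infeasible image-generation instructions have been removed.
- N/A outputs have been handled.
- Inconsistent use of the input field has been corrected.
- Wrong answers have been corrected.
- Non-sensical or unclear instructions have been clarified or rewritten.
- Extraneous escape and control characters have been removed.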
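
Some of these checks can be approximated mechanically. The following is a minimal sketch, not the procedure actually used to clean this dataset: it flags three of the listed problems (empty outputs, N/A outputs, stray control characters), assuming Alpaca-style `instruction`, `input`, and `output` fields. The function name `find_issues` and the set of N/A spellings are assumptions.

```python
import re
from typing import Dict, List

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Assumed spellings of a placeholder output; extend as needed.
NA_OUTPUTS = {"n/a"}


def find_issues(record: Dict[str, str]) -> List[str]:
    """Return labels for the mechanical problems detected in one record."""
    issues: List[str] = []

    output = (record.get("output") or "").strip()
    if not output:
        issues.append("empty_output")
    elif output.lower() in NA_OUTPUTS:
        issues.append("na_output")

    # Scan all three fields for stray control characters.
    text = " ".join(
        str(record.get(key) or "") for key in ("instruction", "input", "output")
    )
    if CONTROL_CHARS.search(text):
        issues.append("control_chars")

    return issues


# Hypothetical records, for illustration only.
print(find_issues({"instruction": "Add 2 and 2", "input": "", "output": "N/A"}))
print(find_issues({"instruction": "Add 2 and 2", "input": "", "output": "4"}))
```

Problems such as hallucinated or wrong answers cannot be caught by pattern checks like these and need manual or model-assisted review.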

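Dataset generation details

Generation engine: OpenAI's text-davinci-003
Generation strategy: A new prompt and more aggressive batch decoding (20 instructions per request), a simplified data generation pipeline, and a single instance per instruction instead of 2 to 3.
Cost: Generating the 52K dataset cost less than $500.
Diversity: More diverse than the data released by Self-Instruct.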
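
To put these figures in proportion, here is a short back-of-the-envelope calculation. It assumes every request returns 20 usable instructions and takes $500 as an upper bound on total cost; the actual request count and spend are not stated in the source.

```python
TOTAL_EXAMPLES = 52_000      # dataset size stated in the card
BATCH_SIZE = 20              # instructions generated per request
COST_CEILING_USD = 500.0     # "less than $500", used here as an upper bound

# Minimum number of generation requests if no generated instruction is discarded.
min_requests = TOTAL_EXAMPLES // BATCH_SIZE

# Upper bound on the average cost of one example.
max_cost_per_example = COST_CEILING_USD / TOTAL_EXAMPLES

print(min_requests)                    # 2600
print(round(max_cost_per_example, 4))  # 0.0096
```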

Supported tasks

Training LLMs
Synthetic data generation
Data augmentation

Version

Version: 1.0
