Thaweewat/alpaca-cleaned-52k-th
Source: Hugging Face. Updated 2023-05-09; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Thaweewat/alpaca-cleaned-52k-th
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
- summarization
tags:
- instruction-finetuning
language:
- th
size_categories:
- 10K<n<100K
---
# Summary
This is a Thai 🇹🇭 instruction dataset, translated with Google Cloud Translation from a cleaned version of the original Alpaca dataset released by Stanford. It contains 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine.
This instruction data can be used to instruction-tune language models so that they follow instructions better.
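
As a rough illustration of how such records might be prepared for instruction-tuning, the sketch below joins one record into a single training string. The field names (`instruction`, `input`, `output`), the helper `build_training_text`, the template wording (modeled on the prompt format commonly used with Alpaca-style data), and the sample record are all assumptions for illustration and are not specified in this card.

```python
# Assumed Alpaca-style templates; the exact wording is not taken from this card.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_training_text(record: dict) -> str:
    """Join one record into a single prompt-plus-answer training string."""
    extra_input = (record.get("input") or "").strip()
    # Pick the template depending on whether the optional input is present.
    template = PROMPT_WITH_INPUT if extra_input else PROMPT_NO_INPUT
    # str.format ignores unused keyword arguments, so passing `input` is safe.
    prompt = template.format(instruction=record["instruction"], input=extra_input)
    return prompt + record["output"]


# Made-up record for illustration only (not copied from the dataset).
example = {"instruction": "แปลเป็นภาษาอังกฤษ", "input": "สวัสดี", "output": "Hello"}
print(build_training_text(example))
```

Treating a blank `input` as "no input" keeps the prompt free of an empty `### Input:` section, which matters because the card lists inconsistent input fields among the issues in the original release.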
The following issues have been identified in the original release and fixed in this dataset:
1. **Hallucinations:** Many instructions in the original dataset referenced data on the internet, which just caused GPT-3 to hallucinate an answer.
2. **Merged Instructions:** There were many instructions that were merged together in the original dataset for some reason.
3. **Empty outputs:** Some entries in the original dataset had empty outputs.
4. **Empty code examples:** Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.
5. **Instructions to generate images:** Some descriptions in the original dataset included instructions to generate images, something obviously not possible.
6. **N/A outputs:** Some code snippets in the original dataset had N/A outputs.
7. **Inconsistent input field:** The original dataset had inconsistent usage of the input field when it was supposed to be empty.
8. **Wrong answers:** Some instructions/questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.
9. **Nonsensical/Unclear instructions:** Many instructions are unclear; nonsensical ones are clarified or rewritten. Instructions that are slightly unclear, but whose meaning can be deduced, are not altered.
10. **Extraneous escape and control characters:** The original dataset had several entries with extraneous escape and control characters.
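
A few of the issues above (empty outputs, N/A outputs, placeholder input values, control characters) can be detected mechanically. The sketch below is one possible check, not the procedure actually used to clean this dataset; the field names, the function `find_issues`, the placeholder sets, and the sample records are assumptions for illustration.

```python
import re

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Hypothetical placeholder values; the real variants are not listed in this card.
NA_OUTPUTS = {"n/a", "na"}
EMPTY_INPUT_MARKERS = {"<no input>", "no input", "noinput"}


def find_issues(record: dict) -> list:
    """Return labels for the mechanical issues found in one record."""
    issues = []
    output = (record.get("output") or "").strip()
    if not output:
        issues.append("empty output")
    elif output.lower() in NA_OUTPUTS:
        issues.append("N/A output")
    fields = ("instruction", "input", "output")
    if any(CONTROL_CHARS.search(record.get(name) or "") for name in fields):
        issues.append("control characters")
    if (record.get("input") or "").strip().lower() in EMPTY_INPUT_MARKERS:
        issues.append("placeholder input")
    return issues


# Made-up records for illustration only.
records = [
    {"instruction": "Add 2 and 3.", "input": "", "output": "5"},
    {"instruction": "Describe the picture.", "input": "<no input>", "output": ""},
    {"instruction": "Say hi\x07", "input": "", "output": "N/A"},
]
for record in records:
    print(find_issues(record))
```

Issues such as hallucinations, merged instructions, or wrong answers need manual or model-assisted review and are outside what a check like this can catch.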
### Original Alpaca Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.
The authors built on the data generation pipeline from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct) and made the following modifications:
- The `text-davinci-003` engine was used to generate the instruction data instead of `davinci`.
- A [new prompt](https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt) was written that explicitly states the requirements of instruction generation to `text-davinci-003`.
- Much more aggressive batch decoding was used, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
- The data generation pipeline was simplified by discarding the difference between classification and non-classification instructions.
- Only a single instance was generated for each instruction, instead of 2 to 3 instances as in Self-Instruct.
This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500).
In a preliminary study, the authors also found the 52K generated examples to be much more diverse than the data released by [Self-Instruct](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl).
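
The figures quoted above allow a quick back-of-the-envelope check. The sketch below uses only the numbers stated in the text (52K examples, under $500, 20 instructions per batch) and makes the simplifying assumption that every generated instruction is kept, so the request count is an idealized lower bound, not a reported figure.

```python
import math

TOTAL_EXAMPLES = 52_000          # dataset size stated in the text
COST_CEILING_USD = 500.0         # "less than $500"
INSTRUCTIONS_PER_REQUEST = 20    # "generating 20 instructions at once"

# Upper bound on the average cost of one example.
cost_per_example_ceiling = COST_CEILING_USD / TOTAL_EXAMPLES

# Idealized lower bound on requests, assuming no generated instruction is discarded.
min_requests = math.ceil(TOTAL_EXAMPLES / INSTRUCTIONS_PER_REQUEST)

print(f"cost per example: under ${cost_per_example_ceiling:.4f}")  # under $0.0096
print(f"requests needed: at least {min_requests}")                 # at least 2600
```

Under these assumptions each example costs less than one US cent on average.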
Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation
Languages: Thai
Version: 1.0
---
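Provider:
Thaweewat
Summary of original information
Dataset overview
Basic information
- License: cc-by-sa-3.0
- Task categories:
  - Question answering
  - Summarization
- Tags: instruction fine-tuning
- Language: Thai
- Dataset size: 10K<n<100K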
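Dataset content
- Source: the original Alpaca dataset released by Stanford, cleaned and translated into Thai with Google Cloud Translation.
- Generation method: 52,000 instructions and demonstrations generated with OpenAI's `text-davinci-003` engine.
- Use: instruction fine-tuning of language models so that they follow instructions better.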
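Dataset improvements
- Issues fixed:
  - Hallucination issues in the original dataset were resolved.
  - Merged instructions were separated.
  - Empty outputs were filled in.
  - Descriptions missing code examples were completed.
  - Infeasible image-generation instructions were removed.
  - N/A outputs were handled.
  - Inconsistent use of the input field was corrected.
  - Wrong answers were corrected.
  - Nonsensical or unclear instructions were clarified or rewritten.
  - Extraneous escape and control characters were removed.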
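Dataset generation details
- Generation engine: OpenAI's `text-davinci-003`
- Generation strategy: a new prompt and more aggressive batch decoding (20 instructions per batch), a simplified data generation pipeline, and a single instance per instruction instead of 2 to 3.
- Cost: generating the 52K dataset cost less than $500.
- Diversity: more diverse than the data released by Self-Instruct.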
Supported tasks
- Training LLMs
- Synthetic data generation
- Data augmentation
Version
- Version: 1.0



